DOI: 10.1145/1242572.1242632
Article

An adaptive crawler for locating hidden-Web entry points

Published: 08 May 2007

Abstract

In this paper we describe new adaptive crawling strategies to efficiently locate the entry points to hidden-Web sources. The fact that hidden-Web sources are very sparsely distributed makes the problem of locating them especially challenging. We deal with this problem by using the contents of pages to focus the crawl on a topic; by prioritizing promising links within the topic; and by also following links that may not lead to immediate benefit. We propose a new framework whereby crawlers automatically learn patterns of promising links and adapt their focus as the crawl progresses, thus greatly reducing the amount of required manual setup and tuning. Our experiments over real Web pages in a representative set of domains indicate that online learning leads to significant gains in harvest rates: the adaptive crawlers retrieve up to three times as many forms as crawlers that use a fixed focus strategy.
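
The idea of learning patterns of promising links while the crawl is running can be illustrated with a small sketch. This is not the crawler described in the paper: the names `LinkScorer` and `crawl`, the token-frequency scoring rule, and the in-memory `web` dictionary are all assumptions made for illustration. It also leaves out the page-content topic filter and the handling of links with delayed benefit, and it never re-scores links already in the frontier.

```python
import heapq
import re
from collections import Counter


def tokens(text):
    """Lowercase alphanumeric tokens from a URL or anchor text."""
    return re.findall(r"[a-z0-9]+", text.lower())


class LinkScorer:
    """Hypothetical online learner: per-token rate of links that led to a form."""

    def __init__(self):
        self.hits = Counter()  # token -> followed links with it that reached a form
        self.seen = Counter()  # token -> followed links with it, any outcome

    def update(self, link_text, led_to_form):
        for tok in set(tokens(link_text)):
            self.seen[tok] += 1
            if led_to_form:
                self.hits[tok] += 1

    def score(self, link_text):
        toks = set(tokens(link_text))
        if not toks:
            return 0.5
        # Laplace-smoothed success rate per token; unseen tokens contribute 0.5.
        rates = [(self.hits[t] + 1) / (self.seen[t] + 2) for t in toks]
        return sum(rates) / len(rates)


def crawl(web, seed, budget):
    """Best-first crawl over a toy web.

    web maps url -> (has_form, [(anchor_text, target_url), ...]).
    Returns the URLs of visited pages that contain a form.
    """
    scorer = LinkScorer()
    frontier = [(-0.5, seed, "")]  # (negated score, url, text of the link followed)
    visited, forms = set(), []
    while frontier and len(visited) < budget:
        _, url, link_text = heapq.heappop(frontier)
        if url in visited or url not in web:
            continue
        visited.add(url)
        has_form, links = web[url]
        # Feedback step: the outcome of this visit adjusts future link scores.
        scorer.update(link_text, has_form)
        if has_form:
            forms.append(url)
        for anchor, target in links:
            if target not in visited:
                text = anchor + " " + target
                heapq.heappush(frontier, (-scorer.score(text), target, text))
    return forms


if __name__ == "__main__":
    toy_web = {
        "s": (False, [("search flights", "a/search"), ("about us", "a/about")]),
        "a/search": (True, []),
        "a/about": (False, []),
    }
    print(crawl(toy_web, "s", budget=3))
```

A fixed-focus crawler would correspond to a `score` function that never changes. Here each visited page feeds back into the scorer, so links pushed later in the crawl are ranked using what earlier visits revealed.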

      Published In

      WWW '07: Proceedings of the 16th international conference on World Wide Web
      May 2007
      1382 pages
      ISBN:9781595936547
      DOI:10.1145/1242572

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. HiddenWeb
      2. learning classifiers
      3. online learning
      4. web crawling strategies

      Conference

      WWW'07
      WWW'07: 16th International World Wide Web Conference
      May 8 - 12, 2007
      Banff, Alberta, Canada

      Acceptance Rates

      Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
