An adaptive crawler for locating hidden-Web entry points

Authors:

Luciano Barbosa,

Juliana Freire

WWW '07: Proceedings of the 16th international conference on World Wide Web

Pages 441 - 450

https://doi.org/10.1145/1242572.1242632

Published: 08 May 2007

Abstract

In this paper we describe new adaptive crawling strategies to efficiently locate the entry points to hidden-Web sources. The fact that hidden-Web sources are very sparsely distributed makes the problem of locating them especially challenging. We deal with this problem by using the contents of pages to focus the crawl on a topic; by prioritizing promising links within the topic; and by also following links that may not lead to immediate benefit. We propose a new framework whereby crawlers automatically learn patterns of promising links and adapt their focus as the crawl progresses, thus greatly reducing the amount of required manual setup and tuning. Our experiments over real Web pages in a representative set of domains indicate that online learning leads to significant gains in harvest rates: the adaptive crawlers retrieve up to three times as many forms as crawlers that use a fixed focus strategy.
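The idea of learning link patterns online and re-prioritizing the crawl can be illustrated with a minimal sketch. It is not the authors' implementation: the class `AdaptiveFrontier`, the helpers `link_features` and `harvest_rate`, and the smoothed-frequency scoring rule are all hypothetical choices made for illustration, and harvest rate is assumed here to mean forms found per page crawled.

```python
import heapq
import re
from collections import defaultdict


def link_features(url, anchor_text):
    """Tokenize the URL and anchor text into lowercase word features."""
    return set(re.findall(r"[a-z]+", (url + " " + anchor_text).lower()))


def harvest_rate(forms_found, pages_crawled):
    """Assumed definition: forms retrieved per page crawled."""
    return forms_found / pages_crawled if pages_crawled else 0.0


class AdaptiveFrontier:
    """Priority queue of links whose scores are re-learned from crawl feedback."""

    def __init__(self):
        self.pos = defaultdict(int)  # feature counts on links that led to a form
        self.neg = defaultdict(int)  # feature counts on links that did not
        self.heap = []               # entries: (-score, insertion order, url, features)
        self.counter = 0

    def score(self, features):
        """Mean of smoothed per-feature success frequencies (illustrative rule)."""
        if not features:
            return 0.5
        probs = [(self.pos[f] + 1) / (self.pos[f] + self.neg[f] + 2) for f in features]
        return sum(probs) / len(probs)

    def push(self, url, anchor_text):
        feats = link_features(url, anchor_text)
        heapq.heappush(self.heap, (-self.score(feats), self.counter, url, feats))
        self.counter += 1

    def pop(self):
        """Return the most promising pending link and its features."""
        _, _, url, feats = heapq.heappop(self.heap)
        return url, feats

    def feedback(self, features, found_form):
        """Record a crawl outcome, then re-rank every pending link."""
        table = self.pos if found_form else self.neg
        for f in features:
            table[f] += 1
        self.heap = [(-self.score(fs), n, u, fs) for _, n, u, fs in self.heap]
        heapq.heapify(self.heap)


# Usage: after one positive and one negative outcome, a link sharing the
# word "search" with the positive example is ranked ahead of the other.
frontier = AdaptiveFrontier()
frontier.feedback(link_features("http://example.com/search", "advanced search"), True)
frontier.feedback(link_features("http://example.com/about", "about us"), False)
frontier.push("http://example.com/contact", "contact us")
frontier.push("http://example.org/search/cars", "search cars")
print(frontier.pop()[0])  # http://example.org/search/cars
```

A fixed-focus crawler would correspond to never calling `feedback`, so every link keeps its initial score; the adaptive variant shifts priority toward link patterns that have paid off so far.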

Index Terms

An adaptive crawler for locating hidden-Web entry points
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
    2. Retrieval tasks and goals

Published In

WWW '07: Proceedings of the 16th international conference on World Wide Web

May 2007

1382 pages

ISBN:9781595936547

DOI:10.1145/1242572

General Chairs: Carey Williamson (University of Calgary, Canada), Mary Ellen Zurko (IBM, USA)

Program Chairs: Peter Patel-Schneider (Bell Labs Research, USA), Prashant Shenoy (University of Massachusetts at Amherst, USA)

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

ACM: Association for Computing Machinery

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

WWW'07

Sponsor:

ACM

WWW'07: 16th International World Wide Web Conference

May 8 - 12, 2007

Banff, Alberta, Canada

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Bibliometrics

Total Citations: 94

Total Downloads: 1,039

Downloads (Last 12 months): 16

Downloads (Last 6 weeks): 1

Reflects downloads up to 05 Mar 2025
