
Crawling deep web entity pages

Published: 04 February 2013

Abstract

Deep-web crawl is concerned with the problem of surfacing hidden content behind search interfaces on the Web. While many deep-web sites maintain document-oriented textual content (e.g., Wikipedia, PubMed, Twitter), which has traditionally been the focus of the deep-web literature, we observe that a significant portion of deep-web sites, including almost all online shopping sites, curate structured entities as opposed to text documents. Although crawling such entity-oriented content is clearly useful for a variety of purposes, existing crawling techniques optimized for document-oriented content are not best suited for entity-oriented sites. In this work, we describe a prototype system we have built that specializes in crawling entity-oriented deep-web sites. We propose techniques tailored to tackle important subproblems, including query generation, empty-page filtering, and URL deduplication, in the specific context of entity-oriented deep-web sites. These techniques are experimentally evaluated and shown to be effective.



    Published In

    WSDM '13: Proceedings of the sixth ACM international conference on Web search and data mining
    February 2013
    816 pages
    ISBN:9781450318693
    DOI:10.1145/2433396

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. deep-web crawl
    2. entities
    3. web data

    Qualifiers

    • Research-article

    Conference

    WSDM 2013

    Acceptance Rates

    Overall Acceptance Rate 498 of 2,863 submissions, 17%
