research-article

Crawling deep web entity pages

Authors:

Venkatesh Ganti,

Sriram Rajaraman,

Nirav Shah

WSDM '13: Proceedings of the sixth ACM international conference on Web search and data mining

Pages 355 - 364

https://doi.org/10.1145/2433396.2433442

Published: 04 February 2013

Abstract

Deep-web crawling is concerned with surfacing hidden content behind search interfaces on the Web. While many deep-web sites maintain document-oriented textual content (e.g., Wikipedia, PubMed, Twitter), which has traditionally been the focus of the deep-web literature, we observe that a significant portion of deep-web sites, including almost all online shopping sites, curate structured entities as opposed to text documents. Although crawling such entity-oriented content is clearly useful for a variety of purposes, existing crawling techniques optimized for document-oriented content are not well suited to entity-oriented sites. In this work, we describe a prototype system we have built that specializes in crawling entity-oriented deep-web sites. We propose techniques tailored to important subproblems, including query generation, empty page filtering, and URL deduplication, in the specific context of entity-oriented deep-web sites. These techniques are experimentally evaluated and shown to be effective.
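To make the URL deduplication subproblem concrete, the following is a minimal sketch of one generic approach: canonicalize each result-page URL so that variants differing only in parameter order or in content-irrelevant parameters collapse to the same key. This is an illustration only and is not taken from the paper; the abstract does not describe the authors' technique. The parameter names in `IGNORED_PARAMS` and the `shop.example.com` URLs are hypothetical assumptions.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameter names assumed not to affect the page content.
IGNORED_PARAMS = {"sessionid", "ref", "utm_source"}


def canonicalize(url: str) -> str:
    """Return a canonical form of `url` for duplicate detection."""
    parts = urlsplit(url)
    # Keep only parameters assumed to influence the returned entities.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    # Sort so that parameter order does not produce distinct keys.
    params.sort()
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(params), "")
    )


def dedupe(urls):
    """Keep the first URL seen for each canonical form, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = canonicalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


if __name__ == "__main__":
    candidates = [
        "http://shop.example.com/search?cat=shoes&color=red&sessionid=1",
        "http://SHOP.example.com/search?color=red&cat=shoes&sessionid=2",
        "http://shop.example.com/search?cat=hats",
    ]
    for url in dedupe(candidates):
        print(url)
```

In this sketch the first two candidate URLs map to the same canonical key, so only one of them would be crawled. A real crawler would have to learn which parameters are irrelevant for each site rather than hard-coding them.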

Index Terms

Crawling deep web entity pages
1. Information systems
  1. Information systems applications
    1. Data mining

Published In

WSDM '13: Proceedings of the sixth ACM international conference on Web search and data mining

February 2013

816 pages

ISBN:9781450318693

DOI:10.1145/2433396

General Chairs: Stefano Leonardi (Sapienza University of Rome, Italy), Alessandro Panconesi (Sapienza University of Rome, Italy)

Program Chairs: Paolo Ferragina (University of Pisa, Italy), Aristides Gionis (Yahoo! Research, Barcelona, Spain)

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

WSDM 2013: Sixth ACM International Conference on Web Search and Data Mining

February 4 - 8, 2013

Rome, Italy

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%
