DOI: 10.1145/2837185.2837208

Towards complete coverage in focused web harvesting

Published: 11 December 2015

Abstract

With the goal of harvesting all information about a given entity, this paper addresses the problem of harvesting all documents that match a given query submitted to a search engine. The objective is to retrieve all information about, for instance, "Michael Jackson", "Islamic State", or "FC Barcelona" from data indexed by search engines, or from hidden data behind web forms, using a minimum number of queries. Web search engine policies usually do not allow access to all search results matching a given query: they limit both the number of returned documents and the number of user requests. The same limitations apply to deep web sources, for instance social networks like Twitter. In this work, we propose a new approach that automatically collects information related to a given query from a search engine, given the search engine's limitations. The approach minimizes the number of queries that need to be sent by analysing the retrieved results and combining this analysed information with information from a large external corpus. When tested on Google, the new approach outperforms existing approaches as measured by the total number of unique documents found per query.
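The abstract does not spell out how refinement queries are chosen, so the following is only a minimal sketch of the general idea, under stated assumptions: a toy in-memory search function that caps results per query, and a hypothetical scoring rule that prefers terms frequent in the retrieved results but rare in an external corpus. The names `make_search`, `pick_term`, `harvest`, and `corpus_df`, as well as the sample documents and frequencies, are illustrative and not taken from the paper.

```python
from collections import Counter

def make_search(docs, limit):
    """Toy search engine: returns at most `limit` (id, text) hits per query,
    mimicking the result cap described in the abstract."""
    def search(terms):
        hits = [(doc_id, text) for doc_id, text in sorted(docs.items())
                if all(t in text.split() for t in terms)]
        return hits[:limit]
    return search

def pick_term(seed, texts, corpus_df, used):
    """Hypothetical rule: pick the unused term with the highest ratio of
    document frequency in the results to frequency in the external corpus."""
    counts = Counter(w for text in texts for w in set(text.split()))
    candidates = [w for w in counts if w not in used and w not in seed]
    if not candidates:
        return None
    # Ties are broken alphabetically so the choice is deterministic.
    return max(candidates,
               key=lambda w: (counts[w] / (1 + corpus_df.get(w, 0)), w))

def harvest(seed, search, corpus_df, max_queries):
    """Send the seed query, then refine it one term at a time until the
    query budget is spent or no candidate terms remain."""
    found = dict(search(seed))
    used, queries = set(), 1
    while queries < max_queries:
        term = pick_term(seed, found.values(), corpus_df, used)
        if term is None:
            break
        used.add(term)
        found.update(search(seed + [term]))
        queries += 1
    return found, queries

# Illustrative data (assumed, not from the paper).
docs = {
    1: "barcelona football messi",
    2: "barcelona football stadium",
    3: "barcelona stadium tour",
    4: "barcelona tour city",
    5: "madrid football",
}
corpus_df = {"football": 9, "messi": 4, "stadium": 1, "tour": 1, "city": 3}

search = make_search(docs, limit=2)
found, queries = harvest(["barcelona"], search, corpus_df, max_queries=3)
print(sorted(found), queries)  # [1, 2, 3, 4] 3
```

In this toy setting the seed query alone returns only two of the four matching documents because of the result cap; two refinement queries recover the rest. A real harvester would also have to respect the request limits the abstract mentions.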

Cited By

  • (2016) Efficient web harvesting strategies for monitoring deep web content. Proceedings of the 18th International Conference on Information Integration and Web-based Applications and Services, pages 389-393. DOI: 10.1145/3011141.3011198. Online publication date: 28-Nov-2016.

    Published In

    iiWAS '15: Proceedings of the 17th International Conference on Information Integration and Web-based Applications & Services
    December 2015
    704 pages
    ISBN: 9781450334914
    DOI: 10.1145/2837185
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data coverage
    2. data extraction
    3. deep web
    4. web harvester
    5. web mining
    6. world wide web

    Qualifiers

    • Research-article

    Conference

    iiWAS '15
