research-article

Towards complete coverage in focused web harvesting

Authors:

Mohammadreza Khelghati,

Djoerd Hiemstra,

Maurice van Keulen

iiWAS '15: Proceedings of the 17th International Conference on Information Integration and Web-based Applications & Services

Article No.: 65, Pages 1 - 9

https://doi.org/10.1145/2837185.2837208

Published: 11 December 2015

Abstract

To harvest all information about a given entity, this paper aims to retrieve all documents matching a given query submitted to a search engine. The objective is to retrieve all information about, for instance, "Michael Jackson", "Islamic State", or "FC Barcelona" from data indexed by search engines, or from hidden data behind web forms, using a minimum number of queries. The policies of web search engines usually do not allow access to all of the results matching a given query: they limit both the number of returned documents and the number of user requests. The same limitations apply to deep web sources, for instance social networks like Twitter. In this work, we propose a new approach that automatically collects information related to a given query from a search engine while respecting the search engine's limitations. The approach minimizes the number of queries that need to be sent by analysing the retrieved results and combining this analysis with information from a large external corpus. The new approach outperforms existing approaches when tested on Google, measured by the total number of unique documents found per query.
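The general idea of harvesting under a result limit can be illustrated with a small sketch. It is not the authors' method: the abstract does not say how follow-up queries are chosen, so the scoring rule below (prefer terms that are frequent in an external corpus but rare in the documents retrieved so far) is an assumption made purely for illustration. The names `limited_search`, `harvest`, `toy_index`, and `toy_corpus_freq` are hypothetical, and the in-memory index stands in for a real search engine.

```python
from collections import Counter


def limited_search(index: dict[int, str], terms: list[str], limit: int) -> list[int]:
    """Toy stand-in for a search engine: at most `limit` ids of docs containing all terms."""
    hits = [doc_id for doc_id, text in sorted(index.items())
            if all(t in text.split() for t in terms)]
    return hits[:limit]


def harvest(index: dict[int, str], seed: str, corpus_freq: dict[str, int],
            limit: int, max_queries: int) -> tuple[set[int], int]:
    """Collect documents matching `seed` by sending refined queries `seed + term`."""
    harvested = set(limited_search(index, [seed], limit))
    queries_sent = 1
    tried = {seed}
    while queries_sent < max_queries:
        # Document frequency of each untried term in what has been retrieved so far.
        df = Counter()
        for doc_id in harvested:
            df.update(set(index[doc_id].split()) - tried)
        candidates = sorted(set(corpus_freq) - tried)
        if not candidates:
            break
        # Assumed heuristic: frequent in the external corpus, rare in retrieved docs.
        term = max(candidates, key=lambda t: corpus_freq[t] / (1 + df[t]))
        tried.add(term)
        harvested.update(limited_search(index, [seed, term], limit))
        queries_sent += 1
    return harvested, queries_sent


# Toy data: five documents match the seed, but each query returns at most two.
toy_index = {
    1: "jackson thriller album",
    2: "jackson thriller video",
    3: "jackson moonwalk dance",
    4: "jackson neverland ranch",
    5: "jackson tour concert",
    6: "barcelona football club",
}
toy_corpus_freq = {"thriller": 50, "album": 40, "video": 60, "moonwalk": 10,
                   "dance": 70, "neverland": 5, "ranch": 20, "tour": 30, "concert": 45}

found, sent = harvest(toy_index, "jackson", toy_corpus_freq, limit=2, max_queries=6)
print(sorted(found), sent)
```

In this toy setting the seed query alone can never return more than two documents, while the refined queries reach matching documents that the limit would otherwise hide. A real harvester would also have to respect the request quota, which `max_queries` only approximates.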

Cited By

Khelghati M, Hiemstra D, van Keulen M, Anderst-Kotsis G (2016). Efficient web harvesting strategies for monitoring deep web content. Proceedings of the 18th International Conference on Information Integration and Web-based Applications and Services, pages 389-393. doi:10.1145/3011141.3011198. Online publication date: 28-Nov-2016.
https://dl.acm.org/doi/10.1145/3011141.3011198

Index Terms

Towards complete coverage in focused web harvesting
1. Information systems
  1. Information systems applications



Published In


iiWAS '15: Proceedings of the 17th International Conference on Information Integration and Web-based Applications & Services

December 2015

704 pages

ISBN:9781450334914

DOI:10.1145/2837185

General Chair: Gabriele Anderst-Kotsis, Johannes Kepler University Linz, Austria

Program Chair: Maria Indrawan-Santiago, Monash University, Australia

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article

Conference

iiWAS '15

iiWAS '15: The 17th International Conference on Information Integration and Web-based Application & Services

December 11 - 13, 2015

Brussels, Belgium


