demonstration

Not so creepy crawler: easy crawler generation with standard XML queries

Authors: Franziska von dem Bussche, Klara Weiand, Benedikt Linse, Tim Furche, François Bry

WWW '10: Proceedings of the 19th international conference on World wide web

Pages 1305 - 1308

https://doi.org/10.1145/1772690.1772908

Published: 26 April 2010

Abstract

Web crawlers are increasingly used for focused tasks such as the extraction of data from Wikipedia or the analysis of social networks like last.fm. In these cases, pages are far more uniformly structured than in the general Web and thus crawlers can use the structure of Web pages for more precise data extraction and more expressive analysis.

In this demonstration, we present a focused, structure-based crawler generator, the "Not so Creepy Crawler" (nc2). What sets nc2 apart is that all analysis and decision tasks of the crawling process are delegated to an (arbitrary) XML query engine such as XQuery or Xcerpt. Customizing crawlers just means writing (declarative) XML queries that can access the currently crawled document as well as the metadata of the crawl process. We identify four types of queries that together suffice to realize a wide variety of focused crawlers.

We demonstrate nc2 with two applications. The first extracts data about cities from Wikipedia with a customizable set of attributes for selecting and reporting these cities. It illustrates the power of nc2 where data extraction from Wiki-style, fairly homogeneous knowledge sites is required. In contrast, the second use case demonstrates how easy nc2 makes even complex analysis tasks on social networking sites, here exemplified by last.fm.
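
The idea of delegating crawl decisions to declarative queries can be illustrated with a minimal sketch. It is not the nc2 implementation: the abstract does not name the four query types, so the three roles used here (`follow`, `extract`, `stop`), the in-memory `PAGES` site, and the `<crawl>/<meta>/<page>` wrapper are all assumptions made for illustration. The sketch also swaps in the limited XPath subset of Python's `xml.etree.ElementTree` in place of a full XML query engine such as XQuery or Xcerpt.

```python
import xml.etree.ElementTree as ET
from collections import deque

# Hypothetical in-memory "site" (URL -> well-formed page); stands in for HTTP fetches.
PAGES = {
    "/cities": '<html><body><a class="city" href="/munich">Munich</a>'
               '<a class="city" href="/raleigh">Raleigh</a>'
               '<a class="nav" href="/about">About</a></body></html>',
    "/munich": '<html><body><h1>Munich</h1>'
               '<span class="country">Germany</span></body></html>',
    "/raleigh": '<html><body><h1>Raleigh</h1>'
                '<span class="country">USA</span></body></html>',
    "/about": '<html><body><h1>About</h1></body></html>',
}

# Assumed query roles (not taken from the paper). Each query sees both the
# current page and the crawl metadata, because both live in one XML tree.
QUERIES = {
    "follow": ".//page//a[@class='city']",        # which links to enqueue
    "extract": ".//page//span[@class='country']",  # which nodes to report
    "stop": "./meta[@depth='1']",                  # when not to go deeper
}

def crawl(seed, pages=PAGES, queries=QUERIES):
    """Breadth-first crawl in which every decision is an XPath query."""
    results, seen, frontier = [], {seed}, deque([(seed, 0)])
    while frontier:
        url, depth = frontier.popleft()
        # Wrap the document and the crawl metadata in a single tree.
        root = ET.Element("crawl")
        ET.SubElement(root, "meta", url=url, depth=str(depth))
        page = ET.SubElement(root, "page")
        page.append(ET.fromstring(pages[url]))
        # Data extraction is delegated to the "extract" query.
        for node in root.findall(queries["extract"]):
            results.append((url, node.text))
        # The "stop" query inspects metadata; a match ends this branch.
        if root.find(queries["stop"]) is not None:
            continue
        # Link selection is delegated to the "follow" query.
        for link in root.findall(queries["follow"]):
            target = link.get("href")
            if target in pages and target not in seen:
                seen.add(target)
                frontier.append((target, depth + 1))
    return results

if __name__ == "__main__":
    for url, country in crawl("/cities"):
        print(url, country)
```

In this sketch, changing what the crawler does means editing only the strings in `QUERIES`; the loop itself contains no site-specific logic, which is the property the abstract attributes to query-driven crawler customization.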

Index Terms

1. Information systems
  1. Information retrieval

Published In

WWW '10: Proceedings of the 19th international conference on World wide web

April 2010

1407 pages

ISBN:9781605587998

DOI:10.1145/1772690

General Chairs: Michael Rappa (North Carolina State University, USA), Paul Jones (University of North Carolina at Chapel Hill, USA)

Program Chairs: Juliana Freire (University of Utah, USA), Soumen Chakrabarti (Indian Institute of Technology, India)

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Demonstration

Conference

WWW '10: The 19th International World Wide Web Conference

April 26 - 30, 2010

Raleigh, North Carolina, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
