Recrawl scheduling based on information longevity
Source
International World Wide Web Conference archive
Proceeding of the 17th international conference on World Wide Web
Beijing, China
SESSION: Search: crawlers
Pages 437-446  
Year of Publication: 2008
ISBN: 978-1-60558-085-2
Authors
Christopher Olston  Yahoo! Research, Santa Clara, CA, USA
Sandeep Pandey  Carnegie Mellon University, Pittsburgh, PA, USA
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1367497.1367557

ABSTRACT

It is crucial for a web crawler to distinguish between ephemeral and persistent content. Ephemeral content (e.g., quote of the day) is usually not worth crawling, because by the time it reaches the index it is no longer representative of the web page from which it was acquired. On the other hand, content that persists across multiple page updates (e.g., recent blog postings) may be worth acquiring, because it matches the page's true content for a sustained period of time.

In this paper we characterize the longevity of information found on the web, via both empirical measurements and a generative model that coincides with these measurements. We then develop new recrawl scheduling policies that take longevity into account. As we show via experiments over real web data, our policies obtain better freshness at lower cost, compared with previous approaches.
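
The distinction between ephemeral and persistent content can be illustrated with a minimal sketch. This is not the measurement procedure or scheduling policy from the paper; it assumes, hypothetically, that each crawl of a page has already been split into text fragments, and it simply counts how many consecutive snapshots each fragment survives. The function names, the `min_snapshots` threshold, and the sample data are all made up for illustration.

```python
from collections import defaultdict


def fragment_lifetimes(snapshots):
    """For each fragment, return its longest run of consecutive snapshots.

    `snapshots` is a list of crawls of one page, oldest first; each crawl
    is an iterable of fragment strings (hypothetical input format).
    """
    longest = defaultdict(int)
    current = defaultdict(int)
    for fragments in snapshots:
        present = set(fragments)
        # A fragment missing from this snapshot has its current run reset.
        for frag in list(current):
            if frag not in present:
                current[frag] = 0
        # Every fragment seen in this snapshot extends its current run.
        for frag in present:
            current[frag] += 1
            longest[frag] = max(longest[frag], current[frag])
    return dict(longest)


def persistent_fragments(snapshots, min_snapshots=2):
    """Fragments that survive at least `min_snapshots` consecutive crawls."""
    lifetimes = fragment_lifetimes(snapshots)
    return {frag for frag, run in lifetimes.items() if run >= min_snapshots}


# Illustrative data: a daily quote that changes every crawl, and blog
# postings that stay on the page once they appear.
snapshots = [
    ["quote: A", "post: hello world"],
    ["quote: B", "post: hello world", "post: second entry"],
    ["quote: C", "post: hello world", "post: second entry"],
]
lifetimes = fragment_lifetimes(snapshots)
persistent = persistent_fragments(snapshots)
```

In this toy data each quote lasts a single snapshot, while the blog postings last two or three, so only the postings count as persistent under the assumed threshold. A longevity-aware crawler would, in the spirit of the abstract, give more weight to pages whose changes consist of such persistent fragments.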

