ACM Home Page
Please provide us with feedback. Feedback
Efficient URL caching for world wide web crawling
Full text PdfPdf (174 KB)
Source International World Wide Web Conference archive
Proceedings of the 12th international conference on World Wide Web table of contents
Budapest, Hungary
SESSION: Web crawling and measurement table of contents
Pages: 679 - 689  
Year of Publication: 2003
ISBN:1-58113-680-3
Authors
Andrei Z. Broder  IBM TJ Watson Research Center, Hawthorne, NY
Marc Najork  Microsoft Research, Mountain View, CA
Janet L. Wiener  Hewlett Packard Labs, Palo Alto, CA
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 17,   Downloads (12 Months): 149,   Citation Count: 8
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues   peer to peer  

Tools and Actions: Review this Article  
Save this Article to a Binder    Display Formats: BibTex  EndNote ACM Ref   
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/775152.775247
What is a DOI?

ABSTRACT

Crawling the web is deceptively simple: the basic algorithm is (a) Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)-(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33 day web crawl that issued over one billion HTTP requests.Our main conclusion is that caching is very effective - in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
L. A. Belady. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 5(2):78--101, 1966.
 
3
 
4
P. Boldi, B. Codenotti, M. Santini, and S. Vigna. Trovatore: Towards a highly scalable distributed web crawler. In Poster Proceedings of the 10th International World Wide Web Conference, 2001. http://www10.org/cdrom/posters/1033.pdf.
 
5
P. Boldi, B. Codenotti, M. Santini, and S. Vigna. Ubicrawler: A scalable fully distributed web crawler. In Proceedings of the 8th Australian World Wide Web Conference, July 2002. http://ausweb.scu.edu.au/aw02/papers/refereed/vigna/paper.html.
 
6
 
7
 
8
A. Z. Broder. Some applications of Rabin's fingerprinting method. In R. Capocelli, A. De Santis, and U. Vaccaro, editors, Sequences II: Methods in Communications, Security, and Computer Science, pages 143--152. Springer-Verlag, 1993.
 
9
 
10
M. Burner. Crawling towards eternity: Building an archive of the world wide web. Web Techniques Magazine, 2(5), May 1997.
 
11
 
12
13
 
14
 
15
F. J. Corbato. A Paging Experiment with the Multics System. Project MAC Memo MAC-M-384, Massachusetts Institute of Technology, 1968.
16
17
 
18
 
19
 
20
Google. http://www.google.com.
 
21
 
22
 
23
P. R. Jelenkovic. Asymptotic approximation of the move-to-front search cost distribution and least-recently used caching fault probabilities. Ann. Appl. Prob., 9(2):430--464, 1999. Available from http://comet.ctr.columbia.edu/epredrag/mypub/mtfRevised.ps.
 
24
 
25
 
26
B. Krishnamurthy and J. Rexford. Web Protocols and Practice. HTTP/1.1: Networking Protocols, Caching, and Traffic Measurement. Addison-Wesley, May 2001.
 
27
S. Lawrence and C. L. Giles. Searching the World Wide Web. Science, 280(5360):98--100, 1998.
 
28
S. Lawrence and C. L. Giles. Accessibility of information on the web. Nature, 400:107--109, 1999.
 
29
M. Najork and A. Heydon. High-performance web crawling. SRC Research Report 173, Compaq Systems Research Center, Palo Alto, CA, Sept. 2001.
30
 
31
Pew Internet and American Life Project Survey. Search engines: A Pew Internet project data memo, July 2002. http://www.pewinternet.org/reports/toc.asp? Report = 64.
 
32
M. O. Rabin. Fingerprinting by random polynomials. Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.
 
33
 
34
 
35
 
36
T. Suel. Personal communication. Jan. 2003.
 
37

CITED BY  8
 
 
 

Collaborative Colleagues:
Andrei Z. Broder: colleagues
Marc Najork: colleagues
Janet L. Wiener: colleagues

Peer to Peer - Readers of this Article have also read: