Do not crawl in the DUST: different URLs with similar text
Source: International World Wide Web Conference archive
Proceedings of the 15th international conference on World Wide Web
Edinburgh, Scotland
POSTER SESSION: Browsers and UI, web engineering, hypermedia & multimedia, security, and accessibility
Pages: 1015-1016
Year of Publication: 2006
ISBN: 1-59593-323-9
Authors
Uri Schonfeld  Technion, Haifa, Israel
Ziv Bar-Yossef  Technion, Haifa, Israel
Idit Keidar  Technion, Haifa, Israel
Sponsors
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1135777.1135992

ABSTRACT

We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, translates URLs to some canonical form, and dynamically generates the same page from various different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules for transforming a given URL to others that are likely to have similar content. DustBuster is able to detect DUST effectively from previous crawl logs or web server logs, without examining page contents. Verifying these rules via sampling requires fetching only a few actual web pages. Search engines can benefit from this information to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.
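
To make the idea of a URL transformation rule concrete, the following is a minimal sketch of how a crawler might use such rules once they are known. It is not the DustBuster algorithm: the abstract does not specify the form of the rules, so the sketch assumes they are simple substring substitutions. The rule list, the example.com URLs, and the function names are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# ASSUMPTION: a rule is a (old_substring, new_substring) pair. The abstract
# does not state the rule format; this is an illustrative choice.
Rule = Tuple[str, str]

# Hypothetical rules for a hypothetical site.
RULES: List[Rule] = [
    ("/index.html", "/"),                               # default-page alias
    ("http://www.example.com", "http://example.com"),   # host alias
]


def apply_rule(url: str, rule: Rule) -> str:
    """Replace the first occurrence of the rule's old substring, if any."""
    old, new = rule
    return url.replace(old, new, 1)


def canonicalize(url: str, rules: List[Rule]) -> str:
    """Apply every rule once, in order, to obtain a representative URL."""
    for rule in rules:
        url = apply_rule(url, rule)
    return url


def group_by_canonical(urls: List[str], rules: List[Rule]) -> Dict[str, List[str]]:
    """Group URLs that the rules map to the same representative URL."""
    groups: Dict[str, List[str]] = {}
    for url in urls:
        groups.setdefault(canonicalize(url, rules), []).append(url)
    return groups


if __name__ == "__main__":
    urls = [
        "http://example.com/news/",
        "http://www.example.com/news/index.html",
        "http://example.com/about",
    ]
    for canonical, members in group_by_canonical(urls, RULES).items():
        print(canonical, "<-", members)
```

Under these assumptions, a crawler would fetch only one URL per group. Because a rule only indicates that two URLs are likely to have similar content, a real system would still confirm each rule by fetching a small sample of page pairs, as the abstract notes.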


