ABSTRACT
A large fraction of the URLs on the web contain duplicate (or near-duplicate) content. De-duping URLs is an extremely important problem for search engines, since all the principal functions of a search engine, including crawling, indexing, ranking, and presentation, are adversely impacted by the presence of duplicate URLs. Traditionally, the de-duping problem has been addressed by fetching and examining the content of the URL; our approach here is different. Given a set of URLs partitioned into equivalence classes based on the content (URLs in the same equivalence class have similar content), we address the problem of mining this set and learning URL rewrite rules that transform all URLs of an equivalence class to the same canonical form. These rewrite rules can then be applied to eliminate duplicates among URLs that are encountered for the first time during crawling, even without fetching their content.
In order to express such transformation rules, we propose a simple framework that is general enough to capture the most common URL rewrite patterns occurring on the web; in particular, it encapsulates the DUST (Different URLs with Similar Text) framework [5]. We provide an efficient algorithm for mining and learning URL rewrite rules and show that, under mild assumptions, it is complete, i.e., our algorithm learns every URL rewrite rule that is correct, for an appropriate notion of correctness. We demonstrate the expressiveness of our framework and the effectiveness of our algorithm through an extensive set of large-scale experiments.
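To make the idea concrete, the following is a minimal sketch (not the paper's algorithm) of what applying a learned rewrite rule might look like. The rule here is hypothetical: it strips a `sessionid` query parameter, a common source of DUST, so that syntactically different URLs from the same equivalence class collapse to one canonical form without fetching any content.

```python
import re

# Hypothetical rewrite rules: each is a regex pattern plus a replacement.
# The first strips a "sessionid" query parameter; the second tidies any
# separator character the strip leaves dangling at the end of the URL.
RULES = [
    (re.compile(r"([?&])sessionid=[^&]*&?"), r"\1"),
    (re.compile(r"[?&]$"), ""),
]

def canonicalize(url: str) -> str:
    """Apply each rewrite rule in order, yielding a canonical form."""
    for pattern, replacement in RULES:
        url = pattern.sub(replacement, url)
    return url

# Two different URLs from the same equivalence class map to the
# same canonical URL, so the duplicate is detected before crawling.
a = canonicalize("http://example.com/story?id=7&sessionid=abc123")
b = canonicalize("http://example.com/story?sessionid=xyz789&id=7")
# a == b == "http://example.com/story?id=7"
```

In the paper's setting, rules of this kind are not hand-written but mined from URLs already partitioned into equivalence classes by content.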
REFERENCES
[1] R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In Proc. 28th VLDB, pages 586--597, 2002.
[2] D. Angluin. Finding patterns common to a set of strings (extended abstract). In Proc. 11th STOC, pages 130--141, 1979.
[3] D. Angluin. Inference of reversible languages. J. ACM, 29(3):741--765, 1982.
[4] D. Angluin and C. H. Smith. Inductive inference: Theory and methods. ACM Comput. Surv., 15(3):237--269, 1983.
[5] Z. Bar-Yossef, I. Keidar, and U. Schonfeld. Do not crawl in the DUST: Different URLs with similar text. In Proc. 16th WWW, pages 111--120, 2007.
[6] M. Bognar. A survey of abstract rewriting, 1995. www.di.ubi.pt/~desousa/1998-1999/logica/mb.ps.
[7] A. Broder. On the resemblance and containment of documents. In Proc. Compression and Complexity of Sequences (SEQUENCES '97), 1997.
[8] A. Broder, S. C. Glassman, M. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks, 29(8-13):1157--1166, 1997.
[9] M. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. 34th STOC, pages 380--388, 2002.
[10] S. Chaudhuri, V. Ganti, and R. Motwani. Robust identification of fuzzy duplicates. In Proc. 21st ICDE, pages 865--876, 2005.
[11] Z. Chen, D. V. Kalashnikov, and S. Mehrotra. Adaptive graphical approach to entity resolution. In Proc. ACM/IEEE Joint Conference on Digital Libraries, pages 204--213, 2007.
[12] D. Fetterly, M. Manasse, and M. Najork. On the evolution of clusters of near-duplicate web pages. In Proc. 1st Latin American Web Congress, page 37, 2003.
[13] H. Garcia-Molina. Pair-wise entity resolution: Overview and challenges. In Proc. CIKM, 2006.
[14] M. Henzinger. Finding near-duplicate web pages: A large-scale evaluation of algorithms. In Proc. 29th SIGIR, pages 284--291, 2006.
[15] G. S. Manku, A. Jain, and A. D. Sarma. Detecting near-duplicates for web crawling. In Proc. 16th WWW, pages 141--150, 2007.
[16] M. Najork. Systems and methods for inferring uniform resource locator (URL) normalization rules. US Patent Application Publication 2006/0218143, 2006.
[17] A. Pereira, R. Baeza-Yates, and N. Ziviani. Where and how duplicates occur in the web. In Proc. 4th Latin American Web Congress, pages 127--134, 2006.
Index Terms: De-duping URLs via rewrite rules