skip to main content
10.1145/1401890.1401917acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article

De-duping URLs via rewrite rules

Published:24 August 2008Publication History

ABSTRACT

A large fraction of the URLs on the web contain duplicate (or near-duplicate) content. De-duping URLs is an extremely important problem for search engines, since all the principal functions of a search engine, including crawling, indexing, ranking, and presentation, are adversely impacted by the presence of duplicate URLs. Traditionally, the de-duping problem has been addressed by fetching and examining the content of the URL; our approach here is different. Given a set of URLs partitioned into equivalence classes based on the content (URLs in the same equivalence class have similar content), we address the problem of mining this set and learning URL rewrite rules that transform all URLs of an equivalence class to the same canonical form. These rewrite rules can then be applied to eliminate duplicates among URLs that are encountered for the first time during crawling, even without fetching their content.

In order to express such transformation rules, we propose a simple framework that is general enough to capture the most common URL rewrite patterns occurring on the web; in particular, it encapsulates the DUST (Different URLs with similar text) framework [5]. We provide an efficient algorithm for mining and learning URL rewrite rules and show that under mild assumptions, it is complete, i.e., our algorithm learns every URL rewrite rule that is correct, for an appropriate notion of correctness. We demonstrate the expressiveness of our framework and the effectiveness of our algorithm by performing a variety of extensive large-scale experiments.

References

  1. R. Ananthakrisha, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouse. In Proc. 28th VLDB, pages 586--597, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. D. Angluin. Finding patterns common to a set of strings (extended abstract). In Proc. of the 11th STOC, pages 130--141, 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. D. Angluin. Inference of reversible languages. J. ACM, 29(3):741--765, 1982. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. D. Angluin and C. H. Smith. Inductive inference: Theory and methods. ACM Comput. Surv., 15(3):237--269, 1983. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Z. Bar-Yossef, I. Keidar, and U. Schonfeld. Do not crawl in the DUST: different urls with similar text. In Proc. 16th WWW, pages 111--120, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. M. Bognar. A survey of abstract rewriting, 1995. www.di.ubi.pt/~desousa/1998-1999/logica/mb.ps.Google ScholarGoogle Scholar
  7. A. Broder. On the resemblance and containment of documents. In SEQS: Sequences '91, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. A. Broder, S. C. Glassman, M. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks, 29(8-13):1157--1166, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. M. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. 34th STOC, pages 380--388, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. S. Chaudhuri, V. Ganti, and R. Motwani. Robust idenfication of fuzzy duplicates. In Proc. 21st ICDE, pages 865--876, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Z. Chen, D. V. Kalashnikov, and S. Mehrotra. Adaptive graphical approach to entity resolution. In Proc. of the ACM/IEEE Joint Conference on Digital Libraries, pages 204--213, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. D. Fetterly, M. Manasse, and M. Najork. On the evolution of clusters of near-duplicate web pages. In Proc. of the 1st Conference on Latin American Web Congress, page 37, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. H. Garcia-Molina. Pair-wise entity resolution: Overview and challenges. In Proc. CIKM, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. M. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In Proc. 29th SIGIR, pages 284--291, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. G. S. Manku, A. Jain, and A. D. Sarma. Detecting near-duplicates for web crawling. In Proc. of the 16th International Conference on World Wide Web, pages 141--150, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. M. Najork. Systems and methods for inferring uniform resource locator (URL) normalization rules, 2006. US Patent Application Publication, 2006/0218143.Google ScholarGoogle Scholar
  17. A. Pereira, R. Baeza-Yates, and N. Ziviani. Where and how duplicates occur in the web. In Proc. of the 4th Latin American Web Congress, pages 127--134, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. De-duping URLs via rewrite rules

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
          August 2008
          1116 pages
          ISBN:9781605581934
          DOI:10.1145/1401890
          • General Chair:
          • Ying Li,
          • Program Chairs:
          • Bing Liu,
          • Sunita Sarawagi

          Copyright © 2008 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 24 August 2008

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article

          Acceptance Rates

          KDD '08 Paper Acceptance Rate118of593submissions,20%Overall Acceptance Rate1,133of8,635submissions,13%

          Upcoming Conference

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader