skip to main content
10.1145/1148170.1148222acmconferencesArticle/Chapter ViewAbstractPublication PagesirConference Proceedingsconference-collections
Article

Finding near-duplicate web pages: a large-scale evaluation of algorithms

Published:06 August 2006Publication History

ABSTRACT

Broder et al.'s [3] shingling algorithm and Charikar's [4] random projection based approach are considered "state-of-the-art" algorithms for finding near-duplicate web pages. Both algorithms were either developed at or used by popular web search engines. We compare the two algorithms on a very large scale, namely on a set of 1.6B distinct web pages. The results show that neither of the algorithms works well for finding near-duplicate pairs on the same site, while both achieve high precision for near-duplicate pairs on different sites. Since Charikar's algorithm finds more near-duplicate pairs on different sites, it achieves a better precision overall, namely 0.50 versus 0.38 for Broder et al.'s algorithm. We present a combined algorithm which achieves precision 0.79 with 79% of the recall of the other algorithms.

References

  1. S. Brin, J. Davis, and H. Garcia-Molina. Copy Detection Mechanisms for Digital Documents. In 1995 ACM SIGMOD International Conference on Management of Data (May 1995), 398--409.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. A. Broder. Some applications of Rabin's fingerprinting method. In Renato Capocelli, Alfredo De Santis, and Ugo Vaccaro, editors, Sequences II: Methods in Communications, Security, and Computer Science, 1993:143--152.]]Google ScholarGoogle Scholar
  3. A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic Clustering of the Web. In 6th International World Wide Web Conference (Apr. 1997), 393--404.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. M. S. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In 34th Annual ACM Symposium on Theory of Computing (May 2002).]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. M. S. Charikar. Private communication.]]Google ScholarGoogle Scholar
  6. J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In 6th Symposium on Operating System Design and Implementation (Dec. 2004), 137--150.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. D. Fetterly, M. Manasse, and M. Najork. On the Evolution of Clusters of Near-Duplicate Web Pages. In 1st Latin American Web Congress (Nov. 2003), 37--45.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. D. Fetterly, M. Manasse, and M. Najork. Detecting Phrase-Level Duplication on the World Wide Web. To appear in 28th Annual International ACM SIGIR Conference (Aug. 2005).]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. N. Heintze. Scalable Document Fingerprinting. In Proc. of the 2nd USENIX Workshop on Electronic Commerce (Nov 1996).]]Google ScholarGoogle Scholar
  10. T. C. Hoad and J. Zobel. Methods for identifying versioned and plagiarised documents. Journal of the American Society for Information Science and Technology 54(3):203--215, 2003.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. U. Manber. Finding similar files in a large file system. In Proc. of the USENIX Winter 1994 Technical Conference (Jan. 1994).]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. M. Rabin. Fingerprinting by random polynomials. Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.]]Google ScholarGoogle Scholar
  13. N. Shivakumar and H. Garcia-Molina. SCAM: a copy detection mechanism for digital documents. In Proc. International Conference on Theory and Practice of Digital Libraries (June 1995).]]Google ScholarGoogle Scholar
  14. N. Shivakumar and H. Garcia-Molina. Building a scalable and accurate copy detection mechanism. In Proc. ACM Conference on Digital Libraries (March 1996), 160--168.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. N. Shivakumar and H. Garcia-Molina. Finding near-replicas of documents on the web. In Proc. Workshop on Web Databases (March 1998), 204--212.]] Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Finding near-duplicate web pages: a large-scale evaluation of algorithms

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
          August 2006
          768 pages
          ISBN:1595933697
          DOI:10.1145/1148170

          Copyright © 2006 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 6 August 2006

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • Article

          Acceptance Rates

          Overall Acceptance Rate792of3,983submissions,20%

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader