ABSTRACT
Broder et al.'s [3] shingling algorithm and Charikar's [4] random projection based approach are considered "state-of-the-art" algorithms for finding near-duplicate web pages. Both algorithms were either developed at or used by popular web search engines. We compare the two algorithms on a very large scale, namely on a set of 1.6B distinct web pages. The results show that neither of the algorithms works well for finding near-duplicate pairs on the same site, while both achieve high precision for near-duplicate pairs on different sites. Since Charikar's algorithm finds more near-duplicate pairs on different sites, it achieves a better precision overall, namely 0.50 versus 0.38 for Broder et al.'s algorithm. We present a combined algorithm which achieves precision 0.79 with 79% of the recall of the other algorithms.
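For readers unfamiliar with the shingling idea the abstract contrasts with random projections, the following is a minimal background sketch of shingling plus a min-hash similarity estimate. It is not the paper's implementation: the shingle size `k`, the number of hash functions `n`, and the SHA-1-based hashing are illustrative assumptions, not the parameters used by Broder et al. or in this evaluation.

```python
import hashlib
from typing import List, Set

def shingles(text: str, k: int = 4) -> Set[str]:
    # A document's k-shingles: every window of k consecutive words.
    # k = 4 is an illustrative choice, not the paper's setting.
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def _hash(s: str, seed: int) -> int:
    # A family of hash functions, indexed by seed (illustrative construction).
    return int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")

def minhash_sketch(sh: Set[str], n: int = 64) -> List[int]:
    # One minimum per hash function. For two sets, the probability that
    # their minima under a random hash function agree equals their
    # Jaccard similarity, so the sketch supports similarity estimation.
    return [min(_hash(s, seed) for s in sh) for seed in range(n)]

def estimated_similarity(a: List[int], b: List[int]) -> float:
    # Fraction of agreeing sketch positions ~ Jaccard similarity.
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Near-duplicate pages share most of their shingles and therefore agree on most sketch positions, while unrelated pages agree on almost none; comparing fixed-size sketches instead of full shingle sets is what makes the approach feasible at the billion-page scale the paper evaluates.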
REFERENCES
[1] S. Brin, J. Davis, and H. Garcia-Molina. Copy Detection Mechanisms for Digital Documents. In 1995 ACM SIGMOD International Conference on Management of Data (May 1995), 398--409.
[2] A. Broder. Some applications of Rabin's fingerprinting method. In R. Capocelli, A. De Santis, and U. Vaccaro, editors, Sequences II: Methods in Communications, Security, and Computer Science, 1993, 143--152.
[3] A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic Clustering of the Web. In 6th International World Wide Web Conference (Apr. 1997), 393--404.
[4] M. S. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In 34th Annual ACM Symposium on Theory of Computing (May 2002).
[5] M. S. Charikar. Private communication.
[6] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In 6th Symposium on Operating System Design and Implementation (Dec. 2004), 137--150.
[7] D. Fetterly, M. Manasse, and M. Najork. On the Evolution of Clusters of Near-Duplicate Web Pages. In 1st Latin American Web Congress (Nov. 2003), 37--45.
[8] D. Fetterly, M. Manasse, and M. Najork. Detecting Phrase-Level Duplication on the World Wide Web. To appear in 28th Annual International ACM SIGIR Conference (Aug. 2005).
[9] N. Heintze. Scalable Document Fingerprinting. In 2nd USENIX Workshop on Electronic Commerce (Nov. 1996).
[10] T. C. Hoad and J. Zobel. Methods for identifying versioned and plagiarised documents. Journal of the American Society for Information Science and Technology 54(3):203--215, 2003.
[11] U. Manber. Finding similar files in a large file system. In USENIX Winter 1994 Technical Conference (Jan. 1994).
[12] M. Rabin. Fingerprinting by random polynomials. Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.
[13] N. Shivakumar and H. Garcia-Molina. SCAM: A copy detection mechanism for digital documents. In International Conference on Theory and Practice of Digital Libraries (June 1995).
[14] N. Shivakumar and H. Garcia-Molina. Building a scalable and accurate copy detection mechanism. In ACM Conference on Digital Libraries (Mar. 1996), 160--168.
[15] N. Shivakumar and H. Garcia-Molina. Finding near-replicas of documents on the web. In Workshop on Web Databases (Mar. 1998), 204--212.