|
ABSTRACT
Broder et al.'s [3] shingling algorithm and Charikar's [4] random projection based approach are considered "state-of-the-art" algorithms for finding near-duplicate web pages. Both algorithms were either developed at or used by popular web search engines. We compare the two algorithms on a very large scale, namely on a set of 1.6B distinct web pages. The results show that neither of the algorithms works well for finding near-duplicate pairs on the same site, while both achieve high precision for near-duplicate pairs on different sites. Since Charikar's algorithm finds more near-duplicate pairs on different sites, it achieves a better precision overall, namely 0.50 versus 0.38 for Broder et al.'s algorithm. We present a combined algorithm which achieves precision 0.79 with 79% of the recall of the other algorithms.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Sergey Brin , James Davis , Héctor García-Molina, Copy detection mechanisms for digital documents, Proceedings of the 1995 ACM SIGMOD international conference on Management of data, p.398-409, May 22-25, 1995, San Jose, California, United States
|
| |
2
|
A. Broder. Some applications of Rabin's fingerprinting method. In Renato Capocelli, Alfredo De Santis, and Ugo Vaccaro, editors, Sequences II: Methods in Communications, Security, and Computer Science, 1993:143--152.
|
| |
3
|
Andrei Z. Broder , Steven C. Glassman , Mark S. Manasse , Geoffrey Zweig, Syntactic clustering of the Web, Selected papers from the sixth international conference on World Wide Web, p.1157-1166, September 1997, Santa Clara, California, United States
|
 |
4
|
|
| |
5
|
M. S. Charikar. Private communication.
|
| |
6
|
J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In 6th Symposium on Operating System Design and Implementation (Dec. 2004), 137--150.
|
| |
7
|
|
 |
8
|
|
| |
9
|
N. Heintze. Scalable Document Fingerprinting. In Proc. of the 2nd USENIX Workshop on Electronic Commerce (Nov 1996).
|
| |
10
|
|
| |
11
|
U. Manber. Finding similar files in a large file system. In Proc. of the USENIX Winter 1994 Technical Conference (Jan. 1994).
|
| |
12
|
M. Rabin. Fingerprinting by random polynomials. Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.
|
| |
13
|
N. Shivakumar and H. Garcia-Molina. SCAM: a copy detection mechanism for digital documents. In Proc. International Conference on Theory and Practice of Digital Libraries (June 1995).
|
 |
14
|
|
| |
15
|
|
CITED BY 10
|
Ondřej Chum , James Philbin , Michael Isard , Andrew Zisserman, Scalable near identical image and shot detection, Proceedings of the 6th ACM international conference on Image and video retrieval, p.549-556, July 09-11, 2007, Amsterdam, The Netherlands
|
|
Scott Huffman , April Lehman , Alexei Stolboushkin , Howard Wong-Toi , Fan Yang , Hein Roehrig, Multiple-signal duplicate detection for search evaluation, Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, July 23-27, 2007, Amsterdam, The Netherlands
|
|
|
|
|
|
|
|
|
Raymond K. Pon , Alfonso F. Cardenas , David Buttler , Terence Critchlow, Tracking multiple topics for finding interesting articles, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
|
|
Rui Cai , Jiang-Ming Yang , Wei Lai , Yida Wang , Lei Zhang, iRobot: an intelligent crawler for web forums, Proceeding of the 17th international conference on World Wide Web, April 21-25, 2008, Beijing, China
|
|
|
|
|
|
|
|