ACM Home Page
Please provide us with feedback. Feedback
Near-duplicate detection by instance-level constrained clustering
Full text PdfPdf (272 KB)
Source Annual ACM Conference on Research and Development in Information Retrieval archive
Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval table of contents
Seattle, Washington, USA
SESSION: Clustering table of contents
Pages: 421 - 428  
Year of Publication: 2006
ISBN:1-59593-369-7
Authors
Hui Yang  Carnegie Mellon University, Pittsburgh, PA
Jamie Callan  Carnegie Mellon University, Pittsburgh, PA
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 25,   Downloads (12 Months): 183,   Citation Count: 2
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
Save this Article to a Binder    Display Formats: BibTex  EndNote ACM Ref   
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1148170.1148243
What is a DOI?

ABSTRACT

For the task of near-duplicated document detection, both traditional fingerprinting techniques used in database community and bag-of-word comparison approaches used in information retrieval community are not sufficiently accurate. This is due to the fact that the characteristics of near-duplicated documents are different from that of both "almost-identical" documents in the data cleaning task and "relevant" documents in the search task. This paper presents an instance-level constrained clustering approach for near-duplicate detection. The framework incorporates information such as document attributes and content structure into the clustering process to form near-duplicate clusters. Gathered from several collections of public comments sent to U.S. government agencies on proposed new regulations, the experimental results demonstrate that our approach outperforms other near-duplicate detection algorithms and as about as effective as human assessors.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
Y. Bernstein and J. Zobel, A scalable system for identifying co-derivative documents. In Proceedings of the String Processing and Information Retrieval Symposium (SPIRE), page 55--67, Padova, Italy, September 2004.
 
3
4
5
 
6
K. Gwet. Kappa Statistic is not Satisfactory for Assessing the Extent of Agreement between Raters. Statistical Methods for Inter-rater Reliability Assessment, No.1, April 2002.
 
7
 
8
9
10
 
11
NIST, "Secure Hash Standard", Federal Information Processing Standards Publication 180--1, 1995.
 
12
N. Shivakumar and H. Garcia-Molina. SCAM: a copy detection mechanism for digital documents. In Proc. International Conference on Theory and Practice of Digital Libraries, Austin, Texas, June 1995.
 
13
S.W. Shulman, E-Rulemaking: Issues in Current Research and Practice. International Journal of Public Administration 28: 621--641. 2005.
 
14
 
15
E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems, 2003.
 
16
17