Article

Multiple-signal duplicate detection for search evaluation

Authors:

Alexei Stolboushkin,

Howard Wong-Toi,

Hein RoehrigAuthors Info & Claims

SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval

Pages 223 - 230

https://doi.org/10.1145/1277741.1277782

Published: 23 July 2007 Publication History

Abstract

We consider the problem of duplicate document detection for search evaluation. Given a query and a small number of web results for that query, we show how to detect duplicate web documents with precision ~0.91 and recall ~77. In contrast, Charikar's algorithm, designed for duplicate detection in an indexing pipeline, achieves precision ~0.91 but with a recall of ~0.58. Our improvement in recall while maintaining high precision comes from combining three ideas. First, because we are only concerned with duplicate detection among results for the same query, the number of pairwise comparisons is small. Therefore we can afford to compute multiple pairwise signals for each pair of documents. A model learned with standard machine-learning techniques improves recall to ~0.68 with precision ~0.90. Second, most duplicate detection has focused on text analysis of the HTML contents of a document. In some web pages the HTML is not a good indicator of the final contents of the page. We use extended fetching techniques to fill in frames and execute Java script. Including signals based on our richer fetches further improves the recall to ~0.75 and the precision to ~0.91. Finally, we also explore using signals based on the query. Comparing contextual snippets based on the richer fetches improves the recall to ~0.77. We show that the overall accuracy of this final model approaches that of human judges.

References

[1]

S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In M. J. Carey and D. A. Schneider, editors, SIGMOD Conference, pages 398--409. ACM Press, 1995.

Digital Library

[2]

A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks, 29(8-13):1157--1166, 1997.

Digital Library

[3]

M. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380--388, 2002.

Digital Library

[4]

R. Cilibrasi and P. Vitanyi. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523--1545, 2005.

Digital Library

[5]

W. W. Cohen. Fast effective rule induction. In A. Prieditis and S. Russell, editors, Proc. of the 12th International Conference on Machine Learning, pages 115--123, Tahoe City, CA, July 9-12, 1995. Morgan Kaufmann.

[6]

Crowbar. http://simile.mit.edu/wiki/Crowbar.

[7]

B. Gomes and B. Smith. Detecting query-specific duplicate documents, 2003. U.S. Patent 6,615,209.

[8]

N. Heintze. Scalable document fingerprinting. In USENIX Workshop on Electronic Commerce, 1996.

[9]

M. R. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In E. N. Efthimiadis, S. T. Dumais, D. Hawking, and K. Jarvelin, editors, SIGIR, pages 284--291. ACM, 2006.

Digital Library

[10]

K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422--446, 2002.

Digital Library

[11]

U. Manber. Finding similar files in a large file system. In USENIX Winter, pages 1--10, 1994.

Digital Library

[12]

Open directory project. http://www.dmoz.org.

[13]

J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

Digital Library

[14]

S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: Local algorithms for document fingerprinting. In A. Y. Halevy, Z. G. Ives, and A. Doan, editors, SIGMOD Conference, pages 76--85. ACM, 2003.

Digital Library

[15]

N. Shivakumar and H. Garcia-Molina. Scam: A copy detection mechanism for digital documents. In Proceedings of the Second Annual Conference on the Theory and Practice of Digital Libraries, 1995.

Digital Library

[16]

N. Shivakumar and H. Garcia-Molina. Building a scalable and accurate copy detection mechanism. In Proceedings of the First ACM Conference on Digital Libraries, 1996.

Digital Library

[17]

N. Shivakumar and H. Garcia-Molina. Finding near-replicas of documents on the web. In WEBDB: International Workshop on the World Wide Web and Databases, 1998.

Digital Library

[18]

I. H. Witten and E. Frank. Weka: Practical machine learning tools and techniques with Java implementations. Morgan-Kaufmann, 1999.

Digital Library

Cited By

Shen XZhai C(2018)Web Search Result De-duplication and ClusteringEncyclopedia of Database Systems10.1007/978-1-4614-8265-9_326(4661-4666)Online publication date: 7-Dec-2018
https://doi.org/10.1007/978-1-4614-8265-9_326
Shen XZhai C(2016)Web Search Result De-duplication and ClusteringEncyclopedia of Database Systems10.1007/978-1-4899-7993-3_326-2(1-6)Online publication date: 20-Dec-2016
https://doi.org/10.1007/978-1-4899-7993-3_326-2
Makris CPlegas YStamatiou YStavropoulos ETsakalidis A(2014)Reducing Redundant Information in Search Results Employing Approximation AlgorithmsDatabase and Expert Systems Applications10.1007/978-3-319-10085-2_22(240-247)Online publication date: 2014
https://doi.org/10.1007/978-3-319-10085-2_22
Show More Cited By

Index Terms

Multiple-signal duplicate detection for search evaluation
1. Information systems
  1. Information retrieval

Recommendations

Detecting duplicate objects in XML documents
IQIS '04: Proceedings of the 2004 international workshop on Information quality in information systems

The problem of detecting duplicate entities that describe the same real-world object (and purging them) is an important data cleansing task, necessary to improve data quality. For data stored in a flat relation, numerous solutions to this problem exist. ...
Improving query expansion using pseudo-relevant web knowledge for information retrieval
Highlights
- Web knowledge-based query expansion technique uses the top N pseudo relevant web pages
Abstract
In the field of information retrieval, query expansion (QE) has long been used as a technique to deal with the fundamental issue of word mismatch between a user’s query and the target information. In the context of the relationship ...
Efficient and Effective Duplicate Detection in Hierarchical Data

Although there is a long line of work on identifying duplicates in relational data, only a few solutions focus on duplicate detection in more complex hierarchical structures, like XML data. In this paper, we present a novel method for XML duplicate ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval

July 2007

946 pages

ISBN:9781595935977

DOI:10.1145/1277741

General Chairs:
Wessel Kraaij
TNO, The Netherlands
,
Arjen P. de Vries
CWI, The Netherlands
,
Program Chairs:
Charles L. A. Clarke
University of Waterloo, Canada
,
Norbert Fuhr
University of Duisburg-Essen, Germany
,
Noriko Kando
National Institute of Informatics, Japan

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 July 2007

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Article

Conference

SIGIR07

Sponsor:

SIGIR07: The 30th Annual International SIGIR Conference

July 23 - 27, 2007

Amsterdam, The Netherlands

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

12
Total Citations
View Citations
790
Total Downloads

Downloads (Last 12 months)5
Downloads (Last 6 weeks)0

Reflects downloads up to 12 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Shen XZhai C(2018)Web Search Result De-duplication and ClusteringEncyclopedia of Database Systems10.1007/978-1-4614-8265-9_326(4661-4666)Online publication date: 7-Dec-2018
https://doi.org/10.1007/978-1-4614-8265-9_326
Shen XZhai C(2016)Web Search Result De-duplication and ClusteringEncyclopedia of Database Systems10.1007/978-1-4899-7993-3_326-2(1-6)Online publication date: 20-Dec-2016
https://doi.org/10.1007/978-1-4899-7993-3_326-2
Makris CPlegas YStamatiou YStavropoulos ETsakalidis A(2014)Reducing Redundant Information in Search Results Employing Approximation AlgorithmsDatabase and Expert Systems Applications10.1007/978-3-319-10085-2_22(240-247)Online publication date: 2014
https://doi.org/10.1007/978-3-319-10085-2_22
Plegas YStamou SShin SMaldonado J(2013)Reducing information redundancy in search resultsProceedings of the 28th Annual ACM Symposium on Applied Computing10.1145/2480362.2480533(886-893)Online publication date: 18-Mar-2013
https://dl.acm.org/doi/10.1145/2480362.2480533
Lin YLiao TLee S(2013)Detecting near-duplicate documents using sentence-level features and supervised learningExpert Systems with Applications: An International Journal10.1016/j.eswa.2012.08.04540:5(1467-1476)Online publication date: 1-Apr-2013
https://dl.acm.org/doi/10.1016/j.eswa.2012.08.045
Zhang YDai XWang CWei Z(2012)Robust and Efficient Near-Duplicate Documents Detection on P2P Networks2012 Spring Congress on Engineering and Technology10.1109/SCET.2012.6342140(1-4)Online publication date: May-2012
https://doi.org/10.1109/SCET.2012.6342140
Wei YWang SYuan CHuang Y(2012)Parallelized Near-Duplicate Document Detection Algorithm for Large Scale Chinese Web PagesProceedings of the 2012 13th International Conference on Parallel and Distributed Computing, Applications and Technologies10.1109/PDCAT.2012.108(523-528)Online publication date: 14-Dec-2012
https://dl.acm.org/doi/10.1109/PDCAT.2012.108
(2012)ReferencesMultimedia Information Extraction10.1002/9781118219546.refs(425-460)Online publication date: 24-Aug-2012
https://doi.org/10.1002/9781118219546.refs
Siersdorfer SSan Pedro JSanderson MAllan JAslam JSanderson MZhai CZobel J(2009)Automatic video tagging using content redundancyProceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval10.1145/1571941.1572010(395-402)Online publication date: 19-Jul-2009
https://dl.acm.org/doi/10.1145/1571941.1572010
Wang JChang H(2009)Exploiting Sentence-Level Features for Near-Duplicate Document DetectionProceedings of the 5th Asia Information Retrieval Symposium on Information Retrieval Technology10.1007/978-3-642-04769-5_18(205-217)Online publication date: 1-Oct-2009
https://dl.acm.org/doi/10.1007/978-3-642-04769-5_18
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten