skip to main content
10.1145/1277741.1277782acmconferencesArticle/Chapter ViewAbstractPublication PagesirConference Proceedingsconference-collections
Article

Multiple-signal duplicate detection for search evaluation

Published: 23 July 2007 Publication History

Abstract

We consider the problem of duplicate document detection for search evaluation. Given a query and a small number of web results for that query, we show how to detect duplicate web documents with precision ~0.91 and recall ~77. In contrast, Charikar's algorithm, designed for duplicate detection in an indexing pipeline, achieves precision ~0.91 but with a recall of ~0.58. Our improvement in recall while maintaining high precision comes from combining three ideas. First, because we are only concerned with duplicate detection among results for the same query, the number of pairwise comparisons is small. Therefore we can afford to compute multiple pairwise signals for each pair of documents. A model learned with standard machine-learning techniques improves recall to ~0.68 with precision ~0.90. Second, most duplicate detection has focused on text analysis of the HTML contents of a document. In some web pages the HTML is not a good indicator of the final contents of the page. We use extended fetching techniques to fill in frames and execute Java script. Including signals based on our richer fetches further improves the recall to ~0.75 and the precision to ~0.91. Finally, we also explore using signals based on the query. Comparing contextual snippets based on the richer fetches improves the recall to ~0.77. We show that the overall accuracy of this final model approaches that of human judges.

References

[1]
S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In M. J. Carey and D. A. Schneider, editors, SIGMOD Conference, pages 398--409. ACM Press, 1995.
[2]
A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks, 29(8-13):1157--1166, 1997.
[3]
M. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380--388, 2002.
[4]
R. Cilibrasi and P. Vitanyi. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523--1545, 2005.
[5]
W. W. Cohen. Fast effective rule induction. In A. Prieditis and S. Russell, editors, Proc. of the 12th International Conference on Machine Learning, pages 115--123, Tahoe City, CA, July 9-12, 1995. Morgan Kaufmann.
[6]
Crowbar. http://simile.mit.edu/wiki/Crowbar.
[7]
B. Gomes and B. Smith. Detecting query-specific duplicate documents, 2003. U.S. Patent 6,615,209.
[8]
N. Heintze. Scalable document fingerprinting. In USENIX Workshop on Electronic Commerce, 1996.
[9]
M. R. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In E. N. Efthimiadis, S. T. Dumais, D. Hawking, and K. Jarvelin, editors, SIGIR, pages 284--291. ACM, 2006.
[10]
K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422--446, 2002.
[11]
U. Manber. Finding similar files in a large file system. In USENIX Winter, pages 1--10, 1994.
[12]
Open directory project. http://www.dmoz.org.
[13]
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[14]
S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: Local algorithms for document fingerprinting. In A. Y. Halevy, Z. G. Ives, and A. Doan, editors, SIGMOD Conference, pages 76--85. ACM, 2003.
[15]
N. Shivakumar and H. Garcia-Molina. Scam: A copy detection mechanism for digital documents. In Proceedings of the Second Annual Conference on the Theory and Practice of Digital Libraries, 1995.
[16]
N. Shivakumar and H. Garcia-Molina. Building a scalable and accurate copy detection mechanism. In Proceedings of the First ACM Conference on Digital Libraries, 1996.
[17]
N. Shivakumar and H. Garcia-Molina. Finding near-replicas of documents on the web. In WEBDB: International Workshop on the World Wide Web and Databases, 1998.
[18]
I. H. Witten and E. Frank. Weka: Practical machine learning tools and techniques with Java implementations. Morgan-Kaufmann, 1999.

Cited By

View all

Index Terms

  1. Multiple-signal duplicate detection for search evaluation

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
    July 2007
    946 pages
    ISBN:9781595935977
    DOI:10.1145/1277741
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 23 July 2007

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. compression distance
    2. duplicate detection
    3. machine learning
    4. search evaluation
    5. similarity hash
    6. web search

    Qualifiers

    • Article

    Conference

    SIGIR07
    Sponsor:
    SIGIR07: The 30th Annual International SIGIR Conference
    July 23 - 27, 2007
    Amsterdam, The Netherlands

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)5
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 12 Feb 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2018)Web Search Result De-duplication and ClusteringEncyclopedia of Database Systems10.1007/978-1-4614-8265-9_326(4661-4666)Online publication date: 7-Dec-2018
    • (2016)Web Search Result De-duplication and ClusteringEncyclopedia of Database Systems10.1007/978-1-4899-7993-3_326-2(1-6)Online publication date: 20-Dec-2016
    • (2014)Reducing Redundant Information in Search Results Employing Approximation AlgorithmsDatabase and Expert Systems Applications10.1007/978-3-319-10085-2_22(240-247)Online publication date: 2014
    • (2013)Reducing information redundancy in search resultsProceedings of the 28th Annual ACM Symposium on Applied Computing10.1145/2480362.2480533(886-893)Online publication date: 18-Mar-2013
    • (2013)Detecting near-duplicate documents using sentence-level features and supervised learningExpert Systems with Applications: An International Journal10.1016/j.eswa.2012.08.04540:5(1467-1476)Online publication date: 1-Apr-2013
    • (2012)Robust and Efficient Near-Duplicate Documents Detection on P2P Networks2012 Spring Congress on Engineering and Technology10.1109/SCET.2012.6342140(1-4)Online publication date: May-2012
    • (2012)Parallelized Near-Duplicate Document Detection Algorithm for Large Scale Chinese Web PagesProceedings of the 2012 13th International Conference on Parallel and Distributed Computing, Applications and Technologies10.1109/PDCAT.2012.108(523-528)Online publication date: 14-Dec-2012
    • (2012)ReferencesMultimedia Information Extraction10.1002/9781118219546.refs(425-460)Online publication date: 24-Aug-2012
    • (2009)Automatic video tagging using content redundancyProceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval10.1145/1571941.1572010(395-402)Online publication date: 19-Jul-2009
    • (2009)Exploiting Sentence-Level Features for Near-Duplicate Document DetectionProceedings of the 5th Asia Information Retrieval Symposium on Information Retrieval Technology10.1007/978-3-642-04769-5_18(205-217)Online publication date: 1-Oct-2009
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media