skip to main content
10.1145/2247596.2247642acmotherconferencesArticle/Chapter ViewAbstractPublication PagesedbtConference Proceedingsconference-collections
research-article

CRSI: a compact randomized similarity index for set-valued features

Published:27 March 2012Publication History

ABSTRACT

We propose a similarity index for set-valued features and study algorithms for executing various set similarity queries on it. Such queries are fundamental for many application areas, including data integration and cleaning, data profiling as well as near duplicate document detection. In this paper, we focus on Jaccard similarity and present estimators that work for arbitrary similarity thresholds based on a single similarity index. We show how to build this similarity index a-priori, without knowledge about query similarity thresholds, based on recently proposed synopses for multiset operations. The index is deployed using existing disk-based inverted indexing implementations and our algorithms exploit available techniques, like skip-lists, to further optimize the query performance. The index has provably small space footprints, is orders of magnitude smaller and faster to create/incrementally maintain than exact solutions, and the algorithms provide approximate answers, with an error that is controlled by a user-specified parameter. We prove the error bounds of our algorithms analytically, and, finally, we demonstrate the performance of the algorithms and verify their accuracy experimentally.

References

  1. Arvind Arasu, Venkatesh Ganti, and Raghav Kaushik. Efficient Exact Set-Similarity Joins. In VLDB, pages 918--929, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Ricardo A. Baeza-Yates and Berthier A. Ribeiro-Neto. Modern Information Retrieval. ACM Press/Addison-Wesley, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting Distinct Elements in a Data Stream. In RANDOM, pages 1--10, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Roberto J. Bayardo, Yiming Ma, and Ramakrishnan Srikant. Scaling up all Pairs Similarity Search. In WWW, pages 131--140, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Kevin Beyer, Peter J. Haas, Berthold Reinwald, Yannis Sismanis, and Rainer Gemulla. On Synopses for Distinct-Value Estimation under Multiset Operations. In SIGMOD, pages 199--210, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Zien. Efficient Query Evaluation using a Two-Level Retrieval Process. In CIKM, pages 426--434, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise Independent Permutations (extended abstract). In STOC, pages 327--336, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic Clustering of the Web. Comput. Netw. ISDN Syst., 29:1157--1166, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Moses S. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In STOC, pages 380--388, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Surajit Chaudhuri, Venkatesh Ganti, and Raghav Kaushik. A Primitive Operator for Similarity Joins in Data Cleaning. In ICDE, page 5, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Steve Chien and Nicole Immorlica. Semantic Similarity between Search Engine Queries using Temporal Correlation. In WWW, pages 2--11, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Edith Cohen and Haim Kaplan. Bottom-k Sketches: Better and More Efficient Estimation of Aggregates. In SIGMETRICS, pages 353--354, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Edith Cohen and Haim Kaplan. Summarizing Data using Bottom-k Sketches. In PODC, pages 225--234, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Ronald Fagin, Ravi Kumar, and D. Sivakumar. Efficient Similarity Search and Classification via Rank Aggregation. In SIGMOD, pages 301--312, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity Search in high Dimensions via Hashing. In VLDB, pages 518--529, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Marios Hadjieleftheriou, Xiaohui Yu, Nick Koudas, and Divesh Srivastava. Hashed Samples: Selectivity Estimators for Set Similarity Selection Queries. PVLDB, pages 201--212, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Monika Henzinger. Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms. In SIGIR, pages 284--291, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Piotr Indyk and Rajeev Motwani. Approximate Nearest Neighbors: Towards removing the Curse of Dimensionality. In STOC, pages 604--613, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Rajeev Motwani and Prabhakar Raghavan. Randomized algorithms. ACM Comput. Surv., pages 43--45, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Mehran Sahami and Timothy D. Heilman. A Web-Based Kernel Function for Measuring the Similarity of Short Text Snippets. In WWW, pages 377--386, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Sunita Sarawagi and Alok Kirpal. Efficient Set Joins on Similarity Predicates. In SIGMOD, pages 743--754, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Ellen Spertus, Mehran Sahami, and Orkut Buyukkokten. Evaluating Similarity Measures: a Large-Scale Study in the Orkut Social Network. In KDD, pages 678--684, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Martin Theobald, Jonathan Siddharth, and Andreas Paepcke. SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections. In SIGIR, pages 563--570, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Chuan Xiao, Wei Wang, Xuemin Lin, and Haichuan Shang. Top-k Set Similarity Joins. In ICDE, pages 916--927, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Chuan Xiao, Wei Wang, Xuemin Lin, and Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. In WWW, pages 131--140, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. CRSI: a compact randomized similarity index for set-valued features

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Other conferences
      EDBT '12: Proceedings of the 15th International Conference on Extending Database Technology
      March 2012
      643 pages
      ISBN:9781450307901
      DOI:10.1145/2247596

      Copyright © 2012 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 27 March 2012

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate7of10submissions,70%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader