research-article

CRSI: a compact randomized similarity index for set-valued features

Authors:
Petros Venetis

Stanford University

Stanford University
View Profile

,
Yannis Sismanis

IBM Research -- Almaden

IBM Research -- Almaden
View Profile

,
Berthold Reinwald

IBM Research -- Almaden

IBM Research -- Almaden
View Profile

EDBT '12: Proceedings of the 15th International Conference on Extending Database TechnologyMarch 2012Pages 384–395https://doi.org/10.1145/2247596.2247642

Published:27 March 2012Publication History

EDBT '12: Proceedings of the 15th International Conference on Extending Database Technology

Pages 384–395

ABSTRACT

We propose a similarity index for set-valued features and study algorithms for executing various set similarity queries on it. Such queries are fundamental for many application areas, including data integration and cleaning, data profiling as well as near duplicate document detection. In this paper, we focus on Jaccard similarity and present estimators that work for arbitrary similarity thresholds based on a single similarity index. We show how to build this similarity index a-priori, without knowledge about query similarity thresholds, based on recently proposed synopses for multiset operations. The index is deployed using existing disk-based inverted indexing implementations and our algorithms exploit available techniques, like skip-lists, to further optimize the query performance. The index has provably small space footprints, is orders of magnitude smaller and faster to create/incrementally maintain than exact solutions, and the algorithms provide approximate answers, with an error that is controlled by a user-specified parameter. We prove the error bounds of our algorithms analytically, and, finally, we demonstrate the performance of the algorithms and verify their accuracy experimentally.

References

Arvind Arasu, Venkatesh Ganti, and Raghav Kaushik. Efficient Exact Set-Similarity Joins. In VLDB, pages 918--929, 2006. Google ScholarDigital Library
Ricardo A. Baeza-Yates and Berthier A. Ribeiro-Neto. Modern Information Retrieval. ACM Press/Addison-Wesley, 1999. Google ScholarDigital Library
Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting Distinct Elements in a Data Stream. In RANDOM, pages 1--10, 2002. Google ScholarDigital Library
Roberto J. Bayardo, Yiming Ma, and Ramakrishnan Srikant. Scaling up all Pairs Similarity Search. In WWW, pages 131--140, 2007. Google ScholarDigital Library
Kevin Beyer, Peter J. Haas, Berthold Reinwald, Yannis Sismanis, and Rainer Gemulla. On Synopses for Distinct-Value Estimation under Multiset Operations. In SIGMOD, pages 199--210, 2007. Google ScholarDigital Library
Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Zien. Efficient Query Evaluation using a Two-Level Retrieval Process. In CIKM, pages 426--434, 2003. Google ScholarDigital Library
Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise Independent Permutations (extended abstract). In STOC, pages 327--336, 1998. Google ScholarDigital Library
Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic Clustering of the Web. Comput. Netw. ISDN Syst., 29:1157--1166, 1997. Google ScholarDigital Library
Moses S. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In STOC, pages 380--388, 2002. Google ScholarDigital Library
Surajit Chaudhuri, Venkatesh Ganti, and Raghav Kaushik. A Primitive Operator for Similarity Joins in Data Cleaning. In ICDE, page 5, 2006. Google ScholarDigital Library
Steve Chien and Nicole Immorlica. Semantic Similarity between Search Engine Queries using Temporal Correlation. In WWW, pages 2--11, 2005. Google ScholarDigital Library
Edith Cohen and Haim Kaplan. Bottom-k Sketches: Better and More Efficient Estimation of Aggregates. In SIGMETRICS, pages 353--354, 2007. Google ScholarDigital Library
Edith Cohen and Haim Kaplan. Summarizing Data using Bottom-k Sketches. In PODC, pages 225--234, 2007. Google ScholarDigital Library
Ronald Fagin, Ravi Kumar, and D. Sivakumar. Efficient Similarity Search and Classification via Rank Aggregation. In SIGMOD, pages 301--312, 2003. Google ScholarDigital Library
Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity Search in high Dimensions via Hashing. In VLDB, pages 518--529, 1999. Google ScholarDigital Library
Marios Hadjieleftheriou, Xiaohui Yu, Nick Koudas, and Divesh Srivastava. Hashed Samples: Selectivity Estimators for Set Similarity Selection Queries. PVLDB, pages 201--212, 2008. Google ScholarDigital Library
Monika Henzinger. Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms. In SIGIR, pages 284--291, 2006. Google ScholarDigital Library
Piotr Indyk and Rajeev Motwani. Approximate Nearest Neighbors: Towards removing the Curse of Dimensionality. In STOC, pages 604--613, 1998. Google ScholarDigital Library
Rajeev Motwani and Prabhakar Raghavan. Randomized algorithms. ACM Comput. Surv., pages 43--45, 1996. Google ScholarDigital Library
Mehran Sahami and Timothy D. Heilman. A Web-Based Kernel Function for Measuring the Similarity of Short Text Snippets. In WWW, pages 377--386, 2006. Google ScholarDigital Library
Sunita Sarawagi and Alok Kirpal. Efficient Set Joins on Similarity Predicates. In SIGMOD, pages 743--754, 2004. Google ScholarDigital Library
Ellen Spertus, Mehran Sahami, and Orkut Buyukkokten. Evaluating Similarity Measures: a Large-Scale Study in the Orkut Social Network. In KDD, pages 678--684, 2005. Google ScholarDigital Library
Martin Theobald, Jonathan Siddharth, and Andreas Paepcke. SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections. In SIGIR, pages 563--570, 2008. Google ScholarDigital Library
Chuan Xiao, Wei Wang, Xuemin Lin, and Haichuan Shang. Top-k Set Similarity Joins. In ICDE, pages 916--927, 2009. Google ScholarDigital Library
Chuan Xiao, Wei Wang, Xuemin Lin, and Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. In WWW, pages 131--140, 2008. Google ScholarDigital Library

Index Terms

CRSI: a compact randomized similarity index for set-valued features
1. Information systems
  1. Information systems applications

Recommendations

DotHash: Estimating Set Similarity Metrics for Link Prediction and Document Deduplication
KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Metrics for set similarity are a core aspect of several data mining tasks. To remove duplicate results in a Web search, for example, a common approach looks at the Jaccard index between all pairs of pages. In social network analysis, a much-celebrated ...
Read More
Subgraph Matching with Set Similarity in a Large Graph Database
In real-world graphs such as social networks, Semantic Web and biological networks, each vertex usually contains rich information, which can be modeled by a set of tokens or elements. In this paper, we study a subgraph matching with set similarity (SMS<...
Read More
Efficient estimation for high similarities using odd sketches
WWW '14: Proceedings of the 23rd international conference on World wide web

Estimating set similarity is a central problem in many computer applications. In this paper we introduce the Odd Sketch, a compact binary sketch for estimating the Jaccard similarity of two sets. The exclusive-or of two sketches equals the sketch of the ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
EDBT '12: Proceedings of the 15th International Conference on Extending Database Technology
March 2012
643 pages
ISBN:9781450307901
DOI:10.1145/2247596
Editors:
Elke Rundensteiner
Worcester Polytechnic Institute
,
Volker Markl
Technische Universität Berlin, Germany
,
Ioana Manolescu
INRIA, France
,
Sihem Amer-Yahia
QCRI, Doha, Qatar
,
Felix Naumann
Hasso Plattner Institute, Potsdam, Germany
,
Ismail Ari
Ozyegin University, Turkey
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 27 March 2012
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
KMV synopsis
set similarity
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate7of10submissions,70%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 3
  Total Citations
  View Citations
- 95
  Total Downloads
- Downloads (Last 12 months)2
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

CRSI: a compact randomized similarity index for set-valued features

EDBT '12: Proceedings of the 15th International Conference on Extending Database Technology

ABSTRACT

References

Cited By

Index Terms

Recommendations

DotHash: Estimating Set Similarity Metrics for Link Prediction and Document Deduplication

Subgraph Matching with Set Similarity in a Large Graph Database

Efficient estimation for high similarities using odd sketches