ABSTRACT
We propose a similarity index for set-valued features and study algorithms for executing various set similarity queries on it. Such queries are fundamental for many application areas, including data integration and cleaning, data profiling as well as near duplicate document detection. In this paper, we focus on Jaccard similarity and present estimators that work for arbitrary similarity thresholds based on a single similarity index. We show how to build this similarity index a-priori, without knowledge about query similarity thresholds, based on recently proposed synopses for multiset operations. The index is deployed using existing disk-based inverted indexing implementations and our algorithms exploit available techniques, like skip-lists, to further optimize the query performance. The index has provably small space footprints, is orders of magnitude smaller and faster to create/incrementally maintain than exact solutions, and the algorithms provide approximate answers, with an error that is controlled by a user-specified parameter. We prove the error bounds of our algorithms analytically, and, finally, we demonstrate the performance of the algorithms and verify their accuracy experimentally.
- Arvind Arasu, Venkatesh Ganti, and Raghav Kaushik. Efficient Exact Set-Similarity Joins. In VLDB, pages 918--929, 2006. Google ScholarDigital Library
- Ricardo A. Baeza-Yates and Berthier A. Ribeiro-Neto. Modern Information Retrieval. ACM Press/Addison-Wesley, 1999. Google ScholarDigital Library
- Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting Distinct Elements in a Data Stream. In RANDOM, pages 1--10, 2002. Google ScholarDigital Library
- Roberto J. Bayardo, Yiming Ma, and Ramakrishnan Srikant. Scaling up all Pairs Similarity Search. In WWW, pages 131--140, 2007. Google ScholarDigital Library
- Kevin Beyer, Peter J. Haas, Berthold Reinwald, Yannis Sismanis, and Rainer Gemulla. On Synopses for Distinct-Value Estimation under Multiset Operations. In SIGMOD, pages 199--210, 2007. Google ScholarDigital Library
- Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Zien. Efficient Query Evaluation using a Two-Level Retrieval Process. In CIKM, pages 426--434, 2003. Google ScholarDigital Library
- Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise Independent Permutations (extended abstract). In STOC, pages 327--336, 1998. Google ScholarDigital Library
- Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic Clustering of the Web. Comput. Netw. ISDN Syst., 29:1157--1166, 1997. Google ScholarDigital Library
- Moses S. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In STOC, pages 380--388, 2002. Google ScholarDigital Library
- Surajit Chaudhuri, Venkatesh Ganti, and Raghav Kaushik. A Primitive Operator for Similarity Joins in Data Cleaning. In ICDE, page 5, 2006. Google ScholarDigital Library
- Steve Chien and Nicole Immorlica. Semantic Similarity between Search Engine Queries using Temporal Correlation. In WWW, pages 2--11, 2005. Google ScholarDigital Library
- Edith Cohen and Haim Kaplan. Bottom-k Sketches: Better and More Efficient Estimation of Aggregates. In SIGMETRICS, pages 353--354, 2007. Google ScholarDigital Library
- Edith Cohen and Haim Kaplan. Summarizing Data using Bottom-k Sketches. In PODC, pages 225--234, 2007. Google ScholarDigital Library
- Ronald Fagin, Ravi Kumar, and D. Sivakumar. Efficient Similarity Search and Classification via Rank Aggregation. In SIGMOD, pages 301--312, 2003. Google ScholarDigital Library
- Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity Search in high Dimensions via Hashing. In VLDB, pages 518--529, 1999. Google ScholarDigital Library
- Marios Hadjieleftheriou, Xiaohui Yu, Nick Koudas, and Divesh Srivastava. Hashed Samples: Selectivity Estimators for Set Similarity Selection Queries. PVLDB, pages 201--212, 2008. Google ScholarDigital Library
- Monika Henzinger. Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms. In SIGIR, pages 284--291, 2006. Google ScholarDigital Library
- Piotr Indyk and Rajeev Motwani. Approximate Nearest Neighbors: Towards removing the Curse of Dimensionality. In STOC, pages 604--613, 1998. Google ScholarDigital Library
- Rajeev Motwani and Prabhakar Raghavan. Randomized algorithms. ACM Comput. Surv., pages 43--45, 1996. Google ScholarDigital Library
- Mehran Sahami and Timothy D. Heilman. A Web-Based Kernel Function for Measuring the Similarity of Short Text Snippets. In WWW, pages 377--386, 2006. Google ScholarDigital Library
- Sunita Sarawagi and Alok Kirpal. Efficient Set Joins on Similarity Predicates. In SIGMOD, pages 743--754, 2004. Google ScholarDigital Library
- Ellen Spertus, Mehran Sahami, and Orkut Buyukkokten. Evaluating Similarity Measures: a Large-Scale Study in the Orkut Social Network. In KDD, pages 678--684, 2005. Google ScholarDigital Library
- Martin Theobald, Jonathan Siddharth, and Andreas Paepcke. SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections. In SIGIR, pages 563--570, 2008. Google ScholarDigital Library
- Chuan Xiao, Wei Wang, Xuemin Lin, and Haichuan Shang. Top-k Set Similarity Joins. In ICDE, pages 916--927, 2009. Google ScholarDigital Library
- Chuan Xiao, Wei Wang, Xuemin Lin, and Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. In WWW, pages 131--140, 2008. Google ScholarDigital Library
Index Terms
- CRSI: a compact randomized similarity index for set-valued features
Recommendations
DotHash: Estimating Set Similarity Metrics for Link Prediction and Document Deduplication
KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data MiningMetrics for set similarity are a core aspect of several data mining tasks. To remove duplicate results in a Web search, for example, a common approach looks at the Jaccard index between all pairs of pages. In social network analysis, a much-celebrated ...
Subgraph Matching with Set Similarity in a Large Graph Database
In real-world graphs such as social networks, Semantic Web and biological networks, each vertex usually contains rich information, which can be modeled by a set of tokens or elements. In this paper, we study a subgraph matching with set similarity (SMS<...
Efficient estimation for high similarities using odd sketches
WWW '14: Proceedings of the 23rd international conference on World wide webEstimating set similarity is a central problem in many computer applications. In this paper we introduce the Odd Sketch, a compact binary sketch for estimating the Jaccard similarity of two sets. The exclusive-or of two sketches equals the sketch of the ...
Comments