research-article

Set similarity search beyond MinHash

Authors:
Tobias Christiani

IT University of Copenhagen, Denmark

IT University of Copenhagen, Denmark
View Profile

,
Rasmus Pagh

IT University of Copenhagen, Denmark

IT University of Copenhagen, Denmark
View Profile

STOC 2017: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of ComputingJune 2017Pages 1094–1107https://doi.org/10.1145/3055399.3055443

Published:19 June 2017Publication History

STOC 2017: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing

Pages 1094–1107

ABSTRACT

We consider the problem of approximate set similarity search under Braun-Blanquet similarity B(x, y) = |x ∩ y| / max(|x|, |y|). The (b₁, b₂)-approximate Braun-Blanquet similarity search problem is to preprocess a collection of sets P such that, given a query set q, if there exists x Ε P with B(q, x) ≥ b₁, then we can efficiently return x′ Ε P with B(q, x′) > b₂.

We present a simple data structure that solves this problem with space usage O(n^1+ρlogn + ∑_{x ε P}|x|) and query time O(|q|n^ρ logn) where n = |P| and ρ = log(1/b₁)/log(1/b₂). Making use of existing lower bounds for locality-sensitive hashing by O'Donnell et al. (TOCT 2014) we show that this value of ρ is tight across the parameter space, i.e., for every choice of constants 0 < b₂ < b₁ < 1.

In the case where all sets have the same size our solution strictly improves upon the value of ρ that can be obtained through the use of state-of-the-art data-independent techniques in the Indyk-Motwani locality-sensitive hashing framework (STOC 1998) such as Broder's MinHash (CCS 1997) for Jaccard similarity and Andoni et al.'s cross-polytope LSH (NIPS 2015) for cosine similarity. Surprisingly, even though our solution is data-independent, for a large part of the parameter space we outperform the currently best data-dependent method by Andoni and Razenshteyn (STOC 2015).

Supplemental Material

d4_sc_t2.mp4

mp4

148.8 MB

Download

References

T. D. Ahle, R. Pagh, I. P. Razenshteyn, and F. Silvestri. 2016. On the Complexity of Inner Product Similarity Join. In Proc. PODS’16. 151–164. Google ScholarDigital Library
A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt. 2015. Practical and optimal LSH for angular distance. In Proc. NIPS ’15. 1225–1233. Google ScholarDigital Library
A. Andoni, P. Indyk, H. L. Nguyen, and I. P. Razenshteyn. 2014. Beyond Locality-Sensitive Hashing. In Proc. SODA ’14. 1018–1028. Google ScholarDigital Library
A. Andoni, T. Laarhoven, I. P. Razenshteyn, and E. Waingarten. 2017. Optimal Hashing-based Time-Space Trade-offs for Approximate Near Neighbors. In Proc. SODA ’17. 47–66. Google ScholarDigital Library
A. Andoni and I. Razenshteyn. 2015.Google Scholar
Optimal Data-Dependent Hashing for Approximate Near Neighbors. In Proc. STOC ’15. 793–801. Google ScholarDigital Library
A. Andoni and I. Razensteyn. 2016. Tight Lower Bounds for Data-Dependent Locality-Sensitive Hashing. In Proc. SoCG ’16. 9:1–9:11.Google Scholar
Arvind Arasu, Venkatesh Ganti, and Raghav Kaushik. 2006. Efficient exact setsimilarity joins. In Proceedings of the 32nd international conference on Very large data bases. VLDB Endowment, 918–929. Google ScholarDigital Library
Roberto J Bayardo, Yiming Ma, and Ramakrishnan Srikant. 2007.Google Scholar
Scaling up all pairs similarity search. In Proceedings of the 16th international conference on World Wide Web. ACM, 131–140. Google ScholarDigital Library
A. Becker, L. Ducas, N. Gama, and T. Laarhoven. 2016. New directions in nearest neighbor searching with applications to lattice sieving. In Proc. SODA ’16. 10–24. Google ScholarDigital Library
Josias Braun-Blanquet. 1932.Google Scholar
Plant sociology. The study of plant communities. McGraw-Hill.Google Scholar
Andrei Z. Broder. 1997. On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997. Proceedings. IEEE, 21–29. Google ScholarDigital Library
Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. 1997. Syntactic clustering of the web. Computer Networks and ISDN Systems 29, 8 (1997), 1157–1166. Google ScholarDigital Library
M. Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proc. STOC ’02. 380–388. Google ScholarDigital Library
F. Chierichetti and R. Kumar. 2015. LSH-Preserving Functions and Their Applications. J. ACM 62, 5 (2015), 33. Google ScholarDigital Library
S. Choi, S. Cha, and C. C. Tappert. 2010.Google Scholar
A survey of binary similarity and distance measures. J. Syst. Cybern. Informatics 8, 1 (2010), 43–48.Google Scholar
T. Christiani. 2017. A Framework for Similarity Search with Space-Time Tradeoffs using Locality-Sensitive Filtering. In Proc. SODA ’17. 31–46. Google ScholarDigital Library
E. Cohen. 1997. Size-estimation framework with applications to transitive closure and reachability. J. Comp. Syst. Sci. 55, 3 (1997), 441–453. Google ScholarDigital Library
E. Cohen and H. Kaplan. 2009.Google Scholar
Leveraging discarded samples for tighter estimation of multiple-set aggregates. ACM SIGMETRICS Performance Evaluation Review 37, 1 (2009), 251–262. Google ScholarDigital Library
M. Dubiner. 2010. Bucketing coding and information theory for the statistical high-dimensional nearest-neighbor problem. IEEE Trans. Information Theory 56, 8 (2010), 4166–4179. Google ScholarDigital Library
T. Hagerup. 1998. Sorting and Searching on the Word RAM. In Proc. STACS ’98. 366–398. Google ScholarDigital Library
S. Har-Peled, P. Indyk, and R. Motwani. 2012. Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality. Theory of computing 8, 1 (2012), 321–350.Google Scholar
W. Hoeffding. 1963. Probability inequalities for sums of bounded random variables. Jour. Am. Stat. Assoc. 58, 301 (1963), 13–30.Google ScholarCross Ref
P. Indyk and R. Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proc. STOC ’98. 604–613. Google ScholarDigital Library
T. Laarhoven. 2015.Google Scholar
Tradeoffs for nearest neighbors on the sphere. CoRR abs/1511.07527 (2015). http://arxiv.org/abs/1511.07527Google Scholar
Ping Li and Arnd Christian König. 2011. Theory and applications of b-bit minwise hashing. Commun. ACM 54, 8 (2011), 101–109. Google ScholarDigital Library
M. Mitzenmacher, R. Pagh, and N. Pham. 2014.Google Scholar
Efficient estimation for high similarities using odd sketches. In Proc. WWW ’14. 109–118. Google ScholarDigital Library
M. Mitzenmacher and E. Upfal. 2005.Google Scholar
Probability and computing. Cambridge University Press, New York, NY.Google Scholar
R. Motwani, A. Naor, and R. Panigrahy. 2007. Lower Bounds on Locality Sensitive Hashing. SIAM J. Discrete Math. 21, 4 (2007), 930–935. Google ScholarDigital Library
Rajeev Motwani and Prabhakar Raghavan. 2010.Google Scholar
Randomized algorithms. Chapman & Hall/CRC.Google Scholar
R. O’Donnell, Y. Wu, and Y. Zhou. 2014.Google Scholar
Optimal lower bounds for localitysensitive hashing (except when q is tiny). ACM Transactions on Computation Theory (TOCT) 6, 1 (2014), 5. Google ScholarDigital Library
Rasmus Pagh, Morten Stöckel, and David P Woodruff. 2014. Is min-wise hashing optimal for summarizing set intersection?. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 109–120. Google ScholarDigital Library
R. Panigrahy, K. Talwar, and U. Wieder. 2010. Lower Bounds on Near Neighbor Search via Metric Expansion. In Proc. FOCS ’10. 805–814. Google ScholarDigital Library
Anshumali Shrivastava and Ping Li. 2015.Google Scholar
Asymmetric minwise hashing for indexing binary inner products and set containment. In Proceedings of the 24th International Conference on World Wide Web. ACM, 981–991. Google ScholarDigital Library
K. Terasawa and Y. Tanaka. 2007.Google Scholar
Spherical LSH for Approximate Nearest Neighbor Search on Unit Hypersphere. In Proc. WADS ’07. 27–38.Google Scholar
Mikkel Thorup. 2013. Bottom-k and priority sampling, set similarity and subset sums with minimal independence. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing. ACM, 371–380. Google ScholarDigital Library
F. Topsoe. 2007.Google Scholar
Some Bounds for the Logarithmic Function. Vol. 4. Nova Science, 137–151.Google Scholar

Index Terms

Set similarity search beyond MinHash
1. Theory of computation
  1. Design and analysis of algorithms
    1. Data structures design and analysis
      1. Sorting and searching
  2. Randomness, geometry and discrete structures

Recommendations

Index Structures for Fast Similarity Search for Binary Vectors

This article reviews index structures for fast similarity search for objects represented by binary vectors (with components equal to 0 or 1). Structures for both exact and approximate search by Hamming distance and other similarity measures are ...
Read More
I/O Efficient Algorithm for c-Approximate Furthest Neighbor Search in High-Dimensional Space
Database Systems for Advanced Applications
Abstract
Furthest Neighbor search in high-dimensional space has been widely used in many applications such as recommendation systems. Because of the “curse of dimensionality” problem, c-approximate furthest neighbor (C-AFN) is a substitute as a trade-off ...
Read More
Index Structures for Fast Similarity Search for Real Vectors. II*

This survey article considers index structures for fast similarity search for objects represented by real-valued vectors. Structures for both exact and faster but approximate similarity search are considered. Index structures based on partitioning into ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
STOC 2017: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing
June 2017
1268 pages
ISBN:9781450345286
DOI:10.1145/3055399
General Chairs:
Hamed Hatami
McGill University
,
Pierre McKenzie
Université de Montréal
,
Program Chair:
Valerie King
University of Victoria
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 19 June 2017
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Braun-Blanquet similarity
Jaccard similarity
Locality-sensitive hashing
MinHash
Near neighbor search
Similarity search
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate1,469of4,586submissions,32%
Upcoming Conference
STOC '24

Sponsor:

sigact

56th Annual ACM Symposium on Theory of Computing (STOC 2024)

June 24 - 28, 2024

Vancouver , BC , Canada
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 21
  Total Citations
  View Citations
- 349
  Total Downloads
- Downloads (Last 12 months)62
- Downloads (Last 6 weeks)8
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Set similarity search beyond MinHash

STOC 2017: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing

ABSTRACT

Supplemental Material

References

Cited By

Index Terms

Recommendations

Index Structures for Fast Similarity Search for Binary Vectors

I/O Efficient Algorithm for c-Approximate Furthest Neighbor Search in High-Dimensional Space

Index Structures for Fast Similarity Search for Real Vectors. II*