ABSTRACT
We consider the problem of approximate set similarity search under Braun-Blanquet similarity B(x, y) = |x ∩ y| / max(|x|, |y|). The (b1, b2)-approximate Braun-Blanquet similarity search problem is to preprocess a collection of sets P such that, given a query set q, if there exists x Ε P with B(q, x) ≥ b1, then we can efficiently return x′ Ε P with B(q, x′) > b2.
We present a simple data structure that solves this problem with space usage O(n1+ρlogn + ∑x ε P|x|) and query time O(|q|nρ logn) where n = |P| and ρ = log(1/b1)/log(1/b2). Making use of existing lower bounds for locality-sensitive hashing by O'Donnell et al. (TOCT 2014) we show that this value of ρ is tight across the parameter space, i.e., for every choice of constants 0 < b2 < b1 < 1.
In the case where all sets have the same size our solution strictly improves upon the value of ρ that can be obtained through the use of state-of-the-art data-independent techniques in the Indyk-Motwani locality-sensitive hashing framework (STOC 1998) such as Broder's MinHash (CCS 1997) for Jaccard similarity and Andoni et al.'s cross-polytope LSH (NIPS 2015) for cosine similarity. Surprisingly, even though our solution is data-independent, for a large part of the parameter space we outperform the currently best data-dependent method by Andoni and Razenshteyn (STOC 2015).
Supplemental Material
- T. D. Ahle, R. Pagh, I. P. Razenshteyn, and F. Silvestri. 2016. On the Complexity of Inner Product Similarity Join. In Proc. PODS’16. 151–164. Google ScholarDigital Library
- A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt. 2015. Practical and optimal LSH for angular distance. In Proc. NIPS ’15. 1225–1233. Google ScholarDigital Library
- A. Andoni, P. Indyk, H. L. Nguyen, and I. P. Razenshteyn. 2014. Beyond Locality-Sensitive Hashing. In Proc. SODA ’14. 1018–1028. Google ScholarDigital Library
- A. Andoni, T. Laarhoven, I. P. Razenshteyn, and E. Waingarten. 2017. Optimal Hashing-based Time-Space Trade-offs for Approximate Near Neighbors. In Proc. SODA ’17. 47–66. Google ScholarDigital Library
- A. Andoni and I. Razenshteyn. 2015.Google Scholar
- Optimal Data-Dependent Hashing for Approximate Near Neighbors. In Proc. STOC ’15. 793–801. Google ScholarDigital Library
- A. Andoni and I. Razensteyn. 2016. Tight Lower Bounds for Data-Dependent Locality-Sensitive Hashing. In Proc. SoCG ’16. 9:1–9:11.Google Scholar
- Arvind Arasu, Venkatesh Ganti, and Raghav Kaushik. 2006. Efficient exact setsimilarity joins. In Proceedings of the 32nd international conference on Very large data bases. VLDB Endowment, 918–929. Google ScholarDigital Library
- Roberto J Bayardo, Yiming Ma, and Ramakrishnan Srikant. 2007.Google Scholar
- Scaling up all pairs similarity search. In Proceedings of the 16th international conference on World Wide Web. ACM, 131–140. Google ScholarDigital Library
- A. Becker, L. Ducas, N. Gama, and T. Laarhoven. 2016. New directions in nearest neighbor searching with applications to lattice sieving. In Proc. SODA ’16. 10–24. Google ScholarDigital Library
- Josias Braun-Blanquet. 1932.Google Scholar
- Plant sociology. The study of plant communities. McGraw-Hill.Google Scholar
- Andrei Z. Broder. 1997. On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997. Proceedings. IEEE, 21–29. Google ScholarDigital Library
- Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. 1997. Syntactic clustering of the web. Computer Networks and ISDN Systems 29, 8 (1997), 1157–1166. Google ScholarDigital Library
- M. Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proc. STOC ’02. 380–388. Google ScholarDigital Library
- F. Chierichetti and R. Kumar. 2015. LSH-Preserving Functions and Their Applications. J. ACM 62, 5 (2015), 33. Google ScholarDigital Library
- S. Choi, S. Cha, and C. C. Tappert. 2010.Google Scholar
- A survey of binary similarity and distance measures. J. Syst. Cybern. Informatics 8, 1 (2010), 43–48.Google Scholar
- T. Christiani. 2017. A Framework for Similarity Search with Space-Time Tradeoffs using Locality-Sensitive Filtering. In Proc. SODA ’17. 31–46. Google ScholarDigital Library
- E. Cohen. 1997. Size-estimation framework with applications to transitive closure and reachability. J. Comp. Syst. Sci. 55, 3 (1997), 441–453. Google ScholarDigital Library
- E. Cohen and H. Kaplan. 2009.Google Scholar
- Leveraging discarded samples for tighter estimation of multiple-set aggregates. ACM SIGMETRICS Performance Evaluation Review 37, 1 (2009), 251–262. Google ScholarDigital Library
- M. Dubiner. 2010. Bucketing coding and information theory for the statistical high-dimensional nearest-neighbor problem. IEEE Trans. Information Theory 56, 8 (2010), 4166–4179. Google ScholarDigital Library
- T. Hagerup. 1998. Sorting and Searching on the Word RAM. In Proc. STACS ’98. 366–398. Google ScholarDigital Library
- S. Har-Peled, P. Indyk, and R. Motwani. 2012. Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality. Theory of computing 8, 1 (2012), 321–350.Google Scholar
- W. Hoeffding. 1963. Probability inequalities for sums of bounded random variables. Jour. Am. Stat. Assoc. 58, 301 (1963), 13–30.Google ScholarCross Ref
- P. Indyk and R. Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proc. STOC ’98. 604–613. Google ScholarDigital Library
- T. Laarhoven. 2015.Google Scholar
- Tradeoffs for nearest neighbors on the sphere. CoRR abs/1511.07527 (2015). http://arxiv.org/abs/1511.07527Google Scholar
- Ping Li and Arnd Christian König. 2011. Theory and applications of b-bit minwise hashing. Commun. ACM 54, 8 (2011), 101–109. Google ScholarDigital Library
- M. Mitzenmacher, R. Pagh, and N. Pham. 2014.Google Scholar
- Efficient estimation for high similarities using odd sketches. In Proc. WWW ’14. 109–118. Google ScholarDigital Library
- M. Mitzenmacher and E. Upfal. 2005.Google Scholar
- Probability and computing. Cambridge University Press, New York, NY.Google Scholar
- R. Motwani, A. Naor, and R. Panigrahy. 2007. Lower Bounds on Locality Sensitive Hashing. SIAM J. Discrete Math. 21, 4 (2007), 930–935. Google ScholarDigital Library
- Rajeev Motwani and Prabhakar Raghavan. 2010.Google Scholar
- Randomized algorithms. Chapman & Hall/CRC.Google Scholar
- R. O’Donnell, Y. Wu, and Y. Zhou. 2014.Google Scholar
- Optimal lower bounds for localitysensitive hashing (except when q is tiny). ACM Transactions on Computation Theory (TOCT) 6, 1 (2014), 5. Google ScholarDigital Library
- Rasmus Pagh, Morten Stöckel, and David P Woodruff. 2014. Is min-wise hashing optimal for summarizing set intersection?. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 109–120. Google ScholarDigital Library
- R. Panigrahy, K. Talwar, and U. Wieder. 2010. Lower Bounds on Near Neighbor Search via Metric Expansion. In Proc. FOCS ’10. 805–814. Google ScholarDigital Library
- Anshumali Shrivastava and Ping Li. 2015.Google Scholar
- Asymmetric minwise hashing for indexing binary inner products and set containment. In Proceedings of the 24th International Conference on World Wide Web. ACM, 981–991. Google ScholarDigital Library
- K. Terasawa and Y. Tanaka. 2007.Google Scholar
- Spherical LSH for Approximate Nearest Neighbor Search on Unit Hypersphere. In Proc. WADS ’07. 27–38.Google Scholar
- Mikkel Thorup. 2013. Bottom-k and priority sampling, set similarity and subset sums with minimal independence. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing. ACM, 371–380. Google ScholarDigital Library
- F. Topsoe. 2007.Google Scholar
- Some Bounds for the Logarithmic Function. Vol. 4. Nova Science, 137–151.Google Scholar
Index Terms
- Set similarity search beyond MinHash
Recommendations
Index Structures for Fast Similarity Search for Binary Vectors
This article reviews index structures for fast similarity search for objects represented by binary vectors (with components equal to 0 or 1). Structures for both exact and approximate search by Hamming distance and other similarity measures are ...
I/O Efficient Algorithm for c-Approximate Furthest Neighbor Search in High-Dimensional Space
Database Systems for Advanced ApplicationsAbstractFurthest Neighbor search in high-dimensional space has been widely used in many applications such as recommendation systems. Because of the “curse of dimensionality” problem, c-approximate furthest neighbor (C-AFN) is a substitute as a trade-off ...
Index Structures for Fast Similarity Search for Real Vectors. II*
This survey article considers index structures for fast similarity search for objects represented by real-valued vectors. Structures for both exact and faster but approximate similarity search are considered. Index structures based on partitioning into ...
Comments