skip to main content
10.1145/3055399.3055443acmconferencesArticle/Chapter ViewAbstractPublication PagesstocConference Proceedingsconference-collections
research-article

Set similarity search beyond MinHash

Published:19 June 2017Publication History

ABSTRACT

We consider the problem of approximate set similarity search under Braun-Blanquet similarity B(x, y) = |xy| / max(|x|, |y|). The (b1, b2)-approximate Braun-Blanquet similarity search problem is to preprocess a collection of sets P such that, given a query set q, if there exists x Ε P with B(q, x) ≥ b1, then we can efficiently return x′ Ε P with B(q, x′) > b2.

We present a simple data structure that solves this problem with space usage O(n1+ρlogn + ∑x ε P|x|) and query time O(|q|nρ logn) where n = |P| and ρ = log(1/b1)/log(1/b2). Making use of existing lower bounds for locality-sensitive hashing by O'Donnell et al. (TOCT 2014) we show that this value of ρ is tight across the parameter space, i.e., for every choice of constants 0 < b2 < b1 < 1.

In the case where all sets have the same size our solution strictly improves upon the value of ρ that can be obtained through the use of state-of-the-art data-independent techniques in the Indyk-Motwani locality-sensitive hashing framework (STOC 1998) such as Broder's MinHash (CCS 1997) for Jaccard similarity and Andoni et al.'s cross-polytope LSH (NIPS 2015) for cosine similarity. Surprisingly, even though our solution is data-independent, for a large part of the parameter space we outperform the currently best data-dependent method by Andoni and Razenshteyn (STOC 2015).

Skip Supplemental Material Section

Supplemental Material

d4_sc_t2.mp4

mp4

148.8 MB

References

  1. T. D. Ahle, R. Pagh, I. P. Razenshteyn, and F. Silvestri. 2016. On the Complexity of Inner Product Similarity Join. In Proc. PODS’16. 151–164. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt. 2015. Practical and optimal LSH for angular distance. In Proc. NIPS ’15. 1225–1233. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. A. Andoni, P. Indyk, H. L. Nguyen, and I. P. Razenshteyn. 2014. Beyond Locality-Sensitive Hashing. In Proc. SODA ’14. 1018–1028. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. A. Andoni, T. Laarhoven, I. P. Razenshteyn, and E. Waingarten. 2017. Optimal Hashing-based Time-Space Trade-offs for Approximate Near Neighbors. In Proc. SODA ’17. 47–66. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. A. Andoni and I. Razenshteyn. 2015.Google ScholarGoogle Scholar
  6. Optimal Data-Dependent Hashing for Approximate Near Neighbors. In Proc. STOC ’15. 793–801. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. A. Andoni and I. Razensteyn. 2016. Tight Lower Bounds for Data-Dependent Locality-Sensitive Hashing. In Proc. SoCG ’16. 9:1–9:11.Google ScholarGoogle Scholar
  8. Arvind Arasu, Venkatesh Ganti, and Raghav Kaushik. 2006. Efficient exact setsimilarity joins. In Proceedings of the 32nd international conference on Very large data bases. VLDB Endowment, 918–929. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Roberto J Bayardo, Yiming Ma, and Ramakrishnan Srikant. 2007.Google ScholarGoogle Scholar
  10. Scaling up all pairs similarity search. In Proceedings of the 16th international conference on World Wide Web. ACM, 131–140. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. A. Becker, L. Ducas, N. Gama, and T. Laarhoven. 2016. New directions in nearest neighbor searching with applications to lattice sieving. In Proc. SODA ’16. 10–24. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Josias Braun-Blanquet. 1932.Google ScholarGoogle Scholar
  13. Plant sociology. The study of plant communities. McGraw-Hill.Google ScholarGoogle Scholar
  14. Andrei Z. Broder. 1997. On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997. Proceedings. IEEE, 21–29. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. 1997. Syntactic clustering of the web. Computer Networks and ISDN Systems 29, 8 (1997), 1157–1166. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. M. Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proc. STOC ’02. 380–388. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. F. Chierichetti and R. Kumar. 2015. LSH-Preserving Functions and Their Applications. J. ACM 62, 5 (2015), 33. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. S. Choi, S. Cha, and C. C. Tappert. 2010.Google ScholarGoogle Scholar
  19. A survey of binary similarity and distance measures. J. Syst. Cybern. Informatics 8, 1 (2010), 43–48.Google ScholarGoogle Scholar
  20. T. Christiani. 2017. A Framework for Similarity Search with Space-Time Tradeoffs using Locality-Sensitive Filtering. In Proc. SODA ’17. 31–46. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. E. Cohen. 1997. Size-estimation framework with applications to transitive closure and reachability. J. Comp. Syst. Sci. 55, 3 (1997), 441–453. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. E. Cohen and H. Kaplan. 2009.Google ScholarGoogle Scholar
  23. Leveraging discarded samples for tighter estimation of multiple-set aggregates. ACM SIGMETRICS Performance Evaluation Review 37, 1 (2009), 251–262. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. M. Dubiner. 2010. Bucketing coding and information theory for the statistical high-dimensional nearest-neighbor problem. IEEE Trans. Information Theory 56, 8 (2010), 4166–4179. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. T. Hagerup. 1998. Sorting and Searching on the Word RAM. In Proc. STACS ’98. 366–398. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. S. Har-Peled, P. Indyk, and R. Motwani. 2012. Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality. Theory of computing 8, 1 (2012), 321–350.Google ScholarGoogle Scholar
  27. W. Hoeffding. 1963. Probability inequalities for sums of bounded random variables. Jour. Am. Stat. Assoc. 58, 301 (1963), 13–30.Google ScholarGoogle ScholarCross RefCross Ref
  28. P. Indyk and R. Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proc. STOC ’98. 604–613. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. T. Laarhoven. 2015.Google ScholarGoogle Scholar
  30. Tradeoffs for nearest neighbors on the sphere. CoRR abs/1511.07527 (2015). http://arxiv.org/abs/1511.07527Google ScholarGoogle Scholar
  31. Ping Li and Arnd Christian König. 2011. Theory and applications of b-bit minwise hashing. Commun. ACM 54, 8 (2011), 101–109. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. M. Mitzenmacher, R. Pagh, and N. Pham. 2014.Google ScholarGoogle Scholar
  33. Efficient estimation for high similarities using odd sketches. In Proc. WWW ’14. 109–118. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. M. Mitzenmacher and E. Upfal. 2005.Google ScholarGoogle Scholar
  35. Probability and computing. Cambridge University Press, New York, NY.Google ScholarGoogle Scholar
  36. R. Motwani, A. Naor, and R. Panigrahy. 2007. Lower Bounds on Locality Sensitive Hashing. SIAM J. Discrete Math. 21, 4 (2007), 930–935. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Rajeev Motwani and Prabhakar Raghavan. 2010.Google ScholarGoogle Scholar
  38. Randomized algorithms. Chapman &amp; Hall/CRC.Google ScholarGoogle Scholar
  39. R. O’Donnell, Y. Wu, and Y. Zhou. 2014.Google ScholarGoogle Scholar
  40. Optimal lower bounds for localitysensitive hashing (except when q is tiny). ACM Transactions on Computation Theory (TOCT) 6, 1 (2014), 5. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Rasmus Pagh, Morten Stöckel, and David P Woodruff. 2014. Is min-wise hashing optimal for summarizing set intersection?. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 109–120. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. R. Panigrahy, K. Talwar, and U. Wieder. 2010. Lower Bounds on Near Neighbor Search via Metric Expansion. In Proc. FOCS ’10. 805–814. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Anshumali Shrivastava and Ping Li. 2015.Google ScholarGoogle Scholar
  44. Asymmetric minwise hashing for indexing binary inner products and set containment. In Proceedings of the 24th International Conference on World Wide Web. ACM, 981–991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. K. Terasawa and Y. Tanaka. 2007.Google ScholarGoogle Scholar
  46. Spherical LSH for Approximate Nearest Neighbor Search on Unit Hypersphere. In Proc. WADS ’07. 27–38.Google ScholarGoogle Scholar
  47. Mikkel Thorup. 2013. Bottom-k and priority sampling, set similarity and subset sums with minimal independence. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing. ACM, 371–380. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. F. Topsoe. 2007.Google ScholarGoogle Scholar
  49. Some Bounds for the Logarithmic Function. Vol. 4. Nova Science, 137–151.Google ScholarGoogle Scholar

Index Terms

  1. Set similarity search beyond MinHash

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        STOC 2017: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing
        June 2017
        1268 pages
        ISBN:9781450345286
        DOI:10.1145/3055399

        Copyright © 2017 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 19 June 2017

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        Overall Acceptance Rate1,469of4,586submissions,32%

        Upcoming Conference

        STOC '24
        56th Annual ACM Symposium on Theory of Computing (STOC 2024)
        June 24 - 28, 2024
        Vancouver , BC , Canada

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader