skip to main content
10.1145/509907.509965acmconferencesArticle/Chapter ViewAbstractPublication PagesstocConference Proceedingsconference-collections
Article

Similarity estimation techniques from rounding algorithms

Published:19 May 2002Publication History

ABSTRACT

(MATH) A locality sensitive hashing scheme is a distribution on a family $\F$ of hash functions operating on a collection of objects, such that for two objects x,y, PrhεF[h(x) = h(y)] = sim(x,y), where sim(x,y) ε [0,1] is some similarity function defined on the collection of objects. Such a scheme leads to a compact representation of objects so that similarity of objects can be estimated from their compact sketches, and also leads to efficient algorithms for approximate nearest neighbor search and clustering. Min-wise independent permutations provide an elegant construction of such a locality sensitive hashing scheme for a collection of subsets with the set similarity measure sim(A,B) = \frac{|A ∩ B|}{|A ∪ B|}.(MATH) We show that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects. Based on this insight, we construct new locality sensitive hashing schemes for:

  • A collection of vectors with the distance between → \over u and → \over v measured by Ø(→ \over u, → \over v)/π, where Ø(→ \over u, → \over v) is the angle between → \over u) and → \over v). This yields a sketching scheme for estimating the cosine similarity measure between two vectors, as well as a simple alternative to minwise independent permutations for estimating set similarity.

  • A collection of distributions on n points in a metric space, with distance between distributions measured by the Earth Mover Distance (EMD), (a popular distance measure in graphics and vision). Our hash functions map distributions to points in the metric space such that, for distributions P and Q, EMD(P,Q) ≤ Ehε\F [d(h(P),h(Q))] ≤ O(log n log log n). EMD(P, Q).

References

  1. N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy. Tracking Join and Self-Join Sizes in Limited Storage. Proc. 18th PODS pp. 10-20, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. N. Alon, Y. Matias, and M. Szegedy. The Space Complexity of Approximating the Frequency Moments. JCSS 58(1): 137-147, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Y. Bartal. Probabilistic approximation of metric spaces and its algorithmic application. Proc. 37th FOCS, pages 184--193, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Y. Bartal. On approximating arbitrary metrics by tree metrics. In Proc. 30th STOC, pages 161--168, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. A. Z. Broder. On the resemblance and containment of documents. Proc. Compression and Complexity of SEQUENCES, pp. 21--29. IEEE Computer Society, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. A. Z. Broder. Filtering near-duplicate documents. Proc. FUN 98, 1998.Google ScholarGoogle Scholar
  7. A. Z. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-wise independent permutations. Proc. 30th STOC, pp. 327--336, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the Web. Proc. 6th Int'l World Wide Web Conference, pp. 391--404, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. A. Z. Broder, R. Krauthgamer, and M. Mitzenmacher. Improved classification via connectivity information. Proc. 11th SODA, pp. 576--585, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. G. Calinescu, H. J. Karloff, and Y. Rabani. Approximation algorithms for the 0-extension problem. Proc. 11th SODA, pp. 8--16, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. C. Chekuri, S. Khanna, J. Naor, and L. Zosin. Approximation algorithms for the metric labeling problem via a new linear programming formulation. Proc. 12th SODA, pp. 109--118, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R. T. Ng, and D. Srivastava. Counting Twig Matches in a Tree. Proc. 17th ICDE pp. 595--604. 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Z. Chen, F. Korn, N. Koudas, and S. Muthukrishnan. Selectivity Estimation for Boolean Queries. Proc. 19th PODS, pp. 216--225, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman, and C. Yang. Finding Interesting Associations without Support Pruning. Proc. 16th ICDE pp. 489--499, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. S. Cohen and L. Guibas. The Earth Mover's Distance under Transformation Sets. Proc. 7th IEEE Intnl. Conf. Computer Vision, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. S. Cohen and L. Guibas. The Earth Mover's Distance: Lower Bounds and Invariance under Translation. Tech. report STAN-CS-TR-97-1597, Dept. of Computer Science, Stanford University, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. L. Engebretsen, P. Indyk and R. O'Donnell. Derandomized dimensionality reduction with applications. To appear in Proc. 13th SODA, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. P. B. Gibbons and Y. Matias. Synopsis Data Structures for Massive Data Sets. Proc. 10th SODA pp. 909--910, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries. Proc. 27th VLDB pp. 79--88, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. QuickSAND: Quick Summary and Analysis of Network Data. DIMACS Technical Report 2001-43, November 2001.Google ScholarGoogle Scholar
  21. A. C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. Fast, Small-Space Algorithms for Approximate Histogram Maintenance. these proceedings. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. A. Gionis, D. Gunopulos, and N. Koudas. Efficient and Tunable Similar Set Retrieval. Proc. SIGMOD Conference 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. Proc. 25th VLDB} pp. 518--529, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. M. X. Goemans and D. P. Williamson. Improved Approximation Algorithms for Maximum Cut and Satisfiability Problems Using Semidefinite Programming. JACM 42(6): 1115--1145, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. Proc. 41st FOCS, pp. 359--366, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. A. Gupta and Éva Tardos. A constant factor approximation algorithm for a class of classification problems. Proc. 32nd STOC, pp. 652--658, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. T. H. Haveliwala, A. Gionis, and P. Indyk. Scalable Techniques for Clustering the Web. Proc. 3rd WebDB, pp. 129--134, 2000.Google ScholarGoogle Scholar
  28. P. Indyk. A small approximately min-wise independent family of hash functions. Proc. 10th SODA, pp. 454--456, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. P. Indyk. On approximate nearest neighbors in non-Euclidean spaces. Proc. 40th FOCS, pp. 148--155, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. P. Indyk. Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation. Proc. 41st FOCS, 189-197, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Indyk, P., Motwani, R. Approximate nearest neighbors: towards removing the curse of dimensionality. Proc. 30th STOC pp. 604--613, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. P. Indyk, R. Motwani, P. Raghavan, and S. Vempala. Locality-Preserving Hashing in Multidimensional Spaces. Proc. 29th STOC, pp. 618--625, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. J. M. Kleinberg and Éva Tardos Approximation Algorithms for Classification Problems with Pairwise Relationships: Metric Labeling and Markov Random Fields. Proc. 40th FOCS, pp. 14--23, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. N. Linial and O. Sasson. Non-Expansive Hashing. Combinatorica 18(1): 121--132, 1998.Google ScholarGoogle ScholarCross RefCross Ref
  35. N. Nisan. Pseudorandom sequences for space bounded computations. Combinatorica, 12:449--461, 1992.Google ScholarGoogle ScholarCross RefCross Ref
  36. Y. Rubner. Perceptual Metrics for Image Database Navigation. Phd Thesis, Stanford University, May 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Y. Rubner, L. J. Guibas, and C. Tomasi. The Earth Mover's Distance, Multi-Dimensional Scaling, and Color-Based Image Retrieval. Proc. of the ARPA Image Understanding Workshop, pp. 661--668, 1997.Google ScholarGoogle Scholar
  38. Y. Rubner, C. Tomasi. Texture Metrics. Proc. IEEE International Conference on Systems, Man, and Cybernetics, 1998, pp. 4601--4607.Google ScholarGoogle Scholar
  39. Y. Rubner, C. Tomasi, and L. J. Guibas. A Metric for Distributions with Applications to Image Databases. Proc. IEEE Int. Conf. on Computer Vision, pp. 59--66, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Y. Rubner, C. Tomasi, and L. J. Guibas. The Earth Mover's Distance as a Metric for Image Retrieval. Tech. Report STAN-CS-TN-98-86, Dept. of Computer Science, Stanford University, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. M. Ruzon and C. Tomasi. Color edge detection with the compass operator. Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2:160--166, 1999.Google ScholarGoogle ScholarCross RefCross Ref
  42. M. Ruzon and C. Tomasi. Corner detection in textured color images. Proc. IEEE Int. Conf. Computer Vision, 2:1039--1045, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Similarity estimation techniques from rounding algorithms

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          STOC '02: Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
          May 2002
          840 pages
          ISBN:1581134959
          DOI:10.1145/509907

          Copyright © 2002 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 19 May 2002

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • Article

          Acceptance Rates

          STOC '02 Paper Acceptance Rate91of287submissions,32%Overall Acceptance Rate1,469of4,586submissions,32%

          Upcoming Conference

          STOC '24
          56th Annual ACM Symposium on Theory of Computing (STOC 2024)
          June 24 - 28, 2024
          Vancouver , BC , Canada

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader