Article

Similarity estimation techniques from rounding algorithms

Author:
Moses S. Charikar

Princeton University, Princeton, NJ

Princeton University, Princeton, NJ
View Profile

STOC '02: Proceedings of the thiry-fourth annual ACM symposium on Theory of computingMay 2002Pages 380–388https://doi.org/10.1145/509907.509965

Published:19 May 2002Publication History

STOC '02: Proceedings of the thiry-fourth annual ACM symposium on Theory of computing

Pages 380–388

ABSTRACT

(MATH) A locality sensitive hashing scheme is a distribution on a family $\F$ of hash functions operating on a collection of objects, such that for two objects x,y, Pr_hεF[h(x) = h(y)] = sim(x,y), where sim(x,y) ε [0,1] is some similarity function defined on the collection of objects. Such a scheme leads to a compact representation of objects so that similarity of objects can be estimated from their compact sketches, and also leads to efficient algorithms for approximate nearest neighbor search and clustering. Min-wise independent permutations provide an elegant construction of such a locality sensitive hashing scheme for a collection of subsets with the set similarity measure sim(A,B) = \frac{|A ∩ B|}{|A ∪ B|}.(MATH) We show that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects. Based on this insight, we construct new locality sensitive hashing schemes for:

A collection of vectors with the distance between → \over u and → \over v measured by Ø(→ \over u, → \over v)/π, where Ø(→ \over u, → \over v) is the angle between → \over u) and → \over v). This yields a sketching scheme for estimating the cosine similarity measure between two vectors, as well as a simple alternative to minwise independent permutations for estimating set similarity.
A collection of distributions on n points in a metric space, with distance between distributions measured by the Earth Mover Distance (EMD), (a popular distance measure in graphics and vision). Our hash functions map distributions to points in the metric space such that, for distributions P and Q, EMD(P,Q) ≤ E_hε\F [d(h(P),h(Q))] ≤ O(log n log log n). EMD(P, Q).

References

N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy. Tracking Join and Self-Join Sizes in Limited Storage. Proc. 18th PODS pp. 10-20, 1999. Google ScholarDigital Library
N. Alon, Y. Matias, and M. Szegedy. The Space Complexity of Approximating the Frequency Moments. JCSS 58(1): 137-147, 1999. Google ScholarDigital Library
Y. Bartal. Probabilistic approximation of metric spaces and its algorithmic application. Proc. 37th FOCS, pages 184--193, 1996. Google ScholarDigital Library
Y. Bartal. On approximating arbitrary metrics by tree metrics. In Proc. 30th STOC, pages 161--168, 1998. Google ScholarDigital Library
A. Z. Broder. On the resemblance and containment of documents. Proc. Compression and Complexity of SEQUENCES, pp. 21--29. IEEE Computer Society, 1997. Google ScholarDigital Library
A. Z. Broder. Filtering near-duplicate documents. Proc. FUN 98, 1998.Google Scholar
A. Z. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-wise independent permutations. Proc. 30th STOC, pp. 327--336, 1998. Google ScholarDigital Library
A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the Web. Proc. 6th Int'l World Wide Web Conference, pp. 391--404, 1997. Google ScholarDigital Library
A. Z. Broder, R. Krauthgamer, and M. Mitzenmacher. Improved classification via connectivity information. Proc. 11th SODA, pp. 576--585, 2000. Google ScholarDigital Library
G. Calinescu, H. J. Karloff, and Y. Rabani. Approximation algorithms for the 0-extension problem. Proc. 11th SODA, pp. 8--16, 2000. Google ScholarDigital Library
C. Chekuri, S. Khanna, J. Naor, and L. Zosin. Approximation algorithms for the metric labeling problem via a new linear programming formulation. Proc. 12th SODA, pp. 109--118, 2001. Google ScholarDigital Library
Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R. T. Ng, and D. Srivastava. Counting Twig Matches in a Tree. Proc. 17th ICDE pp. 595--604. 2001. Google ScholarDigital Library
Z. Chen, F. Korn, N. Koudas, and S. Muthukrishnan. Selectivity Estimation for Boolean Queries. Proc. 19th PODS, pp. 216--225, 2000. Google ScholarDigital Library
E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman, and C. Yang. Finding Interesting Associations without Support Pruning. Proc. 16th ICDE pp. 489--499, 2000. Google ScholarDigital Library
S. Cohen and L. Guibas. The Earth Mover's Distance under Transformation Sets. Proc. 7th IEEE Intnl. Conf. Computer Vision, 1999. Google ScholarDigital Library
S. Cohen and L. Guibas. The Earth Mover's Distance: Lower Bounds and Invariance under Translation. Tech. report STAN-CS-TR-97-1597, Dept. of Computer Science, Stanford University, 1997. Google ScholarDigital Library
L. Engebretsen, P. Indyk and R. O'Donnell. Derandomized dimensionality reduction with applications. To appear in Proc. 13th SODA, 2002. Google ScholarDigital Library
P. B. Gibbons and Y. Matias. Synopsis Data Structures for Massive Data Sets. Proc. 10th SODA pp. 909--910, 1999. Google ScholarDigital Library
A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries. Proc. 27th VLDB pp. 79--88, 2001. Google ScholarDigital Library
A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. QuickSAND: Quick Summary and Analysis of Network Data. DIMACS Technical Report 2001-43, November 2001.Google Scholar
A. C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. Fast, Small-Space Algorithms for Approximate Histogram Maintenance. these proceedings. Google ScholarDigital Library
A. Gionis, D. Gunopulos, and N. Koudas. Efficient and Tunable Similar Set Retrieval. Proc. SIGMOD Conference 2001. Google ScholarDigital Library
A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. Proc. 25th VLDB} pp. 518--529, 1999. Google ScholarDigital Library
M. X. Goemans and D. P. Williamson. Improved Approximation Algorithms for Maximum Cut and Satisfiability Problems Using Semidefinite Programming. JACM 42(6): 1115--1145, 1995. Google ScholarDigital Library
S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. Proc. 41st FOCS, pp. 359--366, 2000. Google ScholarDigital Library
A. Gupta and Éva Tardos. A constant factor approximation algorithm for a class of classification problems. Proc. 32nd STOC, pp. 652--658, 2000. Google ScholarDigital Library
T. H. Haveliwala, A. Gionis, and P. Indyk. Scalable Techniques for Clustering the Web. Proc. 3rd WebDB, pp. 129--134, 2000.Google Scholar
P. Indyk. A small approximately min-wise independent family of hash functions. Proc. 10th SODA, pp. 454--456, 1999. Google ScholarDigital Library
P. Indyk. On approximate nearest neighbors in non-Euclidean spaces. Proc. 40th FOCS, pp. 148--155, 1999. Google ScholarDigital Library
P. Indyk. Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation. Proc. 41st FOCS, 189-197, 2000. Google ScholarDigital Library
Indyk, P., Motwani, R. Approximate nearest neighbors: towards removing the curse of dimensionality. Proc. 30th STOC pp. 604--613, 1998. Google ScholarDigital Library
P. Indyk, R. Motwani, P. Raghavan, and S. Vempala. Locality-Preserving Hashing in Multidimensional Spaces. Proc. 29th STOC, pp. 618--625, 1997. Google ScholarDigital Library
J. M. Kleinberg and Éva Tardos Approximation Algorithms for Classification Problems with Pairwise Relationships: Metric Labeling and Markov Random Fields. Proc. 40th FOCS, pp. 14--23, 1999. Google ScholarDigital Library
N. Linial and O. Sasson. Non-Expansive Hashing. Combinatorica 18(1): 121--132, 1998.Google ScholarCross Ref
N. Nisan. Pseudorandom sequences for space bounded computations. Combinatorica, 12:449--461, 1992.Google ScholarCross Ref
Y. Rubner. Perceptual Metrics for Image Database Navigation. Phd Thesis, Stanford University, May 1999. Google ScholarDigital Library
Y. Rubner, L. J. Guibas, and C. Tomasi. The Earth Mover's Distance, Multi-Dimensional Scaling, and Color-Based Image Retrieval. Proc. of the ARPA Image Understanding Workshop, pp. 661--668, 1997.Google Scholar
Y. Rubner, C. Tomasi. Texture Metrics. Proc. IEEE International Conference on Systems, Man, and Cybernetics, 1998, pp. 4601--4607.Google Scholar
Y. Rubner, C. Tomasi, and L. J. Guibas. A Metric for Distributions with Applications to Image Databases. Proc. IEEE Int. Conf. on Computer Vision, pp. 59--66, 1998. Google ScholarDigital Library
Y. Rubner, C. Tomasi, and L. J. Guibas. The Earth Mover's Distance as a Metric for Image Retrieval. Tech. Report STAN-CS-TN-98-86, Dept. of Computer Science, Stanford University, 1998. Google ScholarDigital Library
M. Ruzon and C. Tomasi. Color edge detection with the compass operator. Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2:160--166, 1999.Google ScholarCross Ref
M. Ruzon and C. Tomasi. Corner detection in textured color images. Proc. IEEE Int. Conf. Computer Vision, 2:1039--1045, 1999. Google ScholarDigital Library

Index Terms

Similarity estimation techniques from rounding algorithms

Recommendations

Learning similarity with cosine similarity ensemble

This paper proposes a cosine similarity ensemble (CSE) method to learn similarity.CSE is a selective ensemble and combines multiple cosine similarity learners.A learner redefines the pattern vectors and determines its threshold adaptively.Experimental ...
Read More
Distance Learning for Similarity Estimation

In this paper, we present a general guideline to find a better distance measure for similarity estimation based on statistical analysis of distribution models and distance functions. A new set of distance measures are derived from the harmonic distance, ...
Read More
Improving performance of similarity measures for uncertain time series using preprocessing techniques
SSDBM '15: Proceedings of the 27th International Conference on Scientific and Statistical Database Management

We study the impact of preprocessing techniques on performance and effectiveness of the similarity measures for uncertain time series. Some existing work on uncertain time series use the same similarity measures developed for standard time series, to ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
STOC '02: Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
May 2002
840 pages
ISBN:1581134959
DOI:10.1145/509907
Conference Chair:
John Reif
Duke University
Copyright © 2002 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 19 May 2002
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- Article
Conference

Acceptance Rates
STOC '02 Paper Acceptance Rate91of287submissions,32%Overall Acceptance Rate1,469of4,586submissions,32%
More
Upcoming Conference
STOC '24

Sponsor:

sigact

56th Annual ACM Symposium on Theory of Computing (STOC 2024)

June 24 - 28, 2024

Vancouver , BC , Canada
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 1,015
  Total Citations
  View Citations
- 1,974
  Total Downloads
- Downloads (Last 12 months)352
- Downloads (Last 6 weeks)39
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Similarity estimation techniques from rounding algorithms

STOC '02: Proceedings of the thiry-fourth annual ACM symposium on Theory of computing

ABSTRACT

References

Cited By

Index Terms

Recommendations

Learning similarity with cosine similarity ensemble

Distance Learning for Similarity Estimation

Improving performance of similarity measures for uncertain time series using preprocessing techniques