ABSTRACT
(MATH) A locality sensitive hashing scheme is a distribution on a family $\F$ of hash functions operating on a collection of objects, such that for two objects x,y, PrhεF[h(x) = h(y)] = sim(x,y), where sim(x,y) ε [0,1] is some similarity function defined on the collection of objects. Such a scheme leads to a compact representation of objects so that similarity of objects can be estimated from their compact sketches, and also leads to efficient algorithms for approximate nearest neighbor search and clustering. Min-wise independent permutations provide an elegant construction of such a locality sensitive hashing scheme for a collection of subsets with the set similarity measure sim(A,B) = \frac{|A ∩ B|}{|A ∪ B|}.(MATH) We show that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects. Based on this insight, we construct new locality sensitive hashing schemes for:
A collection of vectors with the distance between → \over u and → \over v measured by Ø(→ \over u, → \over v)/π, where Ø(→ \over u, → \over v) is the angle between → \over u) and → \over v). This yields a sketching scheme for estimating the cosine similarity measure between two vectors, as well as a simple alternative to minwise independent permutations for estimating set similarity.
A collection of distributions on n points in a metric space, with distance between distributions measured by the Earth Mover Distance (EMD), (a popular distance measure in graphics and vision). Our hash functions map distributions to points in the metric space such that, for distributions P and Q, EMD(P,Q) ≤ Ehε\F [d(h(P),h(Q))] ≤ O(log n log log n). EMD(P, Q).
- N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy. Tracking Join and Self-Join Sizes in Limited Storage. Proc. 18th PODS pp. 10-20, 1999. Google ScholarDigital Library
- N. Alon, Y. Matias, and M. Szegedy. The Space Complexity of Approximating the Frequency Moments. JCSS 58(1): 137-147, 1999. Google ScholarDigital Library
- Y. Bartal. Probabilistic approximation of metric spaces and its algorithmic application. Proc. 37th FOCS, pages 184--193, 1996. Google ScholarDigital Library
- Y. Bartal. On approximating arbitrary metrics by tree metrics. In Proc. 30th STOC, pages 161--168, 1998. Google ScholarDigital Library
- A. Z. Broder. On the resemblance and containment of documents. Proc. Compression and Complexity of SEQUENCES, pp. 21--29. IEEE Computer Society, 1997. Google ScholarDigital Library
- A. Z. Broder. Filtering near-duplicate documents. Proc. FUN 98, 1998.Google Scholar
- A. Z. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-wise independent permutations. Proc. 30th STOC, pp. 327--336, 1998. Google ScholarDigital Library
- A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the Web. Proc. 6th Int'l World Wide Web Conference, pp. 391--404, 1997. Google ScholarDigital Library
- A. Z. Broder, R. Krauthgamer, and M. Mitzenmacher. Improved classification via connectivity information. Proc. 11th SODA, pp. 576--585, 2000. Google ScholarDigital Library
- G. Calinescu, H. J. Karloff, and Y. Rabani. Approximation algorithms for the 0-extension problem. Proc. 11th SODA, pp. 8--16, 2000. Google ScholarDigital Library
- C. Chekuri, S. Khanna, J. Naor, and L. Zosin. Approximation algorithms for the metric labeling problem via a new linear programming formulation. Proc. 12th SODA, pp. 109--118, 2001. Google ScholarDigital Library
- Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R. T. Ng, and D. Srivastava. Counting Twig Matches in a Tree. Proc. 17th ICDE pp. 595--604. 2001. Google ScholarDigital Library
- Z. Chen, F. Korn, N. Koudas, and S. Muthukrishnan. Selectivity Estimation for Boolean Queries. Proc. 19th PODS, pp. 216--225, 2000. Google ScholarDigital Library
- E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman, and C. Yang. Finding Interesting Associations without Support Pruning. Proc. 16th ICDE pp. 489--499, 2000. Google ScholarDigital Library
- S. Cohen and L. Guibas. The Earth Mover's Distance under Transformation Sets. Proc. 7th IEEE Intnl. Conf. Computer Vision, 1999. Google ScholarDigital Library
- S. Cohen and L. Guibas. The Earth Mover's Distance: Lower Bounds and Invariance under Translation. Tech. report STAN-CS-TR-97-1597, Dept. of Computer Science, Stanford University, 1997. Google ScholarDigital Library
- L. Engebretsen, P. Indyk and R. O'Donnell. Derandomized dimensionality reduction with applications. To appear in Proc. 13th SODA, 2002. Google ScholarDigital Library
- P. B. Gibbons and Y. Matias. Synopsis Data Structures for Massive Data Sets. Proc. 10th SODA pp. 909--910, 1999. Google ScholarDigital Library
- A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries. Proc. 27th VLDB pp. 79--88, 2001. Google ScholarDigital Library
- A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. QuickSAND: Quick Summary and Analysis of Network Data. DIMACS Technical Report 2001-43, November 2001.Google Scholar
- A. C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. Fast, Small-Space Algorithms for Approximate Histogram Maintenance. these proceedings. Google ScholarDigital Library
- A. Gionis, D. Gunopulos, and N. Koudas. Efficient and Tunable Similar Set Retrieval. Proc. SIGMOD Conference 2001. Google ScholarDigital Library
- A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. Proc. 25th VLDB} pp. 518--529, 1999. Google ScholarDigital Library
- M. X. Goemans and D. P. Williamson. Improved Approximation Algorithms for Maximum Cut and Satisfiability Problems Using Semidefinite Programming. JACM 42(6): 1115--1145, 1995. Google ScholarDigital Library
- S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. Proc. 41st FOCS, pp. 359--366, 2000. Google ScholarDigital Library
- A. Gupta and Éva Tardos. A constant factor approximation algorithm for a class of classification problems. Proc. 32nd STOC, pp. 652--658, 2000. Google ScholarDigital Library
- T. H. Haveliwala, A. Gionis, and P. Indyk. Scalable Techniques for Clustering the Web. Proc. 3rd WebDB, pp. 129--134, 2000.Google Scholar
- P. Indyk. A small approximately min-wise independent family of hash functions. Proc. 10th SODA, pp. 454--456, 1999. Google ScholarDigital Library
- P. Indyk. On approximate nearest neighbors in non-Euclidean spaces. Proc. 40th FOCS, pp. 148--155, 1999. Google ScholarDigital Library
- P. Indyk. Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation. Proc. 41st FOCS, 189-197, 2000. Google ScholarDigital Library
- Indyk, P., Motwani, R. Approximate nearest neighbors: towards removing the curse of dimensionality. Proc. 30th STOC pp. 604--613, 1998. Google ScholarDigital Library
- P. Indyk, R. Motwani, P. Raghavan, and S. Vempala. Locality-Preserving Hashing in Multidimensional Spaces. Proc. 29th STOC, pp. 618--625, 1997. Google ScholarDigital Library
- J. M. Kleinberg and Éva Tardos Approximation Algorithms for Classification Problems with Pairwise Relationships: Metric Labeling and Markov Random Fields. Proc. 40th FOCS, pp. 14--23, 1999. Google ScholarDigital Library
- N. Linial and O. Sasson. Non-Expansive Hashing. Combinatorica 18(1): 121--132, 1998.Google ScholarCross Ref
- N. Nisan. Pseudorandom sequences for space bounded computations. Combinatorica, 12:449--461, 1992.Google ScholarCross Ref
- Y. Rubner. Perceptual Metrics for Image Database Navigation. Phd Thesis, Stanford University, May 1999. Google ScholarDigital Library
- Y. Rubner, L. J. Guibas, and C. Tomasi. The Earth Mover's Distance, Multi-Dimensional Scaling, and Color-Based Image Retrieval. Proc. of the ARPA Image Understanding Workshop, pp. 661--668, 1997.Google Scholar
- Y. Rubner, C. Tomasi. Texture Metrics. Proc. IEEE International Conference on Systems, Man, and Cybernetics, 1998, pp. 4601--4607.Google Scholar
- Y. Rubner, C. Tomasi, and L. J. Guibas. A Metric for Distributions with Applications to Image Databases. Proc. IEEE Int. Conf. on Computer Vision, pp. 59--66, 1998. Google ScholarDigital Library
- Y. Rubner, C. Tomasi, and L. J. Guibas. The Earth Mover's Distance as a Metric for Image Retrieval. Tech. Report STAN-CS-TN-98-86, Dept. of Computer Science, Stanford University, 1998. Google ScholarDigital Library
- M. Ruzon and C. Tomasi. Color edge detection with the compass operator. Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2:160--166, 1999.Google ScholarCross Ref
- M. Ruzon and C. Tomasi. Corner detection in textured color images. Proc. IEEE Int. Conf. Computer Vision, 2:1039--1045, 1999. Google ScholarDigital Library
Index Terms
- Similarity estimation techniques from rounding algorithms
Recommendations
Learning similarity with cosine similarity ensemble
This paper proposes a cosine similarity ensemble (CSE) method to learn similarity.CSE is a selective ensemble and combines multiple cosine similarity learners.A learner redefines the pattern vectors and determines its threshold adaptively.Experimental ...
Distance Learning for Similarity Estimation
In this paper, we present a general guideline to find a better distance measure for similarity estimation based on statistical analysis of distribution models and distance functions. A new set of distance measures are derived from the harmonic distance, ...
Improving performance of similarity measures for uncertain time series using preprocessing techniques
SSDBM '15: Proceedings of the 27th International Conference on Scientific and Statistical Database ManagementWe study the impact of preprocessing techniques on performance and effectiveness of the similarity measures for uncertain time series. Some existing work on uncertain time series use the same similarity measures developed for standard time series, to ...
Comments