poster

Fast locality-sensitive hashing

Authors:
Anirban Dasgupta

Yahoo!, Sunnyvale, CA, USA

Yahoo!, Sunnyvale, CA, USA
View Profile

,
Ravi Kumar

Yahoo!, Sunnyvale, CA, USA

Yahoo!, Sunnyvale, CA, USA
View Profile

,
Tamas Sarlos

Yahoo!, Sunnyvale, CA, USA

Yahoo!, Sunnyvale, CA, USA
View Profile

KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data miningAugust 2011Pages 1073–1081https://doi.org/10.1145/2020408.2020578

Published:21 August 2011Publication History

KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 1073–1081

ABSTRACT

Locality-sensitive hashing (LSH) is a basic primitive in several large-scale data processing applications, including nearest-neighbor search, de-duplication, clustering, etc. In this paper we propose a new and simple method to speed up the widely-used Euclidean realization of LSH. At the heart of our method is a fast way to estimate the Euclidean distance between two d-dimensional vectors; this is achieved by the use of randomized Hadamard transforms in a non-linear setting. This decreases the running time of a (k, L)-parameterized LSH from O(dkL) to O(dlog d + kL). Our experiments show that using the new LSH in nearest-neighbor applications can improve their running times by significant amounts. To the best of our knowledge, this is the first running time improvement to LSH that is both provable and practical.

References

N. Ailon and B. Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput., 39(1):302--322, 2009. Google ScholarDigital Library
N. Ailon and E. Liberty. Fast dimension reduction using Rademacher series on dual BCH codes. Discrete and Computational Geometry, 42(4):615--630, 2009. Google ScholarDigital Library
A. Andoni. Nearest Neighbor Search: the Old, the New, and the Impossible. PhD thesis, MIT, 2009.Google Scholar
A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117--122, 2008. Google ScholarDigital Library
Y. Bachrach and E. Porat. Fast pseudo-random fingerprints. Arxiv preprint arXiv:1009.5791, 2010.Google Scholar
R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In Proc. 16th WWW, pages 131--140, 2007. Google ScholarDigital Library
A. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In Proc. 6th WWW, pages 391--404, 1997. Google ScholarDigital Library
A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. J. Comput. Syst. Sci., 60(3):630--659, 2000. Google ScholarDigital Library
M. S. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. 34th STOC, pages 380--388, 2002. Google ScholarDigital Library
S. Chien and N. Immorlica. Semantic similarity between search engine queries using temporal correlation. In Proc. 14th WWW, pages 2--11, 2005. Google ScholarDigital Library
S. Danziger, J. Zeng, Y. Wang, R. Brachmann, and R. Lathrop. Choosing where to look next in a mutation sequence space: Active Learning of informative p53 cancer rescue mutants. Bioinformatics, 23(13):i104, 2007. Google ScholarDigital Library
M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proc. 20th SOCG, pages 253--262, 2004. Google ScholarDigital Library
K. Eshghi and S. Rajaram. Locality sensitive hash functions based on concomitant rank order statistics. In Proc. 14th KDD, pages 221--229, 2008. Google ScholarDigital Library
G. Feigenblat, E. Porat, and A. Shiftan. Even better framework for min-wise based algorithms. Arxiv preprint arXiv:1102.3537, 2011.Google Scholar
I. K. Fodor. A survey of dimension reduction techniques. Technical Report UCRL-ID-148494, Lawrence Livermore National Laboratory, 2002.Google ScholarCross Ref
A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. 25th VLDB, pages 518--529, 1999. Google ScholarDigital Library
M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM, 42(6):1115--1145, 1995. Google ScholarDigital Library
J. He, W. Liu, and S. Chang. Scalable similarity search with optimized kernel hashing. In Proc. 16th KDD, pages 1129--1138, 2010. Google ScholarDigital Library
M. R. Henzinger. Finding near-duplicate web pages: A large-scale evaluation of algorithms. In Proc. 29th SIGIR, pages 284--291, 2006. Google ScholarDigital Library
W. Hoeffding. Probability inequalities for sums of bounded random variables. J. ASA, 58(301):13--30, 1963.Google Scholar
R. Horn and C. Johnson. Matrix analysis. Cambridge Univ Press, 1990. Google ScholarDigital Library
P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. 30th STOC, pages 604--613, 1998. Google ScholarDigital Library
P. Indyk, R. Motwani, P. Raghavan, and S. Vempala. Locality-preserving hashing in multidimensional spaces. In Proc. 29th STOC, pages 618--625, 1997. Google ScholarDigital Library
I. Ipsen and R. Rehman. Perturbation bounds for determinants and characteristic polynomials. SIAM J. Matrix Analysis and Applications, 30(2):762--776, 2008. Google ScholarDigital Library
N. Koudas and D. Srivatsava. Approximate joins: Concepts and techniques. In Proc. 31st VLDB, page 1363, 2005. Google ScholarDigital Library
E. Liberty, N. Ailon, and A. Singer. Dense fast random projections and lean Walsh transforms. In Proc. 12th RANDOM, pages 512--522, 2008. Google ScholarDigital Library
Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe LSH: Efficient indexing for high-dimensional similarity search. In Proc. VLDB, pages 950--961, 2007. Google ScholarDigital Library
G. S. Manku, A. Jain, and A. D. Sarma. Detecting near-duplicates for web crawling. In Proc. 16th WWW, pages 141--150, 2007. Google ScholarDigital Library
C. McDiarmid. Concentration. In M. Habib, C. McDiarmid, J. Ramirez-Alfonsin, and B. Reed, editors, Probabilistic Methods for Algorithmic Discrete Mathematics, volume 16, pages 195--248. Springer, 1998.Google ScholarCross Ref
R. Panigrahy. Entropy-based nearest neighbor search in high dimensions. In Proc. 17th SODA, pages 1186--1195, 2006. Google ScholarDigital Library
J. Vybiral. A variant of the Johnson-Lindenstrauss lemma for circulant matrices. J. Functional Analysis, 260(4):1096--1105, 2011.Google ScholarCross Ref
R. Weber, H. Schek, and S. Blott. A quantitative analysis and performance study for similarity search methods in high dimensional spaces. In Proc. 24th VLDB, pages 194--205, 1998. Google ScholarDigital Library

Index Terms

Fast locality-sensitive hashing
1. Information systems
  1. Information retrieval

Recommendations

Lower bounds on locality sensitive hashing
SCG '06: Proceedings of the twenty-second annual symposium on Computational geometry

Given a metric space (X,d_X), c≥1, r>0, and p,q ≡ [0,1], a distribution over mappings H : X → N is called a (r,cr,p,q)-sensitive hash family if any two points in X at distance at most r are mapped by H to the same value with probability at least p, and ...
Read More
Efficient distributed locality sensitive hashing
CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management

Distributed frameworks are gaining increasingly widespread use in applications that process large amounts of data. One important example application is large scale similarity search, for which Locality Sensitive Hashing (LSH) has emerged as the method ...
Read More
Locality-sensitive hashing scheme based on dynamic collision counting
SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data

Locality-Sensitive Hashing (LSH) and its variants are well-known methods for solving the c-approximate NN Search problem in high-dimensional space. Traditionally, several LSH functions are concatenated to form a "static" compound hash function for ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2011
1446 pages
ISBN:9781450308137
DOI:10.1145/2020408
General Chair:
Chid Apte
IBM Research
,
Program Chairs:
Joydeep Ghosh
UT Austin
,
Padhraic Smyth
UC Irvine
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 21 August 2011
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
dimension reduction
hadamard transform
locality sensitive hashing
nearest neighbour search
Qualifiers
- poster
Conference

Acceptance Rates
Overall Acceptance Rate1,133of8,635submissions,13%
Upcoming Conference
KDD '24

Sponsor:

sigkdd

sigkdd

KDD '24: The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 25 - 29, 2024

Barcelona , Spain
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 58
  Total Citations
  View Citations
- 1,544
  Total Downloads
- Downloads (Last 12 months)157
- Downloads (Last 6 weeks)15
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Fast locality-sensitive hashing

KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining

ABSTRACT

References

Cited By

Index Terms

Recommendations

Lower bounds on locality sensitive hashing

Efficient distributed locality sensitive hashing

Locality-sensitive hashing scheme based on dynamic collision counting