ABSTRACT
Locality-sensitive hashing (LSH) is a basic primitive in several large-scale data processing applications, including nearest-neighbor search, de-duplication, clustering, etc. In this paper we propose a new and simple method to speed up the widely-used Euclidean realization of LSH. At the heart of our method is a fast way to estimate the Euclidean distance between two d-dimensional vectors; this is achieved by the use of randomized Hadamard transforms in a non-linear setting. This decreases the running time of a (k, L)-parameterized LSH from O(dkL) to O(dlog d + kL). Our experiments show that using the new LSH in nearest-neighbor applications can improve their running times by significant amounts. To the best of our knowledge, this is the first running time improvement to LSH that is both provable and practical.
- N. Ailon and B. Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput., 39(1):302--322, 2009. Google ScholarDigital Library
- N. Ailon and E. Liberty. Fast dimension reduction using Rademacher series on dual BCH codes. Discrete and Computational Geometry, 42(4):615--630, 2009. Google ScholarDigital Library
- A. Andoni. Nearest Neighbor Search: the Old, the New, and the Impossible. PhD thesis, MIT, 2009.Google Scholar
- A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117--122, 2008. Google ScholarDigital Library
- Y. Bachrach and E. Porat. Fast pseudo-random fingerprints. Arxiv preprint arXiv:1009.5791, 2010.Google Scholar
- R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In Proc. 16th WWW, pages 131--140, 2007. Google ScholarDigital Library
- A. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In Proc. 6th WWW, pages 391--404, 1997. Google ScholarDigital Library
- A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. J. Comput. Syst. Sci., 60(3):630--659, 2000. Google ScholarDigital Library
- M. S. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. 34th STOC, pages 380--388, 2002. Google ScholarDigital Library
- S. Chien and N. Immorlica. Semantic similarity between search engine queries using temporal correlation. In Proc. 14th WWW, pages 2--11, 2005. Google ScholarDigital Library
- S. Danziger, J. Zeng, Y. Wang, R. Brachmann, and R. Lathrop. Choosing where to look next in a mutation sequence space: Active Learning of informative p53 cancer rescue mutants. Bioinformatics, 23(13):i104, 2007. Google ScholarDigital Library
- M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proc. 20th SOCG, pages 253--262, 2004. Google ScholarDigital Library
- K. Eshghi and S. Rajaram. Locality sensitive hash functions based on concomitant rank order statistics. In Proc. 14th KDD, pages 221--229, 2008. Google ScholarDigital Library
- G. Feigenblat, E. Porat, and A. Shiftan. Even better framework for min-wise based algorithms. Arxiv preprint arXiv:1102.3537, 2011.Google Scholar
- I. K. Fodor. A survey of dimension reduction techniques. Technical Report UCRL-ID-148494, Lawrence Livermore National Laboratory, 2002.Google ScholarCross Ref
- A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. 25th VLDB, pages 518--529, 1999. Google ScholarDigital Library
- M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM, 42(6):1115--1145, 1995. Google ScholarDigital Library
- J. He, W. Liu, and S. Chang. Scalable similarity search with optimized kernel hashing. In Proc. 16th KDD, pages 1129--1138, 2010. Google ScholarDigital Library
- M. R. Henzinger. Finding near-duplicate web pages: A large-scale evaluation of algorithms. In Proc. 29th SIGIR, pages 284--291, 2006. Google ScholarDigital Library
- W. Hoeffding. Probability inequalities for sums of bounded random variables. J. ASA, 58(301):13--30, 1963.Google Scholar
- R. Horn and C. Johnson. Matrix analysis. Cambridge Univ Press, 1990. Google ScholarDigital Library
- P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. 30th STOC, pages 604--613, 1998. Google ScholarDigital Library
- P. Indyk, R. Motwani, P. Raghavan, and S. Vempala. Locality-preserving hashing in multidimensional spaces. In Proc. 29th STOC, pages 618--625, 1997. Google ScholarDigital Library
- I. Ipsen and R. Rehman. Perturbation bounds for determinants and characteristic polynomials. SIAM J. Matrix Analysis and Applications, 30(2):762--776, 2008. Google ScholarDigital Library
- N. Koudas and D. Srivatsava. Approximate joins: Concepts and techniques. In Proc. 31st VLDB, page 1363, 2005. Google ScholarDigital Library
- E. Liberty, N. Ailon, and A. Singer. Dense fast random projections and lean Walsh transforms. In Proc. 12th RANDOM, pages 512--522, 2008. Google ScholarDigital Library
- Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe LSH: Efficient indexing for high-dimensional similarity search. In Proc. VLDB, pages 950--961, 2007. Google ScholarDigital Library
- G. S. Manku, A. Jain, and A. D. Sarma. Detecting near-duplicates for web crawling. In Proc. 16th WWW, pages 141--150, 2007. Google ScholarDigital Library
- C. McDiarmid. Concentration. In M. Habib, C. McDiarmid, J. Ramirez-Alfonsin, and B. Reed, editors, Probabilistic Methods for Algorithmic Discrete Mathematics, volume 16, pages 195--248. Springer, 1998.Google ScholarCross Ref
- R. Panigrahy. Entropy-based nearest neighbor search in high dimensions. In Proc. 17th SODA, pages 1186--1195, 2006. Google ScholarDigital Library
- J. Vybiral. A variant of the Johnson-Lindenstrauss lemma for circulant matrices. J. Functional Analysis, 260(4):1096--1105, 2011.Google ScholarCross Ref
- R. Weber, H. Schek, and S. Blott. A quantitative analysis and performance study for similarity search methods in high dimensional spaces. In Proc. 24th VLDB, pages 194--205, 1998. Google ScholarDigital Library
Index Terms
- Fast locality-sensitive hashing
Recommendations
Lower bounds on locality sensitive hashing
SCG '06: Proceedings of the twenty-second annual symposium on Computational geometryGiven a metric space (X,dX), c≥1, r>0, and p,q ≡ [0,1], a distribution over mappings H : X → N is called a (r,cr,p,q)-sensitive hash family if any two points in X at distance at most r are mapped by H to the same value with probability at least p, and ...
Efficient distributed locality sensitive hashing
CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge managementDistributed frameworks are gaining increasingly widespread use in applications that process large amounts of data. One important example application is large scale similarity search, for which Locality Sensitive Hashing (LSH) has emerged as the method ...
Locality-sensitive hashing scheme based on dynamic collision counting
SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of DataLocality-Sensitive Hashing (LSH) and its variants are well-known methods for solving the c-approximate NN Search problem in high-dimensional space. Traditionally, several LSH functions are concatenated to form a "static" compound hash function for ...
Comments