skip to main content
10.1145/2020408.2020578acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
poster

Fast locality-sensitive hashing

Published:21 August 2011Publication History

ABSTRACT

Locality-sensitive hashing (LSH) is a basic primitive in several large-scale data processing applications, including nearest-neighbor search, de-duplication, clustering, etc. In this paper we propose a new and simple method to speed up the widely-used Euclidean realization of LSH. At the heart of our method is a fast way to estimate the Euclidean distance between two d-dimensional vectors; this is achieved by the use of randomized Hadamard transforms in a non-linear setting. This decreases the running time of a (k, L)-parameterized LSH from O(dkL) to O(dlog d + kL). Our experiments show that using the new LSH in nearest-neighbor applications can improve their running times by significant amounts. To the best of our knowledge, this is the first running time improvement to LSH that is both provable and practical.

References

  1. N. Ailon and B. Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput., 39(1):302--322, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. N. Ailon and E. Liberty. Fast dimension reduction using Rademacher series on dual BCH codes. Discrete and Computational Geometry, 42(4):615--630, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. A. Andoni. Nearest Neighbor Search: the Old, the New, and the Impossible. PhD thesis, MIT, 2009.Google ScholarGoogle Scholar
  4. A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117--122, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Y. Bachrach and E. Porat. Fast pseudo-random fingerprints. Arxiv preprint arXiv:1009.5791, 2010.Google ScholarGoogle Scholar
  6. R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In Proc. 16th WWW, pages 131--140, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. A. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In Proc. 6th WWW, pages 391--404, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. J. Comput. Syst. Sci., 60(3):630--659, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. M. S. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. 34th STOC, pages 380--388, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. S. Chien and N. Immorlica. Semantic similarity between search engine queries using temporal correlation. In Proc. 14th WWW, pages 2--11, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. S. Danziger, J. Zeng, Y. Wang, R. Brachmann, and R. Lathrop. Choosing where to look next in a mutation sequence space: Active Learning of informative p53 cancer rescue mutants. Bioinformatics, 23(13):i104, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proc. 20th SOCG, pages 253--262, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. K. Eshghi and S. Rajaram. Locality sensitive hash functions based on concomitant rank order statistics. In Proc. 14th KDD, pages 221--229, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. G. Feigenblat, E. Porat, and A. Shiftan. Even better framework for min-wise based algorithms. Arxiv preprint arXiv:1102.3537, 2011.Google ScholarGoogle Scholar
  15. I. K. Fodor. A survey of dimension reduction techniques. Technical Report UCRL-ID-148494, Lawrence Livermore National Laboratory, 2002.Google ScholarGoogle ScholarCross RefCross Ref
  16. A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. 25th VLDB, pages 518--529, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM, 42(6):1115--1145, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. J. He, W. Liu, and S. Chang. Scalable similarity search with optimized kernel hashing. In Proc. 16th KDD, pages 1129--1138, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. M. R. Henzinger. Finding near-duplicate web pages: A large-scale evaluation of algorithms. In Proc. 29th SIGIR, pages 284--291, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. W. Hoeffding. Probability inequalities for sums of bounded random variables. J. ASA, 58(301):13--30, 1963.Google ScholarGoogle Scholar
  21. R. Horn and C. Johnson. Matrix analysis. Cambridge Univ Press, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. 30th STOC, pages 604--613, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. P. Indyk, R. Motwani, P. Raghavan, and S. Vempala. Locality-preserving hashing in multidimensional spaces. In Proc. 29th STOC, pages 618--625, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. I. Ipsen and R. Rehman. Perturbation bounds for determinants and characteristic polynomials. SIAM J. Matrix Analysis and Applications, 30(2):762--776, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. N. Koudas and D. Srivatsava. Approximate joins: Concepts and techniques. In Proc. 31st VLDB, page 1363, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. E. Liberty, N. Ailon, and A. Singer. Dense fast random projections and lean Walsh transforms. In Proc. 12th RANDOM, pages 512--522, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe LSH: Efficient indexing for high-dimensional similarity search. In Proc. VLDB, pages 950--961, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. G. S. Manku, A. Jain, and A. D. Sarma. Detecting near-duplicates for web crawling. In Proc. 16th WWW, pages 141--150, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. C. McDiarmid. Concentration. In M. Habib, C. McDiarmid, J. Ramirez-Alfonsin, and B. Reed, editors, Probabilistic Methods for Algorithmic Discrete Mathematics, volume 16, pages 195--248. Springer, 1998.Google ScholarGoogle ScholarCross RefCross Ref
  30. R. Panigrahy. Entropy-based nearest neighbor search in high dimensions. In Proc. 17th SODA, pages 1186--1195, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. J. Vybiral. A variant of the Johnson-Lindenstrauss lemma for circulant matrices. J. Functional Analysis, 260(4):1096--1105, 2011.Google ScholarGoogle ScholarCross RefCross Ref
  32. R. Weber, H. Schek, and S. Blott. A quantitative analysis and performance study for similarity search methods in high dimensional spaces. In Proc. 24th VLDB, pages 194--205, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Fast locality-sensitive hashing

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
      August 2011
      1446 pages
      ISBN:9781450308137
      DOI:10.1145/2020408

      Copyright © 2011 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 21 August 2011

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • poster

      Acceptance Rates

      Overall Acceptance Rate1,133of8,635submissions,13%

      Upcoming Conference

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader