skip to main content
10.1145/1516360.1516446acmotherconferencesArticle/Chapter ViewAbstractPublication PagesedbtConference Proceedingsconference-collections
research-article
Free Access

Distributed similarity search in high dimensions using locality sensitive hashing

Published:24 March 2009Publication History

ABSTRACT

In this paper we consider distributed K-Nearest Neighbor (KNN) search and range query processing in high dimensional data. Our approach is based on Locality Sensitive Hashing (LSH) which has proven very efficient in answering KNN queries in centralized settings. We consider mappings from the multi-dimensional LSH bucket space to the linearly ordered set of peers that jointly maintain the indexed data and derive requirements to achieve high quality search results and limit the number of network accesses. We put forward two such mappings that come with these salient properties: being locality preserving so that buckets likely to hold similar data are stored on the same or neighboring peers and having a predictable output distribution to ensure fair load balancing. We show how to leverage the linearly aligned data for efficient KNN search and how to efficiently process range queries which is, to the best of our knowledge, not possible in existing LSH schemes. We show by comprehensive performance evaluations using real world data that our approach brings major performance and accuracy gains compared to state-of-the-art.

References

  1. Ittai Abraham, Yair Bartal, and Ofer Neiman. Advances in metric embedding theory. In STOC, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Sunil Arya and David M. Mount. Approximate range searching. Comput. Geom., 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Vassilis Athitsos, Michalis Potamias, Panagiotis Papapetrou, and George Kollios. Nearest neighbor retrieval using distance-based hashing. In ICDE, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Farnoush Banaei-Kashani and Cyrus Shahabi. Swam: a family of access methods for similarity-search in peer-to-peer data networks. In CIKM, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Mayank Bawa, Tyson Condie, and Prasanna Ganesan. Lsh forest: self-tuning indexes for similarity search. In WWW, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Jon Louis Bentley. K-d trees for semidynamic point sets. In Symposium on Computational Geometry, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Stefan Berchtold, Christian Böhm, and Hans-Peter Kriegel. The pyramid-technique: towards breaking the curse of dimensionality. SIGMOD Rec., 27(2), 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Kevin S. Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is "nearest neighbor" meaningful? In ICDT, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Ashwin R. Bharambe, Mukesh Agrawal, and Srinivasan Seshan. Mercury: supporting scalable multi-attribute range queries. In SIGCOMM, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Erik Buchmann and Klemens Böhm. Efficient evaluation of nearest-neighbor queries in content-addressable networks. In From Integrated Publication and Information Systems to Virtual Information and Knowledge Environments, pages 31--40, 2005.Google ScholarGoogle Scholar
  12. Bernard Chazelle, Ding Liu, and Avner Magen. Approximate range searching in higher dimension. Comput. Geom., 39(1):24--29, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Adina Crainiceanu, Prakash Linga, Ashwin Machanavajjhala, Johannes Gehrke, and Jayavel Shanmugasundaram. P-ring: an efficient and robust p2p range index structure. In SIGMOD, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Symposium on Computational Geometry, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Christos Doulkeridis, Akrivi Vlachou, Yannis Kotidis, and Michalis Vazirgiannis. Peer-to-peer similarity search in metric spaces. In VLDB, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Fabrizio Falchi, Claudio Gennaro, and Pavel Zezula. A content-addressable network for similarity search in metric spaces. In DBISP2P, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. In VLDB, pages 518--529, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Antonin Guttman. R-trees: A dynamic index structure for spatial searching. In SIGMOD Conference, pages 47--57, 1984. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Parisa Haghani, Sebastian Michel, Philippe Cudré-Mauroux, and Karl Aberer. LSH at large -- distributed knn search in high dimensions. In WebDB, 2008.Google ScholarGoogle Scholar
  20. H. V. Jagadish, Beng Chin Ooi, Kian-Lee Tan, Cui Yu, and Rui Zhang 0003. idistance: An adaptive b+ -tree based indexing method for nearest neighbor search. TODS, 30(2), 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. H. V. Jagadish, Beng Chin Ooi, Quang Hieu Vu, Rong Zhang, and Aoying Zhou. Vbi-tree: A peer-to-peer framework for supporting multi-dimensional indexing schemes. In ICDE, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Márk Jelasity, Alberto Montresor, and Özalp Babaoglu. Gossip-based aggregation in large dynamic networks. ACM Trans. Comput. Syst., 23(3), 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. Multi-probe lsh: Efficient indexing for high-dimensional similarity search. In VLDB, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Michael Ortega, Yong Rui, Kaushik Chakrabarti, Kriengkrai Porkaew, Sharad Mehrotra, and Thomas S. Huang. Supporting ranked boolean similarity queries in mars. IEEE Trans. Knowl. Data Eng., 10(6), 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Rina Panigrahy. Entropy based nearest neighbor search in high dimensions. In SODA, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Theoni Pitoura, Nikos Ntarmos, and Peter Triantafillou. Replication, load balancing and efficient range query processing in dhts. In EDBT, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Theoni Pitoura and Peter Triantafillou. Load distribution fairness in p2p data management systems. In ICDE, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  28. Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard M. Karp, and Scott Shenker. A scalable content-addressable network. In SIGCOMM, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Ozgur D. Sahin, Fatih Emekçi, Divyakant Agrawal, and Amr El Abbadi. Content-based similarity search over peer-to-peer systems. In DBISP2P, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Ion Stoica, Robert Morris, David R. Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In SIGCOMM, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Chunqiang Tang, Zhichen Xu, and Sandhya Dwarkadas. Peer-to-peer information retrieval using self-organizing semantic overlay networks. In SIGCOMM, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Cui Yu, Beng Chin Ooi, Kian-Lee Tan, and H. V. Jagadish. Indexing the distance: An efficient method to knn processing. In VLDB, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Chi Zhang, Arvind Krishnamurthy, and Randolph Y. Wang. Skipindex: Towards a scalable peer-to-peer index service for high dimensional data. Technical report, Department of Computer Science, Princeton University, May 2004.Google ScholarGoogle Scholar

Index Terms

  1. Distributed similarity search in high dimensions using locality sensitive hashing

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Other conferences
            EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
            March 2009
            1180 pages
            ISBN:9781605584225
            DOI:10.1145/1516360

            Copyright © 2009 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 24 March 2009

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • research-article

            Acceptance Rates

            Overall Acceptance Rate7of10submissions,70%

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader