Distributed similarity search in high dimensions using locality sensitive hashing

Authors:
Parisa Haghani

EPFL Lausanne, Switzerland

EPFL Lausanne, Switzerland
View Profile

,
Sebastian Michel

EPFL Lausanne, Switzerland

EPFL Lausanne, Switzerland
View Profile

,
Karl Aberer

EPFL Lausanne, Switzerland

EPFL Lausanne, Switzerland
View Profile

EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database TechnologyMarch 2009Pages 744–755https://doi.org/10.1145/1516360.1516446

Published:24 March 2009Publication History

EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology

Pages 744–755

ABSTRACT

In this paper we consider distributed K-Nearest Neighbor (KNN) search and range query processing in high dimensional data. Our approach is based on Locality Sensitive Hashing (LSH) which has proven very efficient in answering KNN queries in centralized settings. We consider mappings from the multi-dimensional LSH bucket space to the linearly ordered set of peers that jointly maintain the indexed data and derive requirements to achieve high quality search results and limit the number of network accesses. We put forward two such mappings that come with these salient properties: being locality preserving so that buckets likely to hold similar data are stored on the same or neighboring peers and having a predictable output distribution to ensure fair load balancing. We show how to leverage the linearly aligned data for efficient KNN search and how to efficiently process range queries which is, to the best of our knowledge, not possible in existing LSH schemes. We show by comprehensive performance evaluations using real world data that our approach brings major performance and accuracy gains compared to state-of-the-art.

References

Ittai Abraham, Yair Bartal, and Ofer Neiman. Advances in metric embedding theory. In STOC, 2006. Google ScholarDigital Library
Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS, 2006. Google ScholarDigital Library
Sunil Arya and David M. Mount. Approximate range searching. Comput. Geom., 2000. Google ScholarDigital Library
Vassilis Athitsos, Michalis Potamias, Panagiotis Papapetrou, and George Kollios. Nearest neighbor retrieval using distance-based hashing. In ICDE, 2008. Google ScholarDigital Library
Farnoush Banaei-Kashani and Cyrus Shahabi. Swam: a family of access methods for similarity-search in peer-to-peer data networks. In CIKM, 2004. Google ScholarDigital Library
Mayank Bawa, Tyson Condie, and Prasanna Ganesan. Lsh forest: self-tuning indexes for similarity search. In WWW, 2005. Google ScholarDigital Library
Jon Louis Bentley. K-d trees for semidynamic point sets. In Symposium on Computational Geometry, 1990. Google ScholarDigital Library
Stefan Berchtold, Christian Böhm, and Hans-Peter Kriegel. The pyramid-technique: towards breaking the curse of dimensionality. SIGMOD Rec., 27(2), 1998. Google ScholarDigital Library
Kevin S. Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is "nearest neighbor" meaningful? In ICDT, 1999. Google ScholarDigital Library
Ashwin R. Bharambe, Mukesh Agrawal, and Srinivasan Seshan. Mercury: supporting scalable multi-attribute range queries. In SIGCOMM, 2004. Google ScholarDigital Library
Erik Buchmann and Klemens Böhm. Efficient evaluation of nearest-neighbor queries in content-addressable networks. In From Integrated Publication and Information Systems to Virtual Information and Knowledge Environments, pages 31--40, 2005.Google Scholar
Bernard Chazelle, Ding Liu, and Avner Magen. Approximate range searching in higher dimension. Comput. Geom., 39(1):24--29, 2008. Google ScholarDigital Library
Adina Crainiceanu, Prakash Linga, Ashwin Machanavajjhala, Johannes Gehrke, and Jayavel Shanmugasundaram. P-ring: an efficient and robust p2p range index structure. In SIGMOD, 2007. Google ScholarDigital Library
Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Symposium on Computational Geometry, 2004. Google ScholarDigital Library
Christos Doulkeridis, Akrivi Vlachou, Yannis Kotidis, and Michalis Vazirgiannis. Peer-to-peer similarity search in metric spaces. In VLDB, 2007. Google ScholarDigital Library
Fabrizio Falchi, Claudio Gennaro, and Pavel Zezula. A content-addressable network for similarity search in metric spaces. In DBISP2P, 2005. Google ScholarDigital Library
Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. In VLDB, pages 518--529, 1999. Google ScholarDigital Library
Antonin Guttman. R-trees: A dynamic index structure for spatial searching. In SIGMOD Conference, pages 47--57, 1984. Google ScholarDigital Library
Parisa Haghani, Sebastian Michel, Philippe Cudré-Mauroux, and Karl Aberer. LSH at large -- distributed knn search in high dimensions. In WebDB, 2008.Google Scholar
H. V. Jagadish, Beng Chin Ooi, Kian-Lee Tan, Cui Yu, and Rui Zhang 0003. idistance: An adaptive b⁺ -tree based indexing method for nearest neighbor search. TODS, 30(2), 2005. Google ScholarDigital Library
H. V. Jagadish, Beng Chin Ooi, Quang Hieu Vu, Rong Zhang, and Aoying Zhou. Vbi-tree: A peer-to-peer framework for supporting multi-dimensional indexing schemes. In ICDE, 2006. Google ScholarDigital Library
Márk Jelasity, Alberto Montresor, and Özalp Babaoglu. Gossip-based aggregation in large dynamic networks. ACM Trans. Comput. Syst., 23(3), 2005. Google ScholarDigital Library
Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. Multi-probe lsh: Efficient indexing for high-dimensional similarity search. In VLDB, 2007. Google ScholarDigital Library
Michael Ortega, Yong Rui, Kaushik Chakrabarti, Kriengkrai Porkaew, Sharad Mehrotra, and Thomas S. Huang. Supporting ranked boolean similarity queries in mars. IEEE Trans. Knowl. Data Eng., 10(6), 1998. Google ScholarDigital Library
Rina Panigrahy. Entropy based nearest neighbor search in high dimensions. In SODA, 2006. Google ScholarDigital Library
Theoni Pitoura, Nikos Ntarmos, and Peter Triantafillou. Replication, load balancing and efficient range query processing in dhts. In EDBT, 2006. Google ScholarDigital Library
Theoni Pitoura and Peter Triantafillou. Load distribution fairness in p2p data management systems. In ICDE, 2007.Google ScholarCross Ref
Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard M. Karp, and Scott Shenker. A scalable content-addressable network. In SIGCOMM, 2001. Google ScholarDigital Library
Ozgur D. Sahin, Fatih Emekçi, Divyakant Agrawal, and Amr El Abbadi. Content-based similarity search over peer-to-peer systems. In DBISP2P, 2004. Google ScholarDigital Library
Ion Stoica, Robert Morris, David R. Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In SIGCOMM, 2001. Google ScholarDigital Library
Chunqiang Tang, Zhichen Xu, and Sandhya Dwarkadas. Peer-to-peer information retrieval using self-organizing semantic overlay networks. In SIGCOMM, 2003. Google ScholarDigital Library
Cui Yu, Beng Chin Ooi, Kian-Lee Tan, and H. V. Jagadish. Indexing the distance: An efficient method to knn processing. In VLDB, 2001. Google ScholarDigital Library
Chi Zhang, Arvind Krishnamurthy, and Randolph Y. Wang. Skipindex: Towards a scalable peer-to-peer index service for high dimensional data. Technical report, Department of Computer Science, Princeton University, May 2004.Google Scholar

Index Terms

Distributed similarity search in high dimensions using locality sensitive hashing
1. Information systems
  1. Information retrieval
  2. Information systems applications

Recommendations

Efficient distributed locality sensitive hashing
CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management

Distributed frameworks are gaining increasingly widespread use in applications that process large amounts of data. One important example application is large scale similarity search, for which Locality Sensitive Hashing (LSH) has emerged as the method ...
Read More
Effectiveness of NAQ-tree as index structure for similarity search in high-dimensional metric space

Similarity search (e.g., k-nearest neighbor search) in high-dimensional metric space is the key operation in many applications, such as multimedia databases, image retrieval and object recognition, among others. The high dimensionality and the huge size ...
Read More
Query-aware locality-sensitive hashing for approximate nearest neighbor search

Locality-Sensitive Hashing (LSH) and its variants are the well-known indexing schemes for the c-Approximate Nearest Neighbor (c-ANN) search problem in high-dimensional Euclidean space. Traditionally, LSH functions are constructed in a query-oblivious ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
March 2009
1180 pages
ISBN:9781605584225
DOI:10.1145/1516360
Editors:
Martin Kersten
CWI, The Netherlands
,
Boris Novikov
University of Saint Petersburg, Russia
,
Jens Teubner
ETH Zurich, Switzerland
,
Vladimir Polutin
HP Labs, Russia
,
Stefan Manegold
CWI, The Netherlands
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 24 March 2009
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
KNN search
P2P
distributed high dimensional search
high dimensionality
range query
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate7of10submissions,70%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 65
  Total Citations
  View Citations
- 1,267
  Total Downloads
- Downloads (Last 12 months)95
- Downloads (Last 6 weeks)20
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Distributed similarity search in high dimensions using locality sensitive hashing

EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology

ABSTRACT

References

Cited By

Index Terms

Recommendations

Efficient distributed locality sensitive hashing

Effectiveness of NAQ-tree as index structure for similarity search in high-dimensional metric space

Query-aware locality-sensitive hashing for approximate nearest neighbor search