ABSTRACT
In this paper we consider distributed K-Nearest Neighbor (KNN) search and range query processing in high dimensional data. Our approach is based on Locality Sensitive Hashing (LSH) which has proven very efficient in answering KNN queries in centralized settings. We consider mappings from the multi-dimensional LSH bucket space to the linearly ordered set of peers that jointly maintain the indexed data and derive requirements to achieve high quality search results and limit the number of network accesses. We put forward two such mappings that come with these salient properties: being locality preserving so that buckets likely to hold similar data are stored on the same or neighboring peers and having a predictable output distribution to ensure fair load balancing. We show how to leverage the linearly aligned data for efficient KNN search and how to efficiently process range queries which is, to the best of our knowledge, not possible in existing LSH schemes. We show by comprehensive performance evaluations using real world data that our approach brings major performance and accuracy gains compared to state-of-the-art.
- Ittai Abraham, Yair Bartal, and Ofer Neiman. Advances in metric embedding theory. In STOC, 2006. Google ScholarDigital Library
- Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS, 2006. Google ScholarDigital Library
- Sunil Arya and David M. Mount. Approximate range searching. Comput. Geom., 2000. Google ScholarDigital Library
- Vassilis Athitsos, Michalis Potamias, Panagiotis Papapetrou, and George Kollios. Nearest neighbor retrieval using distance-based hashing. In ICDE, 2008. Google ScholarDigital Library
- Farnoush Banaei-Kashani and Cyrus Shahabi. Swam: a family of access methods for similarity-search in peer-to-peer data networks. In CIKM, 2004. Google ScholarDigital Library
- Mayank Bawa, Tyson Condie, and Prasanna Ganesan. Lsh forest: self-tuning indexes for similarity search. In WWW, 2005. Google ScholarDigital Library
- Jon Louis Bentley. K-d trees for semidynamic point sets. In Symposium on Computational Geometry, 1990. Google ScholarDigital Library
- Stefan Berchtold, Christian Böhm, and Hans-Peter Kriegel. The pyramid-technique: towards breaking the curse of dimensionality. SIGMOD Rec., 27(2), 1998. Google ScholarDigital Library
- Kevin S. Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is "nearest neighbor" meaningful? In ICDT, 1999. Google ScholarDigital Library
- Ashwin R. Bharambe, Mukesh Agrawal, and Srinivasan Seshan. Mercury: supporting scalable multi-attribute range queries. In SIGCOMM, 2004. Google ScholarDigital Library
- Erik Buchmann and Klemens Böhm. Efficient evaluation of nearest-neighbor queries in content-addressable networks. In From Integrated Publication and Information Systems to Virtual Information and Knowledge Environments, pages 31--40, 2005.Google Scholar
- Bernard Chazelle, Ding Liu, and Avner Magen. Approximate range searching in higher dimension. Comput. Geom., 39(1):24--29, 2008. Google ScholarDigital Library
- Adina Crainiceanu, Prakash Linga, Ashwin Machanavajjhala, Johannes Gehrke, and Jayavel Shanmugasundaram. P-ring: an efficient and robust p2p range index structure. In SIGMOD, 2007. Google ScholarDigital Library
- Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Symposium on Computational Geometry, 2004. Google ScholarDigital Library
- Christos Doulkeridis, Akrivi Vlachou, Yannis Kotidis, and Michalis Vazirgiannis. Peer-to-peer similarity search in metric spaces. In VLDB, 2007. Google ScholarDigital Library
- Fabrizio Falchi, Claudio Gennaro, and Pavel Zezula. A content-addressable network for similarity search in metric spaces. In DBISP2P, 2005. Google ScholarDigital Library
- Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. In VLDB, pages 518--529, 1999. Google ScholarDigital Library
- Antonin Guttman. R-trees: A dynamic index structure for spatial searching. In SIGMOD Conference, pages 47--57, 1984. Google ScholarDigital Library
- Parisa Haghani, Sebastian Michel, Philippe Cudré-Mauroux, and Karl Aberer. LSH at large -- distributed knn search in high dimensions. In WebDB, 2008.Google Scholar
- H. V. Jagadish, Beng Chin Ooi, Kian-Lee Tan, Cui Yu, and Rui Zhang 0003. idistance: An adaptive b+ -tree based indexing method for nearest neighbor search. TODS, 30(2), 2005. Google ScholarDigital Library
- H. V. Jagadish, Beng Chin Ooi, Quang Hieu Vu, Rong Zhang, and Aoying Zhou. Vbi-tree: A peer-to-peer framework for supporting multi-dimensional indexing schemes. In ICDE, 2006. Google ScholarDigital Library
- Márk Jelasity, Alberto Montresor, and Özalp Babaoglu. Gossip-based aggregation in large dynamic networks. ACM Trans. Comput. Syst., 23(3), 2005. Google ScholarDigital Library
- Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. Multi-probe lsh: Efficient indexing for high-dimensional similarity search. In VLDB, 2007. Google ScholarDigital Library
- Michael Ortega, Yong Rui, Kaushik Chakrabarti, Kriengkrai Porkaew, Sharad Mehrotra, and Thomas S. Huang. Supporting ranked boolean similarity queries in mars. IEEE Trans. Knowl. Data Eng., 10(6), 1998. Google ScholarDigital Library
- Rina Panigrahy. Entropy based nearest neighbor search in high dimensions. In SODA, 2006. Google ScholarDigital Library
- Theoni Pitoura, Nikos Ntarmos, and Peter Triantafillou. Replication, load balancing and efficient range query processing in dhts. In EDBT, 2006. Google ScholarDigital Library
- Theoni Pitoura and Peter Triantafillou. Load distribution fairness in p2p data management systems. In ICDE, 2007.Google ScholarCross Ref
- Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard M. Karp, and Scott Shenker. A scalable content-addressable network. In SIGCOMM, 2001. Google ScholarDigital Library
- Ozgur D. Sahin, Fatih Emekçi, Divyakant Agrawal, and Amr El Abbadi. Content-based similarity search over peer-to-peer systems. In DBISP2P, 2004. Google ScholarDigital Library
- Ion Stoica, Robert Morris, David R. Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In SIGCOMM, 2001. Google ScholarDigital Library
- Chunqiang Tang, Zhichen Xu, and Sandhya Dwarkadas. Peer-to-peer information retrieval using self-organizing semantic overlay networks. In SIGCOMM, 2003. Google ScholarDigital Library
- Cui Yu, Beng Chin Ooi, Kian-Lee Tan, and H. V. Jagadish. Indexing the distance: An efficient method to knn processing. In VLDB, 2001. Google ScholarDigital Library
- Chi Zhang, Arvind Krishnamurthy, and Randolph Y. Wang. Skipindex: Towards a scalable peer-to-peer index service for high dimensional data. Technical report, Department of Computer Science, Princeton University, May 2004.Google Scholar
Index Terms
- Distributed similarity search in high dimensions using locality sensitive hashing
Recommendations
Efficient distributed locality sensitive hashing
CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge managementDistributed frameworks are gaining increasingly widespread use in applications that process large amounts of data. One important example application is large scale similarity search, for which Locality Sensitive Hashing (LSH) has emerged as the method ...
Effectiveness of NAQ-tree as index structure for similarity search in high-dimensional metric space
Similarity search (e.g., k-nearest neighbor search) in high-dimensional metric space is the key operation in many applications, such as multimedia databases, image retrieval and object recognition, among others. The high dimensionality and the huge size ...
Query-aware locality-sensitive hashing for approximate nearest neighbor search
Locality-Sensitive Hashing (LSH) and its variants are the well-known indexing schemes for the c-Approximate Nearest Neighbor (c-ANN) search problem in high-dimensional Euclidean space. Traditionally, LSH functions are constructed in a query-oblivious ...
Comments