research-article

LoSHa: A General Framework for Scalable Locality Sensitive Hashing

Authors:

Ruihao ZhaoAuthors Info & Claims

SIGIR '17: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval

Pages 635 - 644

https://doi.org/10.1145/3077136.3080800

Published: 07 August 2017 Publication History

Abstract

Locality Sensitive Hashing (LSH) algorithms are widely adopted to index similar items in high dimensional space for approximate nearest neighbor search. As the volume of real-world datasets keeps growing, it has become necessary to develop distributed LSH solutions. Implementing a distributed LSH algorithm from scratch requires high development costs, thus most existing solutions are developed on general-purpose platforms such as Hadoop and Spark. However, we argue that these platforms are both hard to use for programming LSH algorithms and inefficient for LSH computation. We propose LoSHa, a distributed computing framework that reduces the development cost by designing a tailor-made, general programming interface and achieves high efficiency by exploring LSH-specific system implementation and optimizations. We show that many LSH algorithms can be easily expressed in LoSHa's API. We evaluate LoSHa and also compare with general-purpose platforms on the same LSH algorithms. Our results show that LoSHa's performance can be an order of magnitude faster, while the implementations on LoSHa are even more intuitive and require few lines of code.

References

[1]

B. Bahmani, A. Goel, and R. Shinde. Efficient distributed locality sensitive hashing. In CIKM, pages 2174--2178, 2012.

Digital Library

[2]

A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations (extended abstract). In STOC, pages 327--336, 1998.

Digital Library

[3]

M. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380--388, 2002.

Digital Library

[4]

A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW, pages 271--280, 2007.

Digital Library

[5]

M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, pages 253--262, 2004.

Digital Library

[6]

J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, pages 137--150, 2004.

Digital Library

[7]

J. Gan, J. Feng, Q. Fang, and W. Ng. Locality-sensitive hashing scheme based on dynamic collision counting. In SIGMOD, pages 541--552, 2012.

Digital Library

[8]

J. Gao, H. V. Jagadish, W. Lu, and B. C. Ooi. DSH: data sensitive hashing for high-dimensional k-nnsearch. In SIGMOD, pages 1127--1138, 2014.

Digital Library

[9]

A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, pages 518--529, 1999.

Digital Library

[10]

J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. Powergraph: Distributed graph-parallel computation on natural graphs. In OSDI, pages 17--30, 2012.

Digital Library

[11]

P. Haghani, S. Michel, and K. Aberer. Distributed similarity search in high dimensions using locality sensitive hashing. In EDBT, pages 744--755, 2009.

Digital Library

[12]

Q. Huang, J. Feng, Y. Zhang, Q. Fang, and W. Ng. Query-aware locality-sensitive hashing for approximate nearest neighbor search. In PVLDB, volume 9, pages 1--12, 2015.

Digital Library

[13]

P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604--613, 1998.

Digital Library

[14]

Learning to Hash. http://cs.nju.edu.cn/lwj/l2h.html. 2017.

[15]

S. Lee, J. K. Kim, X. Zheng, Q. Ho, G. A. Gibson, and E. P. Xing. On model parallelization and scheduling strategies for distributed machine learning. In NIPS, pages 2834--2842, 2014.

Digital Library

[16]

J. Li, J. Cheng, Y. Zhao, F. Yang, Y. Huang, H. Chen, and R. Zhao. A comparison of general-purpose distributed systems for data processing. In IEEE BigData, pages 378--383, 2016.

[17]

M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B. Su. Scaling distributed machine learning with the parameter server. In OSDI, pages 583--598, 2014.

Digital Library

[18]

LikeLike. https://github.com/takahi-i/likelike. 2017.

[19]

W. Liu, J. Wang, S. Kumar, and S. Chang. Hashing with graphs. In ICML, pages 1--8, 2011.

Digital Library

[20]

Y. Liu, J. Cui, Z. Huang, H. Li, and H. T. Shen. SK-LSH: an efficient index structure for approximate nearest neighbor search. In PVLDB, volume 7, pages 745--756, 2014.

Digital Library

[21]

LSH-Hadoop. https://github.com/lancenorskog/lsh-hadoop. 2017.

[22]

LSH-Spark. https://github.com/marufaytekin/lsh-spark. 2017.

[23]

Y. Lu, J. Cheng, D. Yan, and H. Wu. Large-scale distributed graph computing systems: An experimental evaluation. In PVLDB, volume 8, pages 281--292, 2014.

Digital Library

[24]

Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe LSH: efficient indexing for high-dimensional similarity search. In VLDB, pages 950--961, 2007.

Digital Library

[25]

G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD, pages 135--146, 2010.

Digital Library

[26]

L. Paulevé, H. Jégou, and L. Amsaleg. Locality sensitive hashing: A comparison of hash function types and querying mechanisms. In Pattern Recognition Letters, volume 31, pages 1348--1358, 2010.

Digital Library

[27]

A. Rajaraman, J. D. Ullman, J. D. Ullman, and J. D. Ullman. Mining of massive datasets, volume 1. 2012.

Digital Library

[28]

SoundCloud-LSH. https://github.com/soundcloud/cosine-lsh-join-spark. 2017.

[29]

Y. Sun, W. Wang, J. Qin, Y. Zhang, and X. Lin. SRS: solving c-approximate nearest neighbor queries in high dimensional euclidean space with a tiny index. In PVLDB, volume 8, pages 1--12, 2014.

Digital Library

[30]

N. Sundaram, A. Turmukhametova, N. Satish, T. Mostak, P. Indyk, S. Madden, and P. Dubey. Streaming similarity search over one billion tweets using parallel locality-sensitive hashing. In PVLDB, volume 6, pages 1930--1941, 2013.

Digital Library

[31]

Y. Tao, K. Yi, C. Sheng, and P. Kalnis. Quality and efficiency in high dimensional nearest neighbor search. In SIGMOD, pages 563--576, 2009.

Digital Library

[32]

J. Wang, H. T. Shen, J. Song, and J. Ji. Hashing for similarity search: A survey. In CoRR, volume abs/1408.2927, 2014.

[33]

F. Yang, Y. Huang, Y. Zhao, J. Li, G. Jiang, and J. Cheng. The best of both worlds: Big data programming with both productivity and performance. In SIGMOD, pages 1619--1622, 2017.

Digital Library

[34]

F. Yang, J. Li, and J. Cheng. Husky: Towards a more efficient and expressive distributed computing framework. In PVLDB, volume 9, pages 420--431, 2016.

Digital Library

[35]

F. Yang, F. Shang, Y. Huang, J. Cheng, J. Li, Y. Zhao, and R. Zhao. LFTF: A framework for efficient tensor analytics at scale. In PVLDB, volume 10, pages 745--756, 2017.

Digital Library

[36]

Y. Zheng, Q. Guo, A. K. Tung, and S. Wu. Lazylsh: Approximate nearest neighbor search for multiple distance functions with a single index. In SIGMOD, 2016.

Digital Library

Cited By

Huang RCui CLian JTowey DSun WChen H(2024)Toward Cost-Effective Adaptive Random Testing: An Approximate Nearest Neighbor ApproachIEEE Transactions on Software Engineering10.1109/TSE.2024.337959250:5(1182-1214)Online publication date: May-2024
https://doi.org/10.1109/TSE.2024.3379592
Zhang PHu Z(2024)NetDS: Distributed Search Framework with Hybrid Acceleration Methods2024 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA)10.1109/ISPA63168.2024.00301(2203-2210)Online publication date: 30-Oct-2024
https://doi.org/10.1109/ISPA63168.2024.00301
Li MWang YZhang PWang HFan LLi EWang W(2023)Deep Learning for Approximate Nearest Neighbour Search: A Survey and Future DirectionsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.322068335:9(8997-9018)Online publication date: 1-Sep-2023
https://doi.org/10.1109/TKDE.2022.3220683
Show More Cited By

Recommendations

Dynamic Multi-probe LSH: An I/O Efficient Index Structure for Approximate Nearest Neighbor Search
DEXA 2013: Proceedings of the 24th International Conference on Database and Expert Systems Applications - Volume 8055

Locality-Sensitive Hashing LSH is widely used to solve approximate nearest neighbor search problems in high-dimensional spaces. The basic idea is to map the "nearby" objects into a same hash bucket with high probability. A significant drawback is that ...
A posteriori multi-probe locality sensitive hashing
MM '08: Proceedings of the 16th ACM international conference on Multimedia

Efficient high-dimensional similarity search structures are essential for building scalable content-based search systems on feature-rich multimedia data. In the last decade, Locality Sensitive Hashing (LSH) has been proposed as indexing technique for ...
A robust method based on locality sensitive hashing for K-nearest neighbors searching
Abstract
K-nearest neighbors searching (KNNS) is to find K-nearest neighbors for query points. It is a primary problem in clustering analysis, classification, outlier detection and pattern recognition, and has been widely used in various applications. The ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGIR '17: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval

August 2017

1476 pages

ISBN:9781450350228

DOI:10.1145/3077136

General Chairs:
Noriko Kando
National Institute of Informatics
,
Tetsuya Sakai
Waseda University
,
Hideo Joho
University of Tsukuba
,
Program Chairs:
Hang Li
Huawei Noah's Ark Lab
,
Arjen P. de Vries
Radboud University
,
Ryen W. White
Microsoft Cortana

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 August 2017

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

ITF
Research Committee of CUHK
Hong Kong RGC

Conference

SIGIR '17

Sponsor:

SIGIR

SIGIR '17: The 40th International ACM SIGIR conference on research and development in Information Retrieval

August 7 - 11, 2017

Tokyo, Shinjuku, Japan

Acceptance Rates

SIGIR '17 Paper Acceptance Rate 78 of 362 submissions, 22%;

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

9
Total Citations
View Citations
460
Total Downloads

Downloads (Last 12 months)20
Downloads (Last 6 weeks)1

Reflects downloads up to 01 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Huang RCui CLian JTowey DSun WChen H(2024)Toward Cost-Effective Adaptive Random Testing: An Approximate Nearest Neighbor ApproachIEEE Transactions on Software Engineering10.1109/TSE.2024.337959250:5(1182-1214)Online publication date: May-2024
https://doi.org/10.1109/TSE.2024.3379592
Zhang PHu Z(2024)NetDS: Distributed Search Framework with Hybrid Acceleration Methods2024 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA)10.1109/ISPA63168.2024.00301(2203-2210)Online publication date: 30-Oct-2024
https://doi.org/10.1109/ISPA63168.2024.00301
Li MWang YZhang PWang HFan LLi EWang W(2023)Deep Learning for Approximate Nearest Neighbour Search: A Survey and Future DirectionsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.322068335:9(8997-9018)Online publication date: 1-Sep-2023
https://doi.org/10.1109/TKDE.2022.3220683
Y.AlKhanafseh MM.Surakhi O(2023)A Dynamic Locality Sensitive Hashing Algorithm for Efficient Security Applications2023 International Conference on Information Technology (ICIT)10.1109/ICIT58056.2023.10226046(524-530)Online publication date: 9-Aug-2023
https://doi.org/10.1109/ICIT58056.2023.10226046
Zhang PPan HLi ZCui PJia RHe PZhang ZTyson GXie G(2022)NetSHa: In-Network Acceleration of LSH-Based Distributed SearchIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.313584233:9(2213-2229)Online publication date: 1-Sep-2022
https://doi.org/10.1109/TPDS.2021.3135842
Lin WZhu FChin Ooi BMiao CWang HSkrypnyk IHsu WChawla S(2021)Large-Scale Network Embedding in Apache SparkProceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining10.1145/3447548.3467136(3271-3279)Online publication date: 14-Aug-2021
https://dl.acm.org/doi/10.1145/3447548.3467136
Zhang PPan HLi ZHe PZhang ZTyson GXie G(2021)Accelerating LSH-based Distributed Search with In-network ComputationIEEE INFOCOM 2021 - IEEE Conference on Computer Communications10.1109/INFOCOM42981.2021.9488722(1-10)Online publication date: 10-May-2021
https://doi.org/10.1109/INFOCOM42981.2021.9488722
Chen HLiu MZhao YYan XYan DCheng JOliveira RFelber PHu Y(2018)G-MinerProceedings of the Thirteenth EuroSys Conference10.1145/3190508.3190545(1-12)Online publication date: 23-Apr-2018
https://dl.acm.org/doi/10.1145/3190508.3190545
Li JYan XZhang JXu ACheng JLiu JNg KCheng TDas GJermaine CBernstein P(2018)A General and Efficient Querying Method for Learning to HashProceedings of the 2018 International Conference on Management of Data10.1145/3183713.3183750(1333-1347)Online publication date: 27-May-2018
https://dl.acm.org/doi/10.1145/3183713.3183750

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten