skip to main content
10.1145/3077136.3080800acmconferencesArticle/Chapter ViewAbstractPublication PagesirConference Proceedingsconference-collections
research-article

LoSHa: A General Framework for Scalable Locality Sensitive Hashing

Published: 07 August 2017 Publication History

Abstract

Locality Sensitive Hashing (LSH) algorithms are widely adopted to index similar items in high dimensional space for approximate nearest neighbor search. As the volume of real-world datasets keeps growing, it has become necessary to develop distributed LSH solutions. Implementing a distributed LSH algorithm from scratch requires high development costs, thus most existing solutions are developed on general-purpose platforms such as Hadoop and Spark. However, we argue that these platforms are both hard to use for programming LSH algorithms and inefficient for LSH computation. We propose LoSHa, a distributed computing framework that reduces the development cost by designing a tailor-made, general programming interface and achieves high efficiency by exploring LSH-specific system implementation and optimizations. We show that many LSH algorithms can be easily expressed in LoSHa's API. We evaluate LoSHa and also compare with general-purpose platforms on the same LSH algorithms. Our results show that LoSHa's performance can be an order of magnitude faster, while the implementations on LoSHa are even more intuitive and require few lines of code.

References

[1]
B. Bahmani, A. Goel, and R. Shinde. Efficient distributed locality sensitive hashing. In CIKM, pages 2174--2178, 2012.
[2]
A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations (extended abstract). In STOC, pages 327--336, 1998.
[3]
M. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380--388, 2002.
[4]
A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW, pages 271--280, 2007.
[5]
M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, pages 253--262, 2004.
[6]
J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, pages 137--150, 2004.
[7]
J. Gan, J. Feng, Q. Fang, and W. Ng. Locality-sensitive hashing scheme based on dynamic collision counting. In SIGMOD, pages 541--552, 2012.
[8]
J. Gao, H. V. Jagadish, W. Lu, and B. C. Ooi. DSH: data sensitive hashing for high-dimensional k-nnsearch. In SIGMOD, pages 1127--1138, 2014.
[9]
A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, pages 518--529, 1999.
[10]
J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. Powergraph: Distributed graph-parallel computation on natural graphs. In OSDI, pages 17--30, 2012.
[11]
P. Haghani, S. Michel, and K. Aberer. Distributed similarity search in high dimensions using locality sensitive hashing. In EDBT, pages 744--755, 2009.
[12]
Q. Huang, J. Feng, Y. Zhang, Q. Fang, and W. Ng. Query-aware locality-sensitive hashing for approximate nearest neighbor search. In PVLDB, volume 9, pages 1--12, 2015.
[13]
P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604--613, 1998.
[14]
Learning to Hash. http://cs.nju.edu.cn/lwj/l2h.html. 2017.
[15]
S. Lee, J. K. Kim, X. Zheng, Q. Ho, G. A. Gibson, and E. P. Xing. On model parallelization and scheduling strategies for distributed machine learning. In NIPS, pages 2834--2842, 2014.
[16]
J. Li, J. Cheng, Y. Zhao, F. Yang, Y. Huang, H. Chen, and R. Zhao. A comparison of general-purpose distributed systems for data processing. In IEEE BigData, pages 378--383, 2016.
[17]
M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B. Su. Scaling distributed machine learning with the parameter server. In OSDI, pages 583--598, 2014.
[18]
LikeLike. https://github.com/takahi-i/likelike. 2017.
[19]
W. Liu, J. Wang, S. Kumar, and S. Chang. Hashing with graphs. In ICML, pages 1--8, 2011.
[20]
Y. Liu, J. Cui, Z. Huang, H. Li, and H. T. Shen. SK-LSH: an efficient index structure for approximate nearest neighbor search. In PVLDB, volume 7, pages 745--756, 2014.
[21]
LSH-Hadoop. https://github.com/lancenorskog/lsh-hadoop. 2017.
[22]
LSH-Spark. https://github.com/marufaytekin/lsh-spark. 2017.
[23]
Y. Lu, J. Cheng, D. Yan, and H. Wu. Large-scale distributed graph computing systems: An experimental evaluation. In PVLDB, volume 8, pages 281--292, 2014.
[24]
Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe LSH: efficient indexing for high-dimensional similarity search. In VLDB, pages 950--961, 2007.
[25]
G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD, pages 135--146, 2010.
[26]
L. Paulevé, H. Jégou, and L. Amsaleg. Locality sensitive hashing: A comparison of hash function types and querying mechanisms. In Pattern Recognition Letters, volume 31, pages 1348--1358, 2010.
[27]
A. Rajaraman, J. D. Ullman, J. D. Ullman, and J. D. Ullman. Mining of massive datasets, volume 1. 2012.
[28]
SoundCloud-LSH. https://github.com/soundcloud/cosine-lsh-join-spark. 2017.
[29]
Y. Sun, W. Wang, J. Qin, Y. Zhang, and X. Lin. SRS: solving c-approximate nearest neighbor queries in high dimensional euclidean space with a tiny index. In PVLDB, volume 8, pages 1--12, 2014.
[30]
N. Sundaram, A. Turmukhametova, N. Satish, T. Mostak, P. Indyk, S. Madden, and P. Dubey. Streaming similarity search over one billion tweets using parallel locality-sensitive hashing. In PVLDB, volume 6, pages 1930--1941, 2013.
[31]
Y. Tao, K. Yi, C. Sheng, and P. Kalnis. Quality and efficiency in high dimensional nearest neighbor search. In SIGMOD, pages 563--576, 2009.
[32]
J. Wang, H. T. Shen, J. Song, and J. Ji. Hashing for similarity search: A survey. In CoRR, volume abs/1408.2927, 2014.
[33]
F. Yang, Y. Huang, Y. Zhao, J. Li, G. Jiang, and J. Cheng. The best of both worlds: Big data programming with both productivity and performance. In SIGMOD, pages 1619--1622, 2017.
[34]
F. Yang, J. Li, and J. Cheng. Husky: Towards a more efficient and expressive distributed computing framework. In PVLDB, volume 9, pages 420--431, 2016.
[35]
F. Yang, F. Shang, Y. Huang, J. Cheng, J. Li, Y. Zhao, and R. Zhao. LFTF: A framework for efficient tensor analytics at scale. In PVLDB, volume 10, pages 745--756, 2017.
[36]
Y. Zheng, Q. Guo, A. K. Tung, and S. Wu. Lazylsh: Approximate nearest neighbor search for multiple distance functions with a single index. In SIGMOD, 2016.

Cited By

View all
  • (2024)Toward Cost-Effective Adaptive Random Testing: An Approximate Nearest Neighbor ApproachIEEE Transactions on Software Engineering10.1109/TSE.2024.337959250:5(1182-1214)Online publication date: May-2024
  • (2024)NetDS: Distributed Search Framework with Hybrid Acceleration Methods2024 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA)10.1109/ISPA63168.2024.00301(2203-2210)Online publication date: 30-Oct-2024
  • (2023)Deep Learning for Approximate Nearest Neighbour Search: A Survey and Future DirectionsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.322068335:9(8997-9018)Online publication date: 1-Sep-2023
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGIR '17: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval
August 2017
1476 pages
ISBN:9781450350228
DOI:10.1145/3077136
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 August 2017

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. distributed similarity search
  2. locality sensitive hashing
  3. retrieval of high-dimensional data

Qualifiers

  • Research-article

Funding Sources

  • ITF
  • Research Committee of CUHK
  • Hong Kong RGC

Conference

SIGIR '17
Sponsor:

Acceptance Rates

SIGIR '17 Paper Acceptance Rate 78 of 362 submissions, 22%;
Overall Acceptance Rate 792 of 3,983 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)20
  • Downloads (Last 6 weeks)1
Reflects downloads up to 01 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Toward Cost-Effective Adaptive Random Testing: An Approximate Nearest Neighbor ApproachIEEE Transactions on Software Engineering10.1109/TSE.2024.337959250:5(1182-1214)Online publication date: May-2024
  • (2024)NetDS: Distributed Search Framework with Hybrid Acceleration Methods2024 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA)10.1109/ISPA63168.2024.00301(2203-2210)Online publication date: 30-Oct-2024
  • (2023)Deep Learning for Approximate Nearest Neighbour Search: A Survey and Future DirectionsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.322068335:9(8997-9018)Online publication date: 1-Sep-2023
  • (2023)A Dynamic Locality Sensitive Hashing Algorithm for Efficient Security Applications2023 International Conference on Information Technology (ICIT)10.1109/ICIT58056.2023.10226046(524-530)Online publication date: 9-Aug-2023
  • (2022)NetSHa: In-Network Acceleration of LSH-Based Distributed SearchIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.313584233:9(2213-2229)Online publication date: 1-Sep-2022
  • (2021)Large-Scale Network Embedding in Apache SparkProceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining10.1145/3447548.3467136(3271-3279)Online publication date: 14-Aug-2021
  • (2021)Accelerating LSH-based Distributed Search with In-network ComputationIEEE INFOCOM 2021 - IEEE Conference on Computer Communications10.1109/INFOCOM42981.2021.9488722(1-10)Online publication date: 10-May-2021
  • (2018)G-MinerProceedings of the Thirteenth EuroSys Conference10.1145/3190508.3190545(1-12)Online publication date: 23-Apr-2018
  • (2018)A General and Efficient Querying Method for Learning to HashProceedings of the 2018 International Conference on Management of Data10.1145/3183713.3183750(1333-1347)Online publication date: 27-May-2018

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media