research-article

Self-taught hashing for fast similarity search

Authors:

Jinsong LuAuthors Info & Claims

SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval

Pages 18 - 25

https://doi.org/10.1145/1835449.1835455

Published: 19 July 2010 Publication History

Abstract

The ability of fast similarity search at large scale is of great importance to many Information Retrieval (IR) applications. A promising way to accelerate similarity search is semantic hashing which designs compact binary codes for a large number of documents so that semantically similar documents are mapped to similar codes (within a short Hamming distance). Although some recently proposed techniques are able to generate high-quality codes for documents known in advance, obtaining the codes for previously unseen documents remains to be a very challenging problem. In this paper, we emphasise this issue and propose a novel Self-Taught Hashing (STH) approach to semantic hashing: we first find the optimal l-bit binary codes for all documents in the given corpus via unsupervised learning, and then train l classifiers via supervised learning to predict the l-bit code for any query document unseen before. Our experiments on three real-world text datasets show that the proposed approach using binarised Laplacian Eigenmap (LapEig) and linear Support Vector Machine (SVM) outperforms state-of-the-art techniques significantly.

References

[1]

A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 459--468, Berkeley, CA, USA, 2006.

Digital Library

[2]

S. Baluja and M. Covell. Learning to hash: Forgiving hash functions and applications. Data Mining and Knowledge Discovery (DMKD), 17(3):402--430, 2008.

Digital Library

[3]

M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373--1396, 2003.

Digital Library

[4]

S. Belongie, C. Fowlkes, F. Chung, and J. Malik. Spectral partitioning with indefinite kernels using the nystrom extension. In Proceedings of the 7th European Conference on Computer Vision (ECCV), pages 531--542, Copenhagen, Denmark, 2002.

Digital Library

[5]

M.W. Berry, S. T. Dumais, and G. W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573--595, 1995.

Digital Library

[6]

F. R. K. Chung. Spectral Graph Theory. American Mathematical Society, 1997.

[7]

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press and McGraw-Hill, 2nd edition, 2001.

Digital Library

[8]

S. Deerwester, S. T. Dumais, G.W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science (JASIS), 41(6):391--407, 1990.

[9]

P. Drineas and M. W. Mahoney. On the nystrom method for approximating a gram matrix for improved kernel-based learning. Journal of Machine Learning Research (JMLR), 6:2153--2175, 2005.

Digital Library

[10]

S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the 7th ACM International Conference on Information and Knowledge Management (CIKM), pages 148--155, Bethesda, MD, 1998.

Digital Library

[11]

R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871--1874, 2008.

Digital Library

[12]

G. H. Golub and C. F. V. Loan. Matrix Computations. The Johns Hopkins University Press, 3rd edition, 1996.

Digital Library

[13]

L. W. Hagen and A. B. Kahng. New spectral methods for ratio cut partitioning and clustering. IEEE Transactions on CAD of Integrated Circuits and Systems, 11(9):1074--1085, 1992.

Digital Library

[14]

A. Y. Halevy, P. Norvig, and F. Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2):8--12, 2009.

Digital Library

[15]

J. Hays and A. A. Efros. Scene completion using millions of photographs. ACM Transactions on Graphics (TOG), 26(3):4, 2007.

Digital Library

[16]

X. He, D. Cai, H. Liu, and W.-Y. Ma. Locality preserving indexing for document representation. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 96--103, Sheffield, UK, 2004.

Digital Library

[17]

X. He and P. Niyogi. Locality preserving projections. In Advances in Neural Information Processing Systems (NIPS), volume 16, pages 153--160, Vancouver and Whistler, Canada, 2003.

[18]

M. R. Henzinger. Finding near-duplicate web pages: A large-scale evaluation of algorithms. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 284--291, Seattle, WA, USA, 2006.

Digital Library

[19]

G. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527--1554, 2006.

Digital Library

[20]

G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504--507, July 2006.

[21]

C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear svm. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 408--415, Helsinki, Finland, 2008.

Digital Library

[22]

T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning (ECML), pages 137--142, Chemnitz, Germany, 1998.

Digital Library

[23]

T. Joachims. Learning to Classify Text using Support Vector Machines. Kluwer, 2002.

[24]

T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 217--226, Philadelphia, PA, 2006.

Digital Library

[25]

D. Knuth. The Art of Computer Programming. Addison-Wesley, 3rd edition, 1997.

Digital Library

[26]

Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 426--434, Las Vegas, NV, USA, 2008.

Digital Library

[27]

K. Lang. Newsweeder: Learning to filter netnews. In Proceedings of the 12th International Conference on Machine Learning (ICML), pages 331--339, Tahoe City, CA, 1995.

[28]

M. S. Lew, N. Sebe, C. Djeraba, and R. Jain. Content-based multimedia information retrieval: State of the art and challenges. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), 2(1):1--19, 2006.

Digital Library

[29]

J. Lin. Brute force and indexed approaches to pairwise document similarity comparisons with mapreduce. In Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 155--162, Boston, MA, USA, 2009.

Digital Library

[30]

C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

[31]

T. Mitchell. Machine Learning. McGraw Hill, international edition, 1997.

Digital Library

[32]

S. Pandey, A. Broder, F. Chierichetti, V. Josifovski, R. Kumar, and S. Vassilvitskii. Nearest-neighbor caching for content-match applications. In Proceedings of the 18th International Conference on World Wide Web (WWW), pages 441--450, Madrid, Spain, 2009.

Digital Library

[33]

R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: Transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning (ICML), pages 759--766, Corvalis, OR, USA, 2007.

Digital Library

[34]

R. Salakhutdinov and G. Hinton. Semantic hashing. International Journal of Approximate Reasoning (IJAR), 50(7):969--978, 2009.

Digital Library

[35]

R. E. Schapire. The boosting approach to machine learning: An overview. In Nonlinear Estimation and Classification. Springer, 2003.

[36]

B. Scholkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

[37]

F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1--47, 2002.

Digital Library

[38]

G. Shakhnarovich, P. A. Viola, and T. Darrell. Fast pose estimation with parameter-sensitive hashing. In Proceedings of the 9th IEEE International Conference on Computer Vision (ICCV), pages 750--759, Nice, France, 2003.

Digital Library

[39]

C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379--423, 623--656, 1948.

[40]

J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 22(8):888--905, 2000.

Digital Library

[41]

Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, and S. Vishwanathan. Hash kernels for structured data. Journal of Machine Learning Research (JMLR), 10:2615--2637, 2009.

Digital Library

[42]

B. Stein. Principles of hash-based text retrieval. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 527--534, Amsterdam, The Netherlands, 2007.

Digital Library

[43]

B. Stein, S. M. zu Eissen, and M. Potthast. Strategies for retrieving plagiarized documents. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 825--826, Amsterdam, The Netherlands, 2007.

Digital Library

[44]

A. B. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 1--8, Anchorage, AK, USA, 2008.

[45]

R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of 24th International Conference on Very Large Data Bases (VLDB), pages 194--205, New York City, USA, 1998.

Digital Library

[46]

P. Wegner. A technique for counting ones in a binary computer. Communications of the ACM (CACM), 3(5):322, 1960.

Digital Library

[47]

K. Q. Weinberger, A. Dasgupta, J. Langford, A. J. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), page 140, Montreal, Quebec, Canada, 2009.

Digital Library

[48]

Y. Weiss, A. B. Torralba, and R. Fergus. Spectral hashing. In Advances in Neural Information Processing Systems (NIPS), volume 21, pages 1753--1760, Vancouver, Canada, 2008.

[49]

Y. Yang and X. Liu. A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 42--49, Berkeley, CA, 1999.

Digital Library

[50]

D. Zhang, J. Wang, D. Cai, and J. Lu. Laplacian co-hashing of terms and documents. In Proceedings of the 32nd European Conference on IR Research (ECIR), page 577--580, Milton Keynes, UK, 2010.

Digital Library

Cited By

He CWei HGurrin CKongkachandra RSchoeffmann KDang-Nguyen DRossetto LSatoh SZhou L(2024)HybridHash: Hybrid Convolutional and Self-Attention Deep Hashing for Image RetrievalProceedings of the 2024 International Conference on Multimedia Retrieval10.1145/3652583.3658014(824-832)Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3652583.3658014
Duan JHao YZhu BCheng LZhou PWang X(2024)Efficient Unsupervised Video Hashing With Contextual Modeling and Structural ControllingIEEE Transactions on Multimedia10.1109/TMM.2024.336892426(7438-7450)Online publication date: 22-Feb-2024
https://dl.acm.org/doi/10.1109/TMM.2024.3368924
Zhao WZhou DCao BZhang KChen J(2024)Adversarial Modality Alignment Network for Cross-Modal Molecule RetrievalIEEE Transactions on Artificial Intelligence10.1109/TAI.2023.32545185:1(278-289)Online publication date: Jan-2024
https://doi.org/10.1109/TAI.2023.3254518
Show More Cited By

Index Terms

Self-taught hashing for fast similarity search
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees
2. Information systems
  1. Information retrieval
    1. Document representation
  2. Information systems applications
    1. Data mining

Recommendations

Unsupervised Semantic Hashing with Pairwise Reconstruction
SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval

Semantic Hashing is a popular family of methods for efficient similarity search in large-scale datasets. In Semantic Hashing, documents are encoded as short binary vectors (i.e., hash codes), such that semantic similarity can be efficiently computed ...
Unsupervised Multi-Index Semantic Hashing
WWW '21: Proceedings of the Web Conference 2021

Semantic hashing represents documents as compact binary vectors (hash codes) and allows both efficient and effective similarity search in large-scale information retrieval. The state of the art has primarily focused on learning hash codes that improve ...
Weighted hashing for fast large scale similarity search
CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management

Similarity search, or finding approximate nearest neighbors, is an important technique for many applications. Many recent research demonstrate that hashing methods can achieve promising results for large scale similarity search due to its computational ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval

July 2010

944 pages

ISBN:9781450301534

DOI:10.1145/1835449

General Chairs:
Fabio Crestani
University of Lugano, CH
,
Stéphane Marchand-Maillet
University of Geneva, CH
,
Program Chairs:
Hsin-Hsi Chen
National Taiwan University, TW
,
Efthimis N. Efthimiadis
University of Washington, USA
,
Jacques Savoy
University of Neuchatel, CH

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 July 2010

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

SIGIR '10

Sponsor:

SIGIR

SIGIR '10: The 33rd International ACM SIGIR conference on research and development in Information Retrieval

July 19 - 23, 2010

Geneva, Switzerland

Acceptance Rates

SIGIR '10 Paper Acceptance Rate 87 of 520 submissions, 17%;

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

264
Total Citations
View Citations
1,695
Total Downloads

Downloads (Last 12 months)26
Downloads (Last 6 weeks)1

Reflects downloads up to 31 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

He CWei HGurrin CKongkachandra RSchoeffmann KDang-Nguyen DRossetto LSatoh SZhou L(2024)HybridHash: Hybrid Convolutional and Self-Attention Deep Hashing for Image RetrievalProceedings of the 2024 International Conference on Multimedia Retrieval10.1145/3652583.3658014(824-832)Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3652583.3658014
Duan JHao YZhu BCheng LZhou PWang X(2024)Efficient Unsupervised Video Hashing With Contextual Modeling and Structural ControllingIEEE Transactions on Multimedia10.1109/TMM.2024.336892426(7438-7450)Online publication date: 22-Feb-2024
https://dl.acm.org/doi/10.1109/TMM.2024.3368924
Zhao WZhou DCao BZhang KChen J(2024)Adversarial Modality Alignment Network for Cross-Modal Molecule RetrievalIEEE Transactions on Artificial Intelligence10.1109/TAI.2023.32545185:1(278-289)Online publication date: Jan-2024
https://doi.org/10.1109/TAI.2023.3254518
Wu SYuan XXiao GLew MGao X(2024)Deep cross-modal hashing with multi-task latent space learningEngineering Applications of Artificial Intelligence10.1016/j.engappai.2024.108944136(108944)Online publication date: Oct-2024
https://doi.org/10.1016/j.engappai.2024.108944
Singh USinha ISingh KVerma S(2024)Meta-DPSTL: meta learning-based differentially private self-taught learningInternational Journal of Machine Learning and Cybernetics10.1007/s13042-024-02134-215:9(4021-4053)Online publication date: 9-May-2024
https://doi.org/10.1007/s13042-024-02134-2
He LHuang ZChen ELiu QTong SWang HLian DWang S(2023)An Efficient and Robust Semantic Hashing Framework for Similar Text SearchACM Transactions on Information Systems10.1145/357072541:4(1-31)Online publication date: 22-Mar-2023
https://dl.acm.org/doi/10.1145/3570725
He CWei H(2023)Transformer-Based Deep Hashing Method for Multi-Scale Feature FusionICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)10.1109/ICASSP49357.2023.10094794(1-5)Online publication date: 4-Jun-2023
https://doi.org/10.1109/ICASSP49357.2023.10094794
Zhang MMa YLuo GLi SQian ZZhang X(2023)Identifying the Style of Chatting2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)10.1109/APSIPAASC58517.2023.10317307(1085-1092)Online publication date: 31-Oct-2023
https://doi.org/10.1109/APSIPAASC58517.2023.10317307
He CWei HLu K(2023)Multi-feature Fusion-Based Central Similarity Deep Supervised HashingPattern Recognition and Computer Vision10.1007/978-981-99-8540-1_27(335-345)Online publication date: 25-Dec-2023
https://doi.org/10.1007/978-981-99-8540-1_27
Chen MYan JGao TLi YMa H(2022)Duplicate Image Representation Based on Semi-Supervised LearningInternational Journal of Grid and High Performance Computing10.4018/IJGHPC.30157814:1(1-13)Online publication date: 29-Jun-2022
https://dl.acm.org/doi/10.4018/IJGHPC.301578
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten