skip to main content
10.1145/1835449.1835455acmconferencesArticle/Chapter ViewAbstractPublication PagesirConference Proceedingsconference-collections
research-article

Self-taught hashing for fast similarity search

Published: 19 July 2010 Publication History

Abstract

The ability of fast similarity search at large scale is of great importance to many Information Retrieval (IR) applications. A promising way to accelerate similarity search is semantic hashing which designs compact binary codes for a large number of documents so that semantically similar documents are mapped to similar codes (within a short Hamming distance). Although some recently proposed techniques are able to generate high-quality codes for documents known in advance, obtaining the codes for previously unseen documents remains to be a very challenging problem. In this paper, we emphasise this issue and propose a novel Self-Taught Hashing (STH) approach to semantic hashing: we first find the optimal l-bit binary codes for all documents in the given corpus via unsupervised learning, and then train l classifiers via supervised learning to predict the l-bit code for any query document unseen before. Our experiments on three real-world text datasets show that the proposed approach using binarised Laplacian Eigenmap (LapEig) and linear Support Vector Machine (SVM) outperforms state-of-the-art techniques significantly.

References

[1]
A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 459--468, Berkeley, CA, USA, 2006.
[2]
S. Baluja and M. Covell. Learning to hash: Forgiving hash functions and applications. Data Mining and Knowledge Discovery (DMKD), 17(3):402--430, 2008.
[3]
M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373--1396, 2003.
[4]
S. Belongie, C. Fowlkes, F. Chung, and J. Malik. Spectral partitioning with indefinite kernels using the nystrom extension. In Proceedings of the 7th European Conference on Computer Vision (ECCV), pages 531--542, Copenhagen, Denmark, 2002.
[5]
M.W. Berry, S. T. Dumais, and G. W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573--595, 1995.
[6]
F. R. K. Chung. Spectral Graph Theory. American Mathematical Society, 1997.
[7]
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press and McGraw-Hill, 2nd edition, 2001.
[8]
S. Deerwester, S. T. Dumais, G.W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science (JASIS), 41(6):391--407, 1990.
[9]
P. Drineas and M. W. Mahoney. On the nystrom method for approximating a gram matrix for improved kernel-based learning. Journal of Machine Learning Research (JMLR), 6:2153--2175, 2005.
[10]
S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the 7th ACM International Conference on Information and Knowledge Management (CIKM), pages 148--155, Bethesda, MD, 1998.
[11]
R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871--1874, 2008.
[12]
G. H. Golub and C. F. V. Loan. Matrix Computations. The Johns Hopkins University Press, 3rd edition, 1996.
[13]
L. W. Hagen and A. B. Kahng. New spectral methods for ratio cut partitioning and clustering. IEEE Transactions on CAD of Integrated Circuits and Systems, 11(9):1074--1085, 1992.
[14]
A. Y. Halevy, P. Norvig, and F. Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2):8--12, 2009.
[15]
J. Hays and A. A. Efros. Scene completion using millions of photographs. ACM Transactions on Graphics (TOG), 26(3):4, 2007.
[16]
X. He, D. Cai, H. Liu, and W.-Y. Ma. Locality preserving indexing for document representation. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 96--103, Sheffield, UK, 2004.
[17]
X. He and P. Niyogi. Locality preserving projections. In Advances in Neural Information Processing Systems (NIPS), volume 16, pages 153--160, Vancouver and Whistler, Canada, 2003.
[18]
M. R. Henzinger. Finding near-duplicate web pages: A large-scale evaluation of algorithms. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 284--291, Seattle, WA, USA, 2006.
[19]
G. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527--1554, 2006.
[20]
G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504--507, July 2006.
[21]
C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear svm. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 408--415, Helsinki, Finland, 2008.
[22]
T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning (ECML), pages 137--142, Chemnitz, Germany, 1998.
[23]
T. Joachims. Learning to Classify Text using Support Vector Machines. Kluwer, 2002.
[24]
T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 217--226, Philadelphia, PA, 2006.
[25]
D. Knuth. The Art of Computer Programming. Addison-Wesley, 3rd edition, 1997.
[26]
Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 426--434, Las Vegas, NV, USA, 2008.
[27]
K. Lang. Newsweeder: Learning to filter netnews. In Proceedings of the 12th International Conference on Machine Learning (ICML), pages 331--339, Tahoe City, CA, 1995.
[28]
M. S. Lew, N. Sebe, C. Djeraba, and R. Jain. Content-based multimedia information retrieval: State of the art and challenges. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), 2(1):1--19, 2006.
[29]
J. Lin. Brute force and indexed approaches to pairwise document similarity comparisons with mapreduce. In Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 155--162, Boston, MA, USA, 2009.
[30]
C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[31]
T. Mitchell. Machine Learning. McGraw Hill, international edition, 1997.
[32]
S. Pandey, A. Broder, F. Chierichetti, V. Josifovski, R. Kumar, and S. Vassilvitskii. Nearest-neighbor caching for content-match applications. In Proceedings of the 18th International Conference on World Wide Web (WWW), pages 441--450, Madrid, Spain, 2009.
[33]
R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: Transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning (ICML), pages 759--766, Corvalis, OR, USA, 2007.
[34]
R. Salakhutdinov and G. Hinton. Semantic hashing. International Journal of Approximate Reasoning (IJAR), 50(7):969--978, 2009.
[35]
R. E. Schapire. The boosting approach to machine learning: An overview. In Nonlinear Estimation and Classification. Springer, 2003.
[36]
B. Scholkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[37]
F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1--47, 2002.
[38]
G. Shakhnarovich, P. A. Viola, and T. Darrell. Fast pose estimation with parameter-sensitive hashing. In Proceedings of the 9th IEEE International Conference on Computer Vision (ICCV), pages 750--759, Nice, France, 2003.
[39]
C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379--423, 623--656, 1948.
[40]
J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 22(8):888--905, 2000.
[41]
Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, and S. Vishwanathan. Hash kernels for structured data. Journal of Machine Learning Research (JMLR), 10:2615--2637, 2009.
[42]
B. Stein. Principles of hash-based text retrieval. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 527--534, Amsterdam, The Netherlands, 2007.
[43]
B. Stein, S. M. zu Eissen, and M. Potthast. Strategies for retrieving plagiarized documents. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 825--826, Amsterdam, The Netherlands, 2007.
[44]
A. B. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 1--8, Anchorage, AK, USA, 2008.
[45]
R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of 24th International Conference on Very Large Data Bases (VLDB), pages 194--205, New York City, USA, 1998.
[46]
P. Wegner. A technique for counting ones in a binary computer. Communications of the ACM (CACM), 3(5):322, 1960.
[47]
K. Q. Weinberger, A. Dasgupta, J. Langford, A. J. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), page 140, Montreal, Quebec, Canada, 2009.
[48]
Y. Weiss, A. B. Torralba, and R. Fergus. Spectral hashing. In Advances in Neural Information Processing Systems (NIPS), volume 21, pages 1753--1760, Vancouver, Canada, 2008.
[49]
Y. Yang and X. Liu. A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 42--49, Berkeley, CA, 1999.
[50]
D. Zhang, J. Wang, D. Cai, and J. Lu. Laplacian co-hashing of terms and documents. In Proceedings of the 32nd European Conference on IR Research (ECIR), page 577--580, Milton Keynes, UK, 2010.

Cited By

View all
  • (2024)HybridHash: Hybrid Convolutional and Self-Attention Deep Hashing for Image RetrievalProceedings of the 2024 International Conference on Multimedia Retrieval10.1145/3652583.3658014(824-832)Online publication date: 30-May-2024
  • (2024)Efficient Unsupervised Video Hashing With Contextual Modeling and Structural ControllingIEEE Transactions on Multimedia10.1109/TMM.2024.336892426(7438-7450)Online publication date: 22-Feb-2024
  • (2024)Adversarial Modality Alignment Network for Cross-Modal Molecule RetrievalIEEE Transactions on Artificial Intelligence10.1109/TAI.2023.32545185:1(278-289)Online publication date: Jan-2024
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
July 2010
944 pages
ISBN:9781450301534
DOI:10.1145/1835449
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 July 2010

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. laplacian eigenmap
  2. semantic hashing
  3. similarity search
  4. support vector machine

Qualifiers

  • Research-article

Conference

SIGIR '10
Sponsor:

Acceptance Rates

SIGIR '10 Paper Acceptance Rate 87 of 520 submissions, 17%;
Overall Acceptance Rate 792 of 3,983 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)26
  • Downloads (Last 6 weeks)1
Reflects downloads up to 30 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)HybridHash: Hybrid Convolutional and Self-Attention Deep Hashing for Image RetrievalProceedings of the 2024 International Conference on Multimedia Retrieval10.1145/3652583.3658014(824-832)Online publication date: 30-May-2024
  • (2024)Efficient Unsupervised Video Hashing With Contextual Modeling and Structural ControllingIEEE Transactions on Multimedia10.1109/TMM.2024.336892426(7438-7450)Online publication date: 22-Feb-2024
  • (2024)Adversarial Modality Alignment Network for Cross-Modal Molecule RetrievalIEEE Transactions on Artificial Intelligence10.1109/TAI.2023.32545185:1(278-289)Online publication date: Jan-2024
  • (2024)Deep cross-modal hashing with multi-task latent space learningEngineering Applications of Artificial Intelligence10.1016/j.engappai.2024.108944136(108944)Online publication date: Oct-2024
  • (2024)Meta-DPSTL: meta learning-based differentially private self-taught learningInternational Journal of Machine Learning and Cybernetics10.1007/s13042-024-02134-215:9(4021-4053)Online publication date: 9-May-2024
  • (2023)An Efficient and Robust Semantic Hashing Framework for Similar Text SearchACM Transactions on Information Systems10.1145/357072541:4(1-31)Online publication date: 22-Mar-2023
  • (2023)Transformer-Based Deep Hashing Method for Multi-Scale Feature FusionICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)10.1109/ICASSP49357.2023.10094794(1-5)Online publication date: 4-Jun-2023
  • (2023)Identifying the Style of Chatting2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)10.1109/APSIPAASC58517.2023.10317307(1085-1092)Online publication date: 31-Oct-2023
  • (2023)Multi-feature Fusion-Based Central Similarity Deep Supervised HashingPattern Recognition and Computer Vision10.1007/978-981-99-8540-1_27(335-345)Online publication date: 25-Dec-2023
  • (2022)Duplicate Image Representation Based on Semi-Supervised LearningInternational Journal of Grid and High Performance Computing10.4018/IJGHPC.30157814:1(1-13)Online publication date: 29-Jun-2022
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media