DOI: 10.1145/1458082.1458105
research-article

An empirical study of required dimensionality for large-scale latent semantic indexing applications

Published: 26 October 2008

Abstract

The technique of latent semantic indexing (LSI) is used in a wide variety of commercial applications. In these applications, the processing time and RAM required for SVD computation, and the processing time and RAM required during LSI retrieval operations, are all roughly linear in the number of dimensions, k, chosen for the LSI representation space. In large-scale commercial LSI applications, reducing k values could be of significant value in reducing server costs. This paper explores the effects of varying dimensionality.
The approach taken here focuses on term comparisons. Pairs of terms that have strong real-world associations are considered. The proximities of the members of these pairs in the LSI space are compared at multiple values of k. The testing is carried out for collections ranging from one to five million documents. For the five million document collection, a value of k ≈ 400 provides the best performance.
The results suggest that there is something of an 'island of stability' in the k = 300 to 500 range. The results also indicate that there is relatively little room to employ k values outside of this range without incurring significant distortions in at least some term-term correlations.
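
The term-comparison idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's code or data. The toy term-document matrix, the vocabulary, the term pairs, and the function names below are all hypothetical. Scaling term vectors by the singular values and using cosine similarity as the proximity measure are assumptions, since the abstract does not specify either choice.

```python
import numpy as np

# Hypothetical toy term-document matrix (rows = terms, columns = documents).
vocab = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 1, 0, 0, 0],  # car
    [0, 1, 1, 0, 0, 0],  # automobile
    [1, 1, 1, 0, 0, 0],  # engine
    [0, 0, 0, 1, 0, 1],  # flower
    [0, 0, 0, 0, 1, 1],  # petal
], dtype=float)

# Hypothetical term pairs assumed to have strong real-world associations.
pairs = [("car", "automobile"), ("flower", "petal")]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


def pair_similarity_by_k(A, vocab, pairs, ks):
    """For each k, build k-dimensional term vectors and score each term pair."""
    # One full SVD is enough here; each k just keeps the leading k factors.
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    index = {term: i for i, term in enumerate(vocab)}
    results = {}
    for k in ks:
        # Assumption: term vectors are rows of U_k scaled by the singular values.
        T = U[:, :k] * s[:k]
        results[k] = {
            (a, b): cosine(T[index[a]], T[index[b]]) for a, b in pairs
        }
    return results


if __name__ == "__main__":
    for k, scores in pair_similarity_by_k(A, vocab, pairs, ks=[1, 2, 3, 5]).items():
        print(k, scores)
```

On a real collection, a sparse truncated SVD would replace the dense `np.linalg.svd` call, and the values of k would span the range of interest (for example, around the 300 to 500 range discussed above).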

    Published In

    CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management
    October 2008
    1562 pages
    ISBN:9781595939913
    DOI:10.1145/1458082
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. dimensionality
    2. latent semantic indexing
    3. lsi

    Qualifiers

    • Research-article

    Conference

    CIKM08
    CIKM08: Conference on Information and Knowledge Management
    October 26 - 30, 2008
    Napa Valley, California, USA

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
