A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval

Published: 02 August 2017

Abstract

Language modeling approaches to information retrieval are attractive and promising because they connect the problem of retrieval with that of language model estimation, which has been studied extensively in other application areas such as speech recognition. The basic idea of these approaches is to estimate a language model for each document, and then rank documents by the likelihood of the query according to the estimated language model. A core problem in language model estimation is smoothing, which adjusts the maximum likelihood estimator so as to correct the inaccuracy due to data sparseness. In this paper, we study the problem of language model smoothing and its influence on retrieval performance. We examine the sensitivity of retrieval performance to the smoothing parameters and compare several popular smoothing methods on different test collections.
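The ranking idea described in the abstract can be illustrated with a minimal sketch. It assumes a unigram document model and uses Jelinek-Mercer linear interpolation with a collection model as one example of a smoothing method; the abstract does not say which methods the paper compares, so this choice, the function names, the toy documents, and the default `lam` value are all illustrative assumptions rather than the authors' setup.

```python
import math
from collections import Counter


def jm_query_log_likelihood(query, doc, collection_counts, collection_len, lam=0.5):
    """Log-likelihood of the query under a smoothed unigram document model.

    Assumed smoothing (illustrative): Jelinek-Mercer interpolation,
    p(w|d) = (1 - lam) * p_ml(w|d) + lam * p(w|collection).
    """
    doc_counts = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for w in query:
        # Maximum likelihood estimate from the document alone.
        p_ml = doc_counts[w] / doc_len if doc_len else 0.0
        # Background estimate from the whole collection.
        p_coll = collection_counts[w] / collection_len
        p = (1.0 - lam) * p_ml + lam * p_coll
        if p == 0.0:
            # An unseen word with no smoothing mass zeroes out the likelihood.
            return float("-inf")
        score += math.log(p)
    return score


def rank(query, docs, lam=0.5):
    """Return (score, doc_index) pairs sorted from best to worst."""
    collection_counts = Counter(w for d in docs for w in d)
    collection_len = sum(collection_counts.values())
    scored = [
        (jm_query_log_likelihood(query, d, collection_counts, collection_len, lam), i)
        for i, d in enumerate(docs)
    ]
    return sorted(scored, reverse=True)


# Hypothetical toy collection, for illustration only.
docs = [
    "language model smoothing for retrieval".split(),
    "speech recognition with language models".split(),
    "cooking pasta at home".split(),
]
query = "language smoothing".split()
ranking = rank(query, docs)
```

Setting `lam=0.0` in this sketch falls back to the unsmoothed maximum likelihood estimator: any document missing a single query word then receives a likelihood of zero, which is the data sparseness problem that smoothing is meant to correct.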


Published in

ACM SIGIR Forum, Volume 51, Issue 2
SIGIR Test-of-Time Awardees 1978-2001
July 2017
276 pages
ISSN: 0163-5840
DOI: 10.1145/3130348
Editors: Donna Harman, Diane Kelly

Copyright © 2017. Copyright is held by the owner/author(s).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery
New York, NY, United States
