Abstract
Language modeling approaches to information retrieval are attractive and promising because they connect the problem of retrieval with that of language model estimation, which has been studied extensively in other application areas such as speech recognition. The basic idea of these approaches is to estimate a language model for each document, and then rank documents by the likelihood of the query according to the estimated language model. A core problem in language model estimation is smoothing, which adjusts the maximum likelihood estimator so as to correct the inaccuracy due to data sparseness. In this paper, we study the problem of language model smoothing and its influence on retrieval performance. We examine the sensitivity of retrieval performance to the smoothing parameters and compare several popular smoothing methods on different test collections.
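The ranking idea described above can be illustrated with a minimal sketch. The abstract does not name the smoothing methods compared, so the two used here (Jelinek-Mercer interpolation and Dirichlet prior smoothing) are assumed as well-known examples. The toy corpus, the function names, and the parameter values `lam` and `mu` are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Toy collection (hypothetical data, not from the paper).
DOCS = {
    "d1": "language model smoothing for retrieval".split(),
    "d2": "speech recognition with a language model".split(),
    "d3": "cooking pasta at home".split(),
}
COLLECTION_TF = Counter(w for doc in DOCS.values() for w in doc)
COLLECTION_LEN = sum(COLLECTION_TF.values())


def query_log_likelihood(query, doc, method="jm", lam=0.5, mu=2000.0):
    """Return log p(query | smoothed unigram model of doc).

    method="jm":        (1 - lam) * p_ml(w | d) + lam * p(w | C)
    method="dirichlet": (c(w; d) + mu * p(w | C)) / (|d| + mu)
    """
    doc_tf = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for w in query:
        p_coll = COLLECTION_TF[w] / COLLECTION_LEN
        if p_coll == 0.0:
            # Assumption: query words absent from the whole collection are skipped.
            continue
        if method == "jm":
            p_ml = doc_tf[w] / doc_len if doc_len else 0.0
            p = (1.0 - lam) * p_ml + lam * p_coll
        elif method == "dirichlet":
            p = (doc_tf[w] + mu * p_coll) / (doc_len + mu)
        else:
            raise ValueError(f"unknown smoothing method: {method}")
        score += math.log(p)
    return score


def rank(query, method, **params):
    """Return document ids sorted by descending query log-likelihood."""
    scores = {
        doc_id: query_log_likelihood(query, doc, method=method, **params)
        for doc_id, doc in DOCS.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    q = "smoothing retrieval".split()
    print(rank(q, "jm", lam=0.5))
    print(rank(q, "dirichlet", mu=10.0))
```

Without smoothing, any document missing a single query word would receive probability zero (a log-likelihood of negative infinity). Mixing in the collection model keeps every score finite, which is why the choice of method and of the parameter (`lam` or `mu` here) can affect the ranking.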
A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval
SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval