ABSTRACT
Classical probabilistic models attempt to capture the ad hoc information retrieval problem within a rigorous probabilistic framework. It has long been recognized that the primary obstacle to effective performance of these models is the need to estimate a relevance model. The Dirichlet compound multinomial (DCM) distribution, which can be motivated either by hierarchical Bayesian modeling or by the Polya urn scheme, is a more appropriate generative model for text documents than the traditional multinomial distribution. We explore a new probabilistic model based on the DCM distribution that enables efficient retrieval and accurate ranking. Because the DCM distribution captures the dependency among repeated occurrences of the same word, the new model is able to represent the concavity of the score function more effectively. To avoid empirical tuning of retrieval parameters, we design several parameter estimation algorithms that set the model parameters automatically. Additionally, we propose a pseudo-relevance feedback algorithm based on latent mixture modeling of the DCM distribution to further improve retrieval accuracy. Finally, our experiments show that both the baseline probabilistic retrieval algorithm based on the DCM distribution and the corresponding pseudo-relevance feedback algorithm outperform existing language modeling systems on several TREC retrieval tasks.
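As a rough illustration of why the DCM distribution suits repeated word occurrences better than the multinomial, the sketch below evaluates the standard DCM (multivariate Polya) probability mass function next to the multinomial one. It is not the retrieval scoring function or any estimation algorithm from this paper. The function names, the two-word vocabulary, and the parameter values are assumptions chosen only for this example.

```python
from math import exp, lgamma, log


def dcm_log_prob(counts, alpha):
    """Log-probability of a word-count vector under the DCM distribution.

    counts: occurrences of each vocabulary word in one document.
    alpha:  positive Dirichlet parameters, one per vocabulary word
            (illustrative values, not estimated from data).
    """
    n = sum(counts)
    a = sum(alpha)
    # Multinomial coefficient: n! / prod(x_w!)
    log_p = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # Normalizer: Gamma(A) / Gamma(n + A), with A = sum(alpha)
    log_p += lgamma(a) - lgamma(n + a)
    # Per-word terms: Gamma(x_w + alpha_w) / Gamma(alpha_w)
    log_p += sum(lgamma(x + al) - lgamma(al) for x, al in zip(counts, alpha))
    return log_p


def multinomial_log_prob(counts, probs):
    """Log-probability of the same count vector under a multinomial."""
    n = sum(counts)
    log_p = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    log_p += sum(x * log(p) for x, p in zip(counts, probs) if x > 0)
    return log_p


# Hypothetical two-word vocabulary, document of length 2.
repeated = [2, 0]   # the same word occurs twice
mixed = [1, 1]      # each word occurs once

p_dcm_repeated = exp(dcm_log_prob(repeated, [1.0, 1.0]))
p_dcm_mixed = exp(dcm_log_prob(mixed, [1.0, 1.0]))
p_mult_repeated = exp(multinomial_log_prob(repeated, [0.5, 0.5]))
p_mult_mixed = exp(multinomial_log_prob(mixed, [0.5, 0.5]))
```

With these assumed parameters, the DCM assigns the repeated-word document the same probability as the mixed one, whereas the multinomial makes it half as likely. Choosing smaller alpha values shifts still more mass toward repetition, which is the behavior the abstract describes as capturing the dependency of repeated word occurrences.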
Index Terms
- A new probabilistic retrieval model based on the Dirichlet compound multinomial distribution