A new probabilistic retrieval model based on the Dirichlet compound multinomial distribution

Authors:
Zuobing Xu

University of California, Santa Cruz, Santa Cruz, CA, USA

Ram Akella

University of California, Santa Cruz, Santa Cruz, CA, USA

SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, July 2008, Pages 427–434, https://doi.org/10.1145/1390334.1390408

Published: 20 July 2008

ABSTRACT

Classical probabilistic models attempt to capture the ad hoc information retrieval problem within a rigorous probabilistic framework. It has long been recognized that the primary obstacle to effective performance of these models is the need to estimate a relevance model. The Dirichlet compound multinomial (DCM) distribution, which can be motivated either by hierarchical Bayesian modeling or by the Pólya urn scheme, is a more appropriate generative model for text documents than the traditional multinomial distribution. We explore a new probabilistic model based on the DCM distribution, which enables efficient retrieval and accurate ranking. Because the DCM distribution captures the dependency among repeated occurrences of a word, the new model is able to capture the concavity of the score function more effectively. To avoid empirical tuning of retrieval parameters, we design several parameter estimation algorithms that set the model parameters automatically. Additionally, we propose a pseudo-relevance feedback algorithm based on latent mixture modeling of the DCM distribution to further improve retrieval accuracy. Finally, our experiments show that both the baseline probabilistic retrieval algorithm based on the DCM distribution and the corresponding pseudo-relevance feedback algorithm outperform existing language modeling systems on several TREC retrieval tasks.
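
To make the contrast between the DCM and the multinomial concrete, the following is a minimal sketch of the standard DCM probability mass function, P(x | α) = n! / ∏ x_w! · Γ(A) / Γ(A + n) · ∏ Γ(x_w + α_w) / Γ(α_w), with A = Σ α_w and n = Σ x_w. It is not the retrieval model of the paper: the function names, the two-word vocabulary, and the parameter values are assumptions chosen only to illustrate burstiness.

```python
import math


def dcm_log_pmf(counts, alpha):
    """Log-probability of a word-count vector under the DCM (Polya) distribution.

    counts: word counts x_w for one document (illustrative input).
    alpha:  positive Dirichlet parameters alpha_w, one per vocabulary word.
    """
    n = sum(counts)
    a = sum(alpha)
    # Multinomial coefficient n! / prod(x_w!)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # Gamma(A) / Gamma(A + n)
    log_norm = math.lgamma(a) - math.lgamma(a + n)
    # prod Gamma(x_w + alpha_w) / Gamma(alpha_w)
    log_terms = sum(
        math.lgamma(x + al) - math.lgamma(al) for x, al in zip(counts, alpha)
    )
    return log_coef + log_norm + log_terms


def multinomial_log_pmf(counts, probs):
    """Log-probability of the same count vector under a plain multinomial."""
    n = sum(counts)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    return log_coef + sum(x * math.log(p) for x, p in zip(counts, probs) if x > 0)


if __name__ == "__main__":
    # Hypothetical two-word vocabulary and a two-token document.
    alpha = [0.1, 0.1]   # small alpha makes repeated words likely (assumption)
    probs = [0.5, 0.5]   # multinomial with the same mean word proportions
    for doc in ([2, 0], [1, 1]):
        p_dcm = math.exp(dcm_log_pmf(doc, alpha))
        p_mult = math.exp(multinomial_log_pmf(doc, probs))
        print(doc, "DCM:", round(p_dcm, 4), "multinomial:", round(p_mult, 4))
```

With these illustrative values, the DCM gives the document that repeats one word a probability of about 0.46, against 0.25 under the multinomial. The document that uses each word once drops from 0.5 under the multinomial to about 0.08 under the DCM. This is the dependency among repeated occurrences that the abstract refers to.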

Index Terms

1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking

Published in
SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
July 2008
934 pages
ISBN: 9781605581644
DOI: 10.1145/1390334
General Chairs: Tat-Seng Chua (National University of Singapore), Mun-Kew Leong (National Library Board, Singapore)
Program Chairs: Syung Hyon Myaeng (Information and Communications University, Korea), Douglas W. Oard (University of Maryland, College Park, USA), Fabrizio Sebastiani (Consiglio Nazionale delle Ricerche, Italy)
Copyright © 2008 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Dirichlet distribution
language model
multinomial distribution
probabilistic retrieval model
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%