DOI:10.1145/1390334.1390408
research-article

A new probabilistic retrieval model based on the Dirichlet compound multinomial distribution

Published: 20 July 2008

ABSTRACT

Classical probabilistic models attempt to capture the ad hoc information retrieval problem within a rigorous probabilistic framework. It has long been recognized that the primary obstacle to the effective performance of these models is the need to estimate a relevance model. The Dirichlet compound multinomial (DCM) distribution, which relies on hierarchical Bayesian modeling techniques or, equivalently, on the Pólya urn scheme, is a more appropriate generative model for text documents than the traditional multinomial distribution. We explore a new probabilistic model based on the DCM distribution, which enables efficient retrieval and accurate ranking. Because the DCM distribution captures the dependency among repeated occurrences of a word, the new probabilistic model is able to model the concavity of the score function more effectively. To avoid empirical tuning of retrieval parameters, we design several parameter estimation algorithms that set the model parameters automatically. Additionally, we propose a pseudo-relevance feedback algorithm based on latent mixture modeling of the DCM distribution to further improve retrieval accuracy. Finally, our experiments show that both the baseline probabilistic retrieval algorithm based on the DCM distribution and the corresponding pseudo-relevance feedback algorithm outperform existing language modeling systems on several TREC retrieval tasks.
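
As a rough illustration of why the DCM distribution captures repeated word occurrences, the sketch below evaluates the standard DCM (Pólya) probability of a word-count vector and the Pólya urn predictive probability of drawing a word again. It is not the retrieval scoring function or any estimation algorithm from the paper, which the abstract does not spell out. The function names and the toy two-word vocabulary are assumptions made for this example.

```python
import math


def dcm_log_prob(counts, alphas):
    """Log-probability of a count vector under the standard DCM distribution.

    P(x | alpha) = n! / prod_w x_w!
                   * Gamma(A) / Gamma(A + n)
                   * prod_w Gamma(x_w + alpha_w) / Gamma(alpha_w)
    where n = sum_w x_w and A = sum_w alpha_w.
    """
    n = sum(counts)
    a = sum(alphas)
    # Multinomial coefficient: number of word sequences with these counts.
    log_coeff = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # Normalizer that depends only on the document length.
    log_norm = math.lgamma(a) - math.lgamma(a + n)
    # Per-word terms; a word with count 0 contributes exactly 0.
    log_terms = sum(
        math.lgamma(x + al) - math.lgamma(al) for x, al in zip(counts, alphas)
    )
    return log_coeff + log_norm + log_terms


def dcm_prob(counts, alphas):
    """Probability of a count vector under the DCM distribution."""
    return math.exp(dcm_log_prob(counts, alphas))


def polya_next_word_prob(counts, alphas, w):
    """Pólya urn predictive probability that the next word drawn is word w,
    given the counts observed so far (hypothetical helper for this sketch)."""
    return (alphas[w] + counts[w]) / (sum(alphas) + sum(counts))


if __name__ == "__main__":
    alphas = [0.1, 0.1]  # toy vocabulary of two words, small pseudo-counts
    before = polya_next_word_prob([0, 0], alphas, 0)
    after = polya_next_word_prob([1, 0], alphas, 0)
    # Seeing word 0 once makes a second occurrence much more likely.
    print(before, after)
```

Under a plain multinomial the probability of the next word does not depend on what has already been drawn. In the urn view it rises once a word has been seen, and this is the dependency among repeated occurrences that the abstract refers to.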


    • Published in

      SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
      July 2008
      934 pages
      ISBN:9781605581644
      DOI:10.1145/1390334

      Copyright © 2008 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 792 of 3,983 submissions, 20%
