DOI: 10.1145/1835449.1835505
research-article

Estimation of statistical translation models based on mutual information for ad hoc information retrieval

Published: 19 July 2010

Abstract

As a principled approach to capturing semantic relations between words in information retrieval, statistical translation models have been shown to outperform simple document language models, which rely on exact matching of words in the query and documents. A main challenge in applying translation models to ad hoc information retrieval is estimating a translation model without training data. Existing work has relied on training on synthetic queries generated from a document collection. However, this method is computationally expensive and covers query words poorly. In this paper, we propose an alternative way to estimate a translation model, based on normalized mutual information between words, which is less computationally expensive and covers query words better than the synthetic query method of estimation. We also propose to regularize the estimated translation probabilities to ensure sufficient probability mass for self-translation. Experimental results show that the proposed mutual information-based estimation method is not only more efficient but also more effective than the synthetic query-based method, and that it can be combined with pseudo-relevance feedback to further improve retrieval accuracy. The results also show that the proposed regularization strategy is effective and can improve retrieval accuracy for both synthetic query-based estimation and mutual information-based estimation.
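
The two ideas in the abstract, normalizing mutual information into translation probabilities and reserving probability mass for self-translation, can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything specific here is an assumption made for illustration: mutual information is computed over document-level presence or absence of each word, candidates are restricted to words that co-occur with the source word at least once, and self-translation is regularized by linear interpolation with a hypothetical parameter `self_weight`. The function names are likewise hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def mutual_information(n, n_w, n_u, n_wu):
    """Mutual information between the binary presence indicators of two
    words, given document counts: n documents in total, n_w containing w,
    n_u containing u, and n_wu containing both."""
    joint = {
        (1, 1): n_wu,
        (1, 0): n_w - n_wu,
        (0, 1): n_u - n_wu,
        (0, 0): n - n_w - n_u + n_wu,
    }
    marg_w = {1: n_w, 0: n - n_w}
    marg_u = {1: n_u, 0: n - n_u}
    mi = 0.0
    for (x_w, x_u), count in joint.items():
        if count > 0:  # cells with zero count contribute nothing
            # p(x_w, x_u) * log(p(x_w, x_u) / (p(x_w) * p(x_u)))
            mi += (count / n) * math.log(count * n / (marg_w[x_w] * marg_u[x_u]))
    return max(mi, 0.0)  # guard against tiny negative rounding errors


def estimate_translation_model(docs, self_weight=0.5):
    """Return model[u][w], an estimate of p(w | u).

    docs is a list of tokenized documents (lists of words).
    self_weight is an assumed interpolation parameter in [0, 1] that
    guarantees at least that much probability for p(u | u)."""
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    df = Counter(w for s in doc_sets for w in s)                 # document frequency
    co = Counter((w, u) for s in doc_sets for w in s for u in s)  # co-occurrence counts
    model = {}
    for u in df:
        # Assumption of this sketch: only words that co-occur with u at
        # least once are translation candidates.
        scores = {
            w: mutual_information(n, df[w], df[u], co[(w, u)])
            for w in df
            if co[(w, u)] > 0
        }
        total = sum(scores.values())
        if total <= 0.0:
            # u carries no information (e.g. it occurs in every document).
            model[u] = {u: 1.0}
            continue
        # Normalize the scores, then shift mass toward self-translation.
        model[u] = {
            w: (1.0 - self_weight) * score / total
            + (self_weight if w == u else 0.0)
            for w, score in scores.items()
        }
    return model


if __name__ == "__main__":
    toy_docs = [
        ["car", "auto", "engine"],
        ["car", "auto"],
        ["cat", "pet"],
        ["cat", "pet", "food"],
    ]
    for w, p in sorted(estimate_translation_model(toy_docs)["car"].items()):
        print(w, round(p, 3))
```

The co-occurrence restriction is a design choice of this sketch rather than something stated in the abstract: raw mutual information is also high for words that systematically avoid each other, which is not the kind of relation a translation probability is meant to capture.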



    Published In

    SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
    July 2010
    944 pages
    ISBN:9781450301534
    DOI:10.1145/1835449

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. estimation
    2. language models
    3. smoothing
    4. statistical machine translation

    Qualifiers

    • Research-article

    Conference

    SIGIR '10

    Acceptance Rates

    SIGIR '10 Paper Acceptance Rate 87 of 520 submissions, 17%;
    Overall Acceptance Rate 792 of 3,983 submissions, 20%


