research-article

Estimation of statistical translation models based on mutual information for ad hoc information retrieval

Authors: Maryam Karimzadehgan, ChengXiang Zhai

SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval

Pages 323 - 330

https://doi.org/10.1145/1835449.1835505

Published: 19 July 2010

Abstract

As a principled approach to capturing semantic relations between words in information retrieval, statistical translation models have been shown to outperform simple document language models, which rely on exact matching of words in the query and documents. A main challenge in applying translation models to ad hoc information retrieval is estimating a translation model without training data. Existing work has relied on training on synthetic queries generated from a document collection. However, this method is computationally expensive and has poor coverage of query words. In this paper, we propose an alternative way to estimate a translation model based on normalized mutual information between words, which is less computationally expensive and has better coverage of query words than the synthetic query method of estimation. We also propose to regularize the estimated translation probabilities to ensure sufficient probability mass for self-translation. Experimental results show that the proposed mutual information-based estimation method is not only more efficient but also more effective than the synthetic query-based method, and that it can be combined with pseudo-relevance feedback to further improve retrieval accuracy. The results also show that the proposed regularization strategy is effective and can improve retrieval accuracy for both synthetic query-based estimation and mutual information-based estimation.
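
The abstract does not give formulas, so the following is only a minimal sketch of what the two ideas could look like in practice, under stated assumptions. It assumes that mutual information is computed between binary word-presence variables estimated from document co-occurrence counts, that "normalized" means dividing each score by the sum of scores for the same source word, and that regularization mixes in a fixed self-translation mass `alpha`. The function names, the `alpha` parameter, and the toy corpus are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(n_w, n_u, n_wu, n_docs):
    """MI between the binary presence variables of two words.

    n_w, n_u: document frequencies; n_wu: number of documents containing both.
    """
    mi = 0.0
    for x_w in (0, 1):
        for x_u in (0, 1):
            # Number of documents showing this presence/absence pattern.
            if x_w and x_u:
                joint = n_wu
            elif x_w:
                joint = n_w - n_wu
            elif x_u:
                joint = n_u - n_wu
            else:
                joint = n_docs - n_w - n_u + n_wu
            if joint == 0:
                continue  # a zero-probability cell contributes nothing
            p_joint = joint / n_docs
            p_w = (n_w if x_w else n_docs - n_w) / n_docs
            p_u = (n_u if x_u else n_docs - n_u) / n_docs
            mi += p_joint * math.log(p_joint / (p_w * p_u))
    return max(0.0, mi)  # guard against tiny negative rounding error


def estimate_translation(docs, alpha=0.5):
    """Return table[u][w], an assumed estimate of p(w | u).

    alpha is a hypothetical knob: the extra mass reserved for self-translation.
    """
    doc_sets = [set(d) for d in docs]
    n_docs = len(doc_sets)
    df = Counter(w for d in doc_sets for w in d)
    co = Counter()
    for d in doc_sets:
        for a, b in combinations(sorted(d), 2):
            co[(a, b)] += 1

    vocab = sorted(df)
    table = {}
    for u in vocab:
        scores = {}
        for w in vocab:
            n_wu = df[u] if w == u else co[tuple(sorted((w, u)))]
            scores[w] = mutual_information(df[w], df[u], n_wu, n_docs)
        total = sum(scores.values())
        if total == 0.0:
            # A word present in every document carries no information;
            # fall back to pure self-translation.
            table[u] = {u: 1.0}
            continue
        # Normalize the MI scores, then shift mass toward self-translation.
        table[u] = {
            w: (alpha if w == u else 0.0) + (1.0 - alpha) * s / total
            for w, s in scores.items()
        }
    return table


if __name__ == "__main__":
    toy_docs = [["car", "auto", "engine"], ["car", "auto"],
                ["dog", "cat"], ["dog", "pet"]]
    for w, p in sorted(estimate_translation(toy_docs)["car"].items()):
        print(f"p({w} | car) = {p:.3f}")
```

Two properties of this sketch are worth noting. Mutual information is also positive for words that systematically avoid each other, so a real system would likely filter or threshold candidate pairs. The table is quadratic in vocabulary size, so it is only practical here because the toy vocabulary is tiny.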

Index Terms

1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking

Published In

SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval

July 2010

944 pages

ISBN:9781450301534

DOI:10.1145/1835449

General Chairs: Fabio Crestani (University of Lugano, CH), Stéphane Marchand-Maillet (University of Geneva, CH)

Program Chairs: Hsin-Hsi Chen (National Taiwan University, TW), Efthimis N. Efthimiadis (University of Washington, USA), Jacques Savoy (University of Neuchatel, CH)

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGIR '10

Sponsor:

SIGIR

SIGIR '10: The 33rd International ACM SIGIR conference on research and development in Information Retrieval

July 19 - 23, 2010

Geneva, Switzerland

Acceptance Rates

SIGIR '10 Paper Acceptance Rate 87 of 520 submissions, 17%;

Overall Acceptance Rate 792 of 3,983 submissions, 20%
