DOI: 10.1145/1008992.1009021
Article

Resource selection for domain-specific cross-lingual IR

Published: 25 July 2004

Abstract

An under-explored question in cross-language information retrieval (CLIR) is to what degree the performance of CLIR methods depends on the availability of high-quality translation resources for particular domains. To address this issue, we evaluate several competitive CLIR methods, each with different training corpora, on test documents in the medical domain. Our results show severe performance degradation when using a general-purpose training corpus or a commercial machine translation system (SYSTRAN) instead of a domain-specific training corpus. A related unexplored question is whether we can improve CLIR performance by systematically analyzing training resources and optimally matching them to target collections. We begin exploring this problem by suggesting a simple criterion for automatically matching training resources to target corpora. By using cosine similarity between training and target corpora as resource weights, we obtained an average improvement of 5.6% over using all resources with no weights. The same metric yields 99.4% of the performance obtained when an oracle chooses the optimal resource every time.
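The weighting criterion described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the details, so everything below is an assumption made for illustration: whitespace tokenization, raw term-frequency vectors, comparison on the side of each training corpus that shares a language with the target collection, normalization of the similarities so the weights sum to 1, and the toy corpora themselves. The function names (`term_vector`, `cosine_similarity`, `resource_weights`) are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def term_vector(documents):
    """Build a raw term-frequency vector from lowercased whitespace tokens (assumed tokenization)."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def resource_weights(training_corpora, target_corpus):
    """Weight each training resource by its cosine similarity to the target corpus.

    Normalizing the similarities to sum to 1 is an assumption of this sketch.
    """
    target_vec = term_vector(target_corpus)
    sims = {
        name: cosine_similarity(term_vector(docs), target_vec)
        for name, docs in training_corpora.items()
    }
    total = sum(sims.values())
    if total == 0.0:
        # No lexical overlap at all: fall back to uniform weights.
        return {name: 1.0 / len(sims) for name in sims}
    return {name: sim / total for name, sim in sims.items()}


# Toy data, invented for illustration only.
training = {
    "medical": [
        "the patient received treatment for the infection",
        "diagnosis of the patient was confirmed",
    ],
    "parliament": [
        "the committee voted on the budget proposal",
        "the parliament debated the trade agreement",
    ],
}
target = [
    "treatment of infection in the patient",
    "the diagnosis and treatment plan",
]

weights = resource_weights(training, target)
for name, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {weight:.3f}")
```

In this toy setting the medical resource receives the larger weight because it shares more vocabulary with the target collection. How such weights would then be used when combining translation resources is not specified in the abstract and is left out of the sketch.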

Published In

SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
July 2004
624 pages
ISBN:1581138814
DOI:10.1145/1008992
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cross-language information retrieval
  2. domain-specific translation

Qualifiers

  • Article

Conference

SIGIR04

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Cited By

  • (2020) Searching Covid-19 by linguistic register: Parallels and warrant for a new retrieval model. Proceedings of the Association for Information Science and Technology 57:1. DOI: 10.1002/pra2.246. Online publication date: 22-Oct-2020
  • (2011) Indexing and weighting of multilingual and mixed documents. Proceedings of the South African Institute of Computer Scientists and Information Technologists Conference on Knowledge, Innovation and Leadership in a Diverse, Multidisciplinary Environment (161-170). DOI: 10.1145/2072221.2072240. Online publication date: 3-Oct-2011
  • (2009) Exploring the Effectiveness of Chinese-to-English Machine Translation for CLIR Applications in Earthquake Engineering. Journal of Computing in Civil Engineering 23:3 (140-147). DOI: 10.1061/(ASCE)0887-3801(2009)23:3(140). Online publication date: May-2009
  • (2008) Corpus microsurgery. Proceedings of the 17th ACM conference on Information and knowledge management (1365-1366). DOI: 10.1145/1458082.1458281. Online publication date: 26-Oct-2008
  • (2008) Integrating Cross-Language Hierarchies and Its Application to Retrieving Relevant Documents. ACM Transactions on Asian Language Information Processing 7:3 (1-22). DOI: 10.1145/1386869.1386870. Online publication date: 1-Jun-2008
  • (2006) Pragmatic text mining. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (852-861). DOI: 10.1145/1150402.1150520. Online publication date: 20-Aug-2006
  • (2006) Fitness assessment of document model. International Journal of Systems Science 37:13 (893-903). DOI: 10.1080/00207720600891539. Online publication date: 20-Oct-2006
  • (2006) A cross-lingual framework for web news taxonomy integration. Proceedings of the Third Asia conference on Information Retrieval Technology (270-283). DOI: 10.1007/11880592_21. Online publication date: 16-Oct-2006
  • (2005) Bootstrapping dictionaries for cross-language information retrieval. Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval (528-535). DOI: 10.1145/1076034.1076124. Online publication date: 15-Aug-2005
  • (2004) Customizing parallel corpora at the document level. Proceedings of the ACL 2004 on Interactive poster and demonstration sessions (5-es). DOI: 10.3115/1219044.1219049. Online publication date: 21-Jul-2004
