ABSTRACT
Integrating information in multiple natural languages is a challenging task that often requires manually created linguistic resources such as a bilingual dictionary or examples of direct translations of text. In this paper, we propose a general cross-lingual text mining method that does not rely on any of these resources, but can exploit comparable bilingual text corpora to discover mappings between words and documents in different languages. Comparable text corpora are collections of text documents in different languages that are about similar topics; such text corpora are often naturally available (e.g., news articles in different languages published in the same time period). The main idea of our method is to exploit frequency correlations of words in different languages in the comparable corpora and discover mappings between words in different languages. Such mappings can then be used to further discover mappings between documents in different languages, achieving cross-lingual information integration. Evaluation of the proposed method on a 120MB Chinese-English comparable news collection shows that the proposed method is effective for mapping words and documents in English and Chinese. Since our method only relies on naturally available comparable corpora, it is generally applicable to any language pairs as long as we have comparable corpora.
- J. Allan et al. Challenges in information retrieval and language modeling: report of a workshop held at the center for intelligent information retrieval. SIGIR Forum, 37(1):31--47, 2003. Google ScholarDigital Library
- L. Ballesteros and W. B. Croft. Resolving ambiguity for cross-language retrieval. In Research and Development in Information Retrieval, pages 64--71, 1998. Google ScholarDigital Library
- A. Berger and J. Lafferty. Information retrieval as statistical translation. In Proceedings of the 1999 ACM SIGIR Conference on Research and Development in Information Retrieval, pages 222--229, 1999. Google ScholarDigital Library
- T. M. Cover and J. Thomas. Elements of Information Theory. Wiley, 1991. Google ScholarDigital Library
- M. Franz, J. S. McCarley, and S. Roukos. Ad hoc and multilingual information retrieval at IBM. In Text REtrieval Conference, pages 104--115, 1998.Google Scholar
- P. Fung. A pattern matching method for finding noun and proper noun translations from noisy parallel corpora. In Proceedings of ACL 1995, pages 236--243, 1995. Google ScholarDigital Library
- M. Kay and M. Roscheisen. Text translation alignment. Computational Linguistics, 19(1):75--102, 1993. Google ScholarDigital Library
- H. Masuichi, R. Flournoy, S. Kaufmann, and S. Peters. A bootstrapping method for extracting bilingual text pairs. In Proc. 18th COLINC, 2000. Google ScholarDigital Library
- J. Ponte and W. B. Croft. A language modeling approach to information retrieval. In Proceedings of the ACM SIGIR'98, pages 275--281, 1998. Google ScholarDigital Library
- R. Rapp. Identifying word translations in non-parallel texts. In Proceedings of ACL 1995, pages 320--322, 1995. Google ScholarDigital Library
- S. Robertson and S. Walker. Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In Proceedings of SIGIR'94, pages 232--241, 1994. Google ScholarDigital Library
- S. E. Robertson, S. Walker, S. Jones, M. M.Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In D. K. Harman, editor, The Third Text REtrieval Conference (TREC-3), pages 109--126, 1995.Google Scholar
- F. Sadat, M. Yoshikawa, and S. Uemura. Bilingual terminology acquisition from comparable corpora and phrasal translation to cross-language information retrieval. http://acl.ldc.upenn.edu/P/P03/P03-2025.pdf.Google Scholar
- G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983. Google ScholarDigital Library
- K. Tanaka and H. Iwasaki. Extraction of lexical translation from non-aligned corpora. In Proceedings of COLING 1996, 1996. Google ScholarDigital Library
- J. Veronis. Parallel text processing: Alignment and use of translation corpora. In Kluwer Academic Publishers., 2000.Google ScholarCross Ref
- J. Xu, R. Weischedel, and C. Nguyen. Evaluating a probabilistic model for cross-lingual information retrieval. In Proceedings of ACM SIGIR 2001, 2001. Google ScholarDigital Library
- C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of SIGIR'01, pages 334--342, Sept 2001. Google ScholarDigital Library
- C. Zhai and J. Lafferty. Two-stage language models for information retrieval. In Proceedings of SIGIR'02, pages 49--56, Aug 2002. Google ScholarDigital Library
- C. Zhai, A. Velivelli, and B. Yu. A cross-collection mixture model for comparative text mining. In Proceedings of KDD 2004, 2004. Google ScholarDigital Library
Index Terms
- Mining comparable bilingual text corpora for cross-language information integration
Recommendations
Unsupervised Word-Sense Disambiguation Using Bilingual Comparable Corpora
An unsupervised method for word-sense disambiguation using bilingual comparable corpora was developed. First, it extracts word associations, i.e., statistically significant pairs of associated words, from the corpus of each language. Then, it aligns ...
Enhancing cross-language information retrieval by an automatic acquisition of bilingual terminology from comparable corpora
SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrievalThis paper presents an approach to bilingual lexicon extraction from comparable corpora and evaluations on Cross-Language Information Retrieval. We explore a bi-directional extraction of bilingual terminology primarily from comparable corpora. A ...
Learning bilingual translations from comparable corpora to cross-language information retrieval: hybrid statistics-based and linguistics-based approach
AsianIR '03: Proceedings of the sixth international workshop on Information retrieval with Asian languages - Volume 11Recent years saw an increased interest in the use and the construction of large corpora. With this increased interest and awareness has come an expansion in the application to knowledge acquisition and bilingual terminology extraction. The present paper ...
Comments