DOI: 10.1145/1081870.1081958
Article

Mining comparable bilingual text corpora for cross-language information integration

Published: 21 August 2005

ABSTRACT

Integrating information in multiple natural languages is a challenging task that often requires manually created linguistic resources, such as a bilingual dictionary or examples of direct translations of text. In this paper, we propose a general cross-lingual text mining method that does not rely on any of these resources, but instead exploits comparable bilingual text corpora to discover mappings between words and documents in different languages. Comparable text corpora are collections of text documents in different languages that are about similar topics; such corpora are often naturally available (e.g., news articles in different languages published in the same time period). The main idea of our method is to exploit frequency correlations of words in different languages in the comparable corpora to discover mappings between words in different languages. These word mappings can then be used to discover mappings between documents in different languages, achieving cross-lingual information integration. Evaluation on a 120 MB Chinese-English comparable news collection shows that the proposed method is effective for mapping words and documents between English and Chinese. Since our method relies only on naturally available comparable corpora, it is generally applicable to any language pair for which comparable corpora are available.
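The frequency-correlation idea in the abstract can be illustrated with a minimal sketch. It assumes that the two corpora are aligned by day, that documents are already tokenized, and that Pearson correlation is the association measure; the abstract does not specify the scoring function, so that choice, the toy data, and the helper names `frequency_series`, `pearson`, and `best_mapping` are all hypothetical.

```python
import math
from collections import Counter


def frequency_series(docs_by_day, word):
    """Relative frequency of `word` on each day (one value per day)."""
    series = []
    for docs in docs_by_day:
        counts = Counter(tok for doc in docs for tok in doc)
        total = sum(counts.values())
        series.append(counts[word] / total if total else 0.0)
    return series


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def best_mapping(source_word, source_days, target_days, target_vocab):
    """Return the target-language word whose daily frequency series
    correlates most strongly with that of `source_word`."""
    src = frequency_series(source_days, source_word)
    scored = [
        (pearson(src, frequency_series(target_days, w)), w)
        for w in target_vocab
    ]
    return max(scored)[1]


# Toy comparable corpora (assumed data): one list of tokenized documents per day.
english_days = [
    [["earthquake", "rescue"]],
    [["election", "vote"]],
    [["earthquake", "damage"]],
]
chinese_days = [
    [["地震", "救援"]],
    [["选举", "投票"]],
    [["地震", "损失"]],
]
chinese_vocab = ["地震", "救援", "选举", "投票", "损失"]

match = best_mapping("earthquake", english_days, chinese_days, chinese_vocab)
```

In this toy data, "earthquake" and "地震" both appear on the first and third days only, so their frequency series rise and fall together and the pair receives the highest correlation. A real collection would need many more time points and some filtering of very frequent or very rare words before such correlations become reliable.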

    • Published in

      KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
      August 2005
      844 pages
      ISBN: 159593135X
      DOI: 10.1145/1081870

      Copyright © 2005 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
