Multilingual Topic Models for Bilingual Dictionary Extraction

Authors: Xiaodong Liu, Kevin Duh, and Yuji Matsumoto

Nara Institute of Science and Technology
ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 14, Issue 3, Article No. 11, pp. 1–22
https://doi.org/10.1145/2699939

Published: 12 June 2015

Abstract

A machine-readable bilingual dictionary plays a crucial role in many natural language processing tasks, such as statistical machine translation and cross-language information retrieval. In this article, we propose a framework for extracting a bilingual dictionary from comparable corpora by exploiting a novel combination of topic modeling and word aligners such as the IBM models. Using a multilingual topic model, we first convert a comparable document-aligned corpus into a parallel topic-aligned corpus. This novel topic-aligned corpus is similar in structure to the sentence-aligned corpus frequently employed in statistical machine translation and allows us to extract a bilingual dictionary using a word alignment model.

The main advantages of our framework are that (1) no seed dictionary is necessary for bootstrapping the process, and (2) multilingual comparable corpora in more than two languages can also be exploited. In our experiments on a large-scale Wikipedia dataset, we demonstrate that our approach can extract higher-precision dictionaries than previous approaches and that our method improves further as we add more languages to the dataset.
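
The two-step idea in the abstract (topic-aligned corpus first, word alignment second) can be illustrated with a toy sketch. The abstract does not specify how the topic-aligned corpus is built or which aligner settings are used, so everything below is an assumption made for illustration: per-token topic assignments are taken as already produced by some multilingual topic model, each topic contributes one pseudo-sentence pair, and a bare IBM Model 1 (no NULL word) stands in for the word aligner. The function names and the toy English/Japanese tokens are hypothetical.

```python
from collections import defaultdict


def build_topic_aligned_corpus(src_tokens, tgt_tokens):
    """Group (word, topic_id) tokens by topic; each shared topic gives one pseudo-sentence pair.

    Assumption: topic assignments come from a multilingual topic model run beforehand.
    """
    src_by_topic, tgt_by_topic = defaultdict(list), defaultdict(list)
    for word, topic in src_tokens:
        src_by_topic[topic].append(word)
    for word, topic in tgt_tokens:
        tgt_by_topic[topic].append(word)
    shared = sorted(set(src_by_topic) & set(tgt_by_topic))
    return [(src_by_topic[k], tgt_by_topic[k]) for k in shared]


def ibm_model1(pairs, iterations=10):
    """Estimate t[(f, e)] = p(target word f | source word e) with EM (no NULL word)."""
    tgt_vocab = {f for _, tgt in pairs for f in tgt}
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected co-occurrence counts within each pair.
        for src, tgt in pairs:
            for f in tgt:
                z = sum(t[(f, e)] for e in src)
                for e in src:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise the counts per source word.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def best_translation(t, pairs, e):
    """Return the co-occurring target word with the highest p(f | e)."""
    candidates = {f for src, tgt in pairs if e in src for f in tgt}
    return max(sorted(candidates), key=lambda f: t[(f, e)])


# Toy data (hypothetical): tokens tagged with topic ids 0 and 1.
en_tokens = [("dog", 0), ("dog", 1), ("cat", 1)]
ja_tokens = [("inu", 0), ("inu", 1), ("neko", 1)]

pairs = build_topic_aligned_corpus(en_tokens, ja_tokens)
t = ibm_model1(pairs)
print(best_translation(t, pairs, "dog"), best_translation(t, pairs, "cat"))
```

In this toy setting the topic that contains only "dog" and "inu" disambiguates the mixed topic, which is the same mechanism by which sentence pairs disambiguate each other in ordinary word alignment. No seed dictionary appears anywhere in the sketch, in line with advantage (1) above.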

Index Terms

Computing methodologies → Artificial intelligence → Natural language processing


Published in
ACM Transactions on Asian and Low-Resource Language Information Processing Volume 14, Issue 3
June 2015
90 pages
ISSN:2375-4699
EISSN:2375-4702
DOI:10.1145/2791399
Editor: Richard Sproat (Google, Inc., USA)
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher: Association for Computing Machinery, New York, NY, United States
Publication History
- Published: 12 June 2015
- Accepted: 1 November 2014
- Revised: 1 September 2014
- Received: 1 May 2014
Author Tags
Bilingual dictionary
comparable corpus
multilingual topic model
Qualifiers
- research-article
- Research
- Refereed

