research-article

Multilingual Topic Models for Bilingual Dictionary Extraction

Published: 12 June 2015

Abstract

A machine-readable bilingual dictionary plays a crucial role in many natural language processing tasks, such as statistical machine translation and cross-language information retrieval. In this article, we propose a framework for extracting a bilingual dictionary from comparable corpora by exploiting a novel combination of topic modeling and word aligners such as the IBM models. Using a multilingual topic model, we first convert a comparable document-aligned corpus into a parallel topic-aligned corpus. This novel topic-aligned corpus is similar in structure to the sentence-aligned corpus frequently employed in statistical machine translation and allows us to extract a bilingual dictionary using a word alignment model.

The main advantages of our framework are that (1) no seed dictionary is necessary for bootstrapping the process, and (2) multilingual comparable corpora in more than two languages can also be exploited. In our experiments on a large-scale Wikipedia dataset, we demonstrate that our approach extracts higher-precision dictionaries than previous approaches and that our method improves further as we add more languages to the dataset.
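
The second step of the pipeline described in the abstract, running a word alignment model over a topic-aligned corpus, can be illustrated with a minimal sketch. It assumes that a multilingual topic model has already produced, for each topic, a list of top words in each language; the toy word lists, the names `topic_pairs`, `ibm_model1`, and `extract_dictionary`, and the choice of a simplified IBM Model 1 (no NULL word) are assumptions made for illustration and are not taken from the article.

```python
from collections import defaultdict

# Hypothetical topic-aligned corpus: one (English, German) pair of
# top-word lists per topic. The words are invented for illustration.
topic_pairs = [
    (["house", "book"], ["haus", "buch"]),
    (["house", "garden"], ["haus", "garten"]),
    (["book", "library"], ["buch", "bibliothek"]),
]


def ibm_model1(pairs, iterations=20):
    """Estimate t(f | e) with EM, as in IBM Model 1 (simplified: no NULL word)."""
    src_vocab = {e for src, _ in pairs for e in src}
    tgt_vocab = {f for _, tgt in pairs for f in tgt}
    # Start from a uniform distribution over the target vocabulary.
    uniform = 1.0 / len(tgt_vocab)
    t = {(f, e): uniform for e in src_vocab for f in tgt_vocab}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each target word over the source words
        # of the same topic, in proportion to the current t(f | e).
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise the expected counts per source word.
        for f, e in t:
            t[(f, e)] = count[(f, e)] / total[e]
    return t


def extract_dictionary(t):
    """Keep the most probable target word for each source word."""
    best = {}
    for (f, e), p in t.items():
        if e not in best or p > best[e][1]:
            best[e] = (f, p)
    return {e: f for e, (f, _) in best.items()}


t = ibm_model1(topic_pairs)
dictionary = extract_dictionary(t)
print(dictionary)
```

Each topic plays the role that a sentence pair plays in statistical machine translation, so the aligner itself needs no seed dictionary. A real system would likely weight words by their topic probabilities and filter low-confidence pairs, which this sketch leaves out.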

Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 14, Issue 3
June 2015
90 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/2791399

Copyright © 2015 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 12 June 2015
• Accepted: 1 November 2014
• Revised: 1 September 2014
• Received: 1 May 2014