Abstract
A machine-readable bilingual dictionary plays a crucial role in many natural language processing tasks, such as statistical machine translation and cross-language information retrieval. In this article, we propose a framework for extracting a bilingual dictionary from comparable corpora by exploiting a novel combination of topic modeling and word aligners such as the IBM models. Using a multilingual topic model, we first convert a comparable document-aligned corpus into a parallel topic-aligned corpus. This novel topic-aligned corpus is similar in structure to the sentence-aligned corpus frequently employed in statistical machine translation and allows us to extract a bilingual dictionary using a word alignment model.
The main advantages of our framework are that (1) no seed dictionary is necessary for bootstrapping the process, and (2) multilingual comparable corpora in more than two languages can also be exploited. In our experiments on a large-scale Wikipedia dataset, we demonstrate that our approach extracts higher-precision dictionaries than previous approaches and that our method improves further as we add more languages to the dataset.
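The two-step pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that per-token topic assignments are already available from a trained multilingual topic model (training is not shown), it uses plain IBM Model 1 without a NULL word as a stand-in for the word aligner, and the function names and the English-French toy data are hypothetical.

```python
from collections import defaultdict


def topic_aligned_corpus(docs_e, docs_f):
    """Build one (source words, target words) pair per shared topic.

    Each document is a list of (word, topic_id) tokens. The topic ids are
    ASSUMED to come from an already-trained multilingual topic model.
    """
    by_topic_e, by_topic_f = defaultdict(set), defaultdict(set)
    for doc in docs_e:
        for word, topic in doc:
            by_topic_e[topic].add(word)
    for doc in docs_f:
        for word, topic in doc:
            by_topic_f[topic].add(word)
    shared = sorted(set(by_topic_e) & set(by_topic_f))
    return [(sorted(by_topic_e[k]), sorted(by_topic_f[k])) for k in shared]


def ibm_model1(pairs, iterations=20):
    """Plain IBM Model 1 EM (no NULL word); returns t[(f, e)] ~ p(f | e)."""
    vocab_f = {f for _, fs in pairs for f in fs}
    uniform = 1.0 / len(vocab_f)
    t = defaultdict(lambda: uniform)  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: spread each target word over the source words of its pair.
        for es, fs in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise expected counts per source word.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


def best_translation(t, e):
    """Return the target word with the highest p(f | e) for source word e."""
    candidates = {f: p for (f, e2), p in t.items() if e2 == e}
    return max(candidates, key=candidates.get)


# Hypothetical toy data: (word, topic_id) tokens for two aligned documents.
docs_en = [[("dog", 0), ("house", 0), ("dog", 1), ("book", 1)],
           [("cat", 2), ("book", 2)]]
docs_fr = [[("chien", 0), ("maison", 0), ("chien", 1), ("livre", 1)],
           [("chat", 2), ("livre", 2)]]

pairs = topic_aligned_corpus(docs_en, docs_fr)
t = ibm_model1(pairs)
dictionary = {e: best_translation(t, e) for e in ["dog", "house", "book", "cat"]}
print(dictionary)
```

Each topic plays the role that a sentence pair plays in a sentence-aligned corpus, so the aligner never needs a seed dictionary. In this sketch a topic's word list is simply the set of words assigned to it; a real system would likely filter words by topic probability before alignment.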
Index Terms
- Multilingual Topic Models for Bilingual Dictionary Extraction