Chinese-Japanese Machine Translation Exploiting Chinese Characters

Published: 01 October 2013

Abstract

The Chinese and Japanese languages share Chinese characters. Because the Chinese characters used in Japanese originated in ancient China, the two languages have many Chinese characters in common. These common characters carry significant semantic information and share the same meaning in both languages, so they can be quite useful in Chinese-Japanese machine translation (MT). We therefore propose a method for creating a Chinese character mapping table for Japanese, traditional Chinese, and simplified Chinese, with the aim of constructing a complete resource of common Chinese characters. Furthermore, we point out two main problems in Chinese word segmentation for Chinese-Japanese MT, namely unknown words and word segmentation granularity, and propose an approach that exploits common Chinese characters to solve them. We also propose a statistical method for detecting semantically equivalent Chinese characters beyond the common ones, and a method for exploiting shared Chinese characters in phrase alignment. Experiments on a state-of-the-art phrase-based statistical MT system and an example-based MT system show that the proposed approaches improve MT performance significantly, thereby verifying the effectiveness of shared Chinese characters for Chinese-Japanese MT.
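To make the idea of a character mapping table concrete, the following is a minimal sketch and not the authors' resource or method. The four table rows are hand-picked illustrations, and the names `TOY_TABLE`, `normalize`, and `shared_characters` are hypothetical. It assumes a one-to-one mapping between variants, which real kanji/hanzi correspondences do not always satisfy.

```python
# Toy mapping table: (Japanese kanji, traditional Chinese, simplified Chinese).
# Hand-picked illustrative rows only; NOT entries taken from the paper's table.
TOY_TABLE = [
    ("国", "國", "国"),
    ("気", "氣", "气"),
    ("広", "廣", "广"),
    ("雪", "雪", "雪"),
]

# Map every variant to one canonical key (assumption: use the traditional form).
CANONICAL = {}
for kanji, trad, simp in TOY_TABLE:
    for ch in (kanji, trad, simp):
        CANONICAL[ch] = trad


def normalize(text):
    """Replace each character that appears in the table by its canonical form."""
    return "".join(CANONICAL.get(ch, ch) for ch in text)


def shared_characters(japanese, chinese):
    """List the characters of `japanese` whose canonical form occurs in `chinese`."""
    chinese_forms = set(normalize(chinese))
    return [ch for ch in japanese if CANONICAL.get(ch, ch) in chinese_forms]


if __name__ == "__main__":
    # Japanese 広 and simplified Chinese 广 look different but map to the same row.
    print(shared_characters("広気", "广气"))
```

Normalizing both sides to a single canonical form keeps the comparison symmetric across Japanese, traditional, and simplified text. A fuller resource would have to handle characters with several variants, which this sketch ignores.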

    Published In

    ACM Transactions on Asian Language Information Processing, Volume 12, Issue 4
    October 2013
    86 pages
    ISSN:1530-0226
    EISSN:1558-3430
    DOI:10.1145/2523057

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 October 2013
    Accepted: 01 April 2013
    Revised: 01 February 2013
    Received: 01 August 2012
    Published in TALIP Volume 12, Issue 4

    Author Tags

    1. Chinese characters
    2. Chinese-Japanese
    3. machine translation
    4. phrase alignment
    5. segmentation

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Cited By

    • (2025) WA-Net. Engineering Applications of Artificial Intelligence 140:C. DOI: 10.1016/j.engappai.2024.109674. Online publication date: 15-Jan-2025
    • (2023) Italian-Chinese Neural Machine Translation: results and lessons learnt. Proceedings of the 2023 ACM Conference on Information Technology for Social Good (455-461). DOI: 10.1145/3582515.3609567. Online publication date: 6-Sep-2023
    • (2020) Ancient Chinese Lexicon Construction Based on Unsupervised Algorithm of Minimum Entropy and CBDB Optimization. Human Centered Computing (143-149). DOI: 10.1007/978-3-030-70626-5_15. Online publication date: 14-Dec-2020
    • (2019) A Word Segmentation Method of Ancient Chinese Based on Word Alignment. Natural Language Processing and Chinese Computing (761-772). DOI: 10.1007/978-3-030-32233-5_59. Online publication date: 9-Oct-2019
    • (2018) Fast-Syntax-Matching-Based Japanese-Chinese Limited Machine Translation. Computational Linguistics and Intelligent Text Processing (63-73). DOI: 10.1007/978-3-319-75487-1_6. Online publication date: 21-Mar-2018
    • (2017) An Approach for Chinese-Japanese Named Entity Equivalents Extraction Using Inductive Learning and Hanzi-Kanji Mapping Table. IEICE Transactions on Information and Systems E100.D:8 (1882-1892). DOI: 10.1587/transinf.2016EDP7425. Online publication date: 2017
    • (2016) Fast-Syntax-Matching-Based Japanese-Chinese Limited Machine Translation. Natural Language Understanding and Intelligent Applications (621-630). DOI: 10.1007/978-3-319-50496-4_55. Online publication date: 2-Dec-2016
    • (2015) Integrated Parallel Sentence and Fragment Extraction from Comparable Corpora. ACM Transactions on Asian and Low-Resource Language Information Processing 15:2 (1-22). DOI: 10.1145/2833089. Online publication date: 11-Dec-2015
    • (2014) A bilingual word alignment algorithm of Vietnamese-Chinese based on feature constraint. International Journal of Machine Learning and Cybernetics 6:4 (537-543). DOI: 10.1007/s13042-014-0293-6. Online publication date: 26-Aug-2014
