ABSTRACT
With the spread of globalization, multilingualism has become a common social phenomenon: more than one language may occur within a single conversation. This phenomenon is also prevalent in China, where a wide variety of informal Chinese texts contain English words, especially emails, social media posts, and other user-generated content. Because most existing natural language processing algorithms were designed for monolingual input, they cannot analyze mixed multilingual texts well. It is therefore critically important to preprocess mixed texts before applying downstream tasks. In this paper, we first analyze the mixed usage of Chinese and English in Chinese microblogs. We then detail the proposed two-stage method for normalizing mixed texts. We propose a noisy channel approach to translate in-vocabulary words into Chinese. To better incorporate users' historical information, we introduce a novel user-aware neural network language model. For out-of-vocabulary words (such as pronunciations and informal expressions), we propose a graph-based unsupervised method to categorize them. Experimental results on a manually annotated microblog dataset demonstrate the effectiveness of the proposed method. We also evaluate three natural language parsers with and without the proposed method as a preprocessing step. The results show that the proposed method can significantly benefit other NLP tasks in processing mixed text.
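As a rough illustration of the noisy channel idea named in the abstract, the sketch below picks, for an English word embedded in Chinese text, the Chinese candidate c that maximizes P(e | c) · P(c | context). It is not the paper's model: the tables `CHANNEL` and `LM`, the constant `FLOOR`, and the function `normalize_word` are hypothetical, the probabilities are invented, and a simple bigram lookup stands in for the user-aware neural network language model.

```python
import math

# Hypothetical toy tables; every value here is invented for illustration.
# CHANNEL[english_word][chinese_candidate] approximates P(english | chinese).
CHANNEL = {
    "hold": {"坚持": 0.6, "拿": 0.4},
}

# LM[(previous_word, candidate)] approximates P(candidate | previous_word).
# A bigram table is an assumption standing in for a real language model.
LM = {
    ("要", "坚持"): 0.05,
    ("要", "拿"): 0.01,
}

FLOOR = 1e-6  # assumed smoothing floor for unseen bigrams


def normalize_word(english, previous_word):
    """Return the candidate maximizing log P(e|c) + log P(c|prev).

    Returns None when the English word has no entry in CHANNEL,
    i.e. when it would be treated as out-of-vocabulary.
    """
    candidates = CHANNEL.get(english.lower())
    if not candidates:
        return None

    def score(item):
        candidate, channel_prob = item
        lm_prob = LM.get((previous_word, candidate), FLOOR)
        # Work in log space to avoid underflow on longer contexts.
        return math.log(channel_prob) + math.log(lm_prob)

    best_candidate, _ = max(candidates.items(), key=score)
    return best_candidate
```

Calling `normalize_word("hold", "要")` scores each candidate with both the channel and the context term, while a word missing from `CHANNEL` returns `None` so that it could be routed to a separate out-of-vocabulary step.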
Index Terms
- Chinese-English mixed text normalization