research-article

Chinese-English mixed text normalization

Authors:
Qi Zhang, Fudan University, Shanghai, China
Huan Chen, Fudan University, Shanghai, China
Xuanjing Huang, Fudan University, Shanghai, China

WSDM '14: Proceedings of the 7th ACM international conference on Web search and data mining, February 2014, Pages 433–442, https://doi.org/10.1145/2556195.2556228

Published: 24 February 2014

ABSTRACT

With the expansion of globalization, multilingualism has become a common social phenomenon, and more than one language may occur within a single conversation. This phenomenon is also prevalent in China: a wide variety of informal Chinese texts contain English words, especially emails, social media posts, and other user-generated content. Because most existing natural language processing algorithms were designed for monolingual input, they cannot analyze mixed multilingual texts well. Hence, it is critically important to preprocess mixed texts before applying other tasks. In this paper, we first analyze the mixed usage of Chinese and English in Chinese microblogs. We then detail the proposed two-stage method for normalizing mixed texts. We use a noisy channel approach to translate in-vocabulary words into Chinese. To better incorporate users' historical information, we introduce a novel user-aware neural network language model. For out-of-vocabulary words (such as pronunciations and informal expressions), we use a graph-based unsupervised method to categorize them. Experimental results on a manually annotated microblog dataset demonstrate the effectiveness of the proposed method. We also evaluate three natural language parsers with and without the proposed method as a preprocessing step. The results show that the proposed method can significantly benefit other NLP tasks that process mixed text.
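The noisy channel idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the probability tables, the example words, and the function name `normalize_word` are all hypothetical, and a simple bigram table stands in for the user-aware neural network language model. The sketch picks the Chinese candidate c that maximizes P(e | c) · P(c | previous word) for an embedded English word e.

```python
import math

# Hypothetical toy tables, invented for illustration only (not from the paper).
# CHANNEL[e][c] approximates P(e | c): how likely English word e stands for Chinese word c.
CHANNEL = {
    "hold": {"保持": 0.6, "拿": 0.4},
}

# BIGRAM[(prev, c)] approximates P(c | prev); a stand-in for the paper's language model.
BIGRAM = {
    ("请", "保持"): 0.30,
    ("请", "拿"): 0.05,
}


def normalize_word(english, prev_word, floor=1e-6):
    """Return the candidate c maximizing log P(e | c) + log P(c | prev_word)."""
    candidates = CHANNEL.get(english.lower(), {})
    if not candidates:
        # No known translation: leave the word unchanged.
        return english

    def score(c):
        # Unseen bigrams fall back to a small floor probability.
        lm = BIGRAM.get((prev_word, c), floor)
        return math.log(candidates[c]) + math.log(lm)

    return max(candidates, key=score)


result = normalize_word("hold", "请")
```

Working in log space keeps the product of small probabilities numerically stable. Words with no entry in the channel table are returned as-is, which loosely mirrors the separate treatment of out-of-vocabulary words described above.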

Index Terms

Chinese-English mixed text normalization
1. Information systems
  1. Information retrieval

Published in
WSDM '14: Proceedings of the 7th ACM international conference on Web search and data mining
February 2014
712 pages
ISBN:9781450323512
DOI:10.1145/2556195
General Chairs: Ben Carterette (University of Delaware, USA), Fernando Diaz (Microsoft Research, USA)
Program Chairs: Carlos Castillo (Qatar Computing Research Institute, Qatar), Donald Metzler (Google, USA)
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
chinese-english mixed text
user aware neural network language model
words normalization
Qualifiers
- research-article
Acceptance Rates
WSDM '14 Paper Acceptance Rate: 64 of 355 submissions, 18%
Overall Acceptance Rate: 498 of 2,863 submissions, 17%