
Model Generation of Accented Speech using Model Transformation and Verification for Bilingual Speech Recognition

Published: 20 April 2015

Abstract

In many real-world applications, bilingual and multilingual speech recognition must cope with accents introduced by non-native speech. Modeling such accents is challenging because the acoustic properties of highly accented speech produced by non-native speakers diverge widely. This study aims to generate highly Mandarin-accented English models for speakers whose mother tongue is Mandarin. First, a two-stage, state-based verification method is proposed to automatically extract state-level, highly accented speech segments; acoustic features and then articulatory features are used to verify the extracted segments robustly. Second, the Gaussian components of the highly accented speech models are generated from the corresponding Gaussian components of the native speech models using a linear transformation function. To deal with data sparseness, a decision tree is constructed to categorize the transformation functions and to retrieve a suitable one. Third, a discrimination function is applied to verify the generated accented acoustic models. Finally, the successfully verified accented English models are integrated into the native bilingual phone model set for Mandarin-English bilingual speech recognition. Experimental results show that the proposed approach effectively alleviates the recognition degradation caused by accents, yielding absolute word-accuracy improvements of 4.1%, 1.8%, and 2.7% for bilingual speech recognition over the traditional ASR approach, the MAP-adapted ASR method, and the MLLR-adapted ASR method, respectively.
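
As a rough illustration of the second step, the sketch below maps one Gaussian component of a native model through an affine map x -> A x + b, which sends N(mu, Sigma) to N(A mu + b, A Sigma A^T). The abstract only states that a linear transformation function is used, so the affine form, the function name transform_gaussian, and the toy values of A and b are assumptions made for illustration, not the paper's actual parameterization or estimation procedure.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def mat_vec(a: Matrix, x: Vector) -> Vector:
    """Multiply matrix a by column vector x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply matrix a by matrix b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a: Matrix) -> Matrix:
    """Return the transpose of matrix a."""
    return [list(col) for col in zip(*a)]


def transform_gaussian(
    mean: Vector, cov: Matrix, a: Matrix, b: Vector
) -> Tuple[Vector, Matrix]:
    """Map N(mean, cov) through x -> A x + b (assumed form of the transform).

    The new mean is A*mean + b and the new covariance is A*cov*A^T.
    """
    new_mean = [m + b_i for m, b_i in zip(mat_vec(a, mean), b)]
    new_cov = mat_mul(mat_mul(a, cov), transpose(a))
    return new_mean, new_cov


# Toy 2-D "native" Gaussian component and a hypothetical transform (A, b).
native_mean = [1.0, 2.0]
native_cov = [[1.0, 0.0], [0.0, 4.0]]
A = [[2.0, 0.0], [0.0, 0.5]]
b = [0.5, -1.0]

accented_mean, accented_cov = transform_gaussian(native_mean, native_cov, A, b)
print(accented_mean)
print(accented_cov)
```

In a real system the feature vectors would have many more dimensions and A and b would be estimated from the verified accented segments; the decision tree described above would then select which estimated transform to apply when a state has too little accented data of its own.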


Cited By

  • Bilingual Automatic Speech Recognition: A Review, Taxonomy and Open Challenges. IEEE Access 11 (2023), 5944-5954. DOI: 10.1109/ACCESS.2022.3218684


    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 14, Issue 2
    March 2015
    96 pages
    ISSN:2375-4699
    EISSN:2375-4702
    DOI:10.1145/2764912
    © 2015 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 20 April 2015
    Accepted: 01 August 2014
    Revised: 01 June 2014
    Received: 01 August 2013
    Published in TALLIP Volume 14, Issue 2


    Author Tags

    1. Accented speech
    2. articulatory feature
    3. bilingual speech recognition

    Qualifiers

    • Research-article
    • Research
    • Refereed
