ABSTRACT
We like to converse with other people using both sound and vision, as our perception of speech is bimodal. Since both modalities essentially convey the same speech structure, we manage to integrate them and often understand the message better than with our eyes closed. In this work we aim to learn more about the visual side of speech, known as lip-reading, and to exploit it towards better automatic speech recognition systems. Recent developments in machine learning, together with the release of suitable audio-visual datasets aimed at large vocabulary continuous speech recognition, have revived interest in lip-reading and allow us to address the recurring question of how best to integrate visual and acoustic speech.
Index Terms
Large Vocabulary Continuous Audio-Visual Speech Recognition