DOI: 10.1145/3242969.3264976

Large Vocabulary Continuous Audio-Visual Speech Recognition

Published: 02 October 2018

ABSTRACT

We like to converse with other people using both sound and sight, as our perception of speech is bimodal. Since the two modalities essentially echo the same speech structure, we integrate them and often understand a message better than we would with our eyes closed. In this work we aim to learn more about the visual nature of speech, known as lip-reading, and to exploit it towards better automatic speech recognition systems. Recent developments in machine learning, together with the release of audio-visual datasets suited to large vocabulary continuous speech recognition, have renewed interest in lip-reading and allow us to address the recurring question of how best to integrate visual and acoustic speech.
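The recurring question above, how best to integrate the visual and acoustic streams, is ultimately an architectural one. As a concrete illustration only, and not the model proposed in this work, the sketch below shows one common baseline for audio-visual fusion: each modality is encoded by its own recurrent network, and the per-frame encodings are concatenated before classification, e.g. for CTC-style training. All feature choices, dimensions, and names (AVFusionEncoder, the 80-dim filterbank input, the 256-dim lip-region embedding) are assumptions made for the example.

    import torch
    import torch.nn as nn

    class AVFusionEncoder(nn.Module):
        """Frame-level audio-visual fusion: encode each stream, concatenate, classify."""
        def __init__(self, audio_dim=80, video_dim=256, hidden_dim=320, vocab_size=40):
            super().__init__()
            # One recurrent encoder per modality. Inputs are assumed to be
            # pre-extracted, time-aligned features: e.g. log-mel filterbanks
            # for audio and lip-region CNN embeddings for video.
            self.audio_rnn = nn.LSTM(audio_dim, hidden_dim, batch_first=True)
            self.video_rnn = nn.LSTM(video_dim, hidden_dim, batch_first=True)
            # Project the fused encoding to per-frame symbol logits.
            self.classifier = nn.Linear(2 * hidden_dim, vocab_size)

        def forward(self, audio, video):
            # audio: (batch, time, audio_dim); video: (batch, time, video_dim)
            a, _ = self.audio_rnn(audio)
            v, _ = self.video_rnn(video)
            fused = torch.cat([a, v], dim=-1)  # frame-level concatenation
            return self.classifier(fused)      # (batch, time, vocab_size)

    # Example: a 2-second utterance at 100 frames per second.
    model = AVFusionEncoder()
    logits = model(torch.randn(1, 200, 80), torch.randn(1, 200, 256))
    print(logits.shape)  # torch.Size([1, 200, 40])

More refined strategies, such as attention-based fusion over the two encodings, replace the plain concatenation step; the trade-offs between such schemes are precisely what the newly released datasets make it possible to study at scale.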



• Published in

  ICMI '18: Proceedings of the 20th ACM International Conference on Multimodal Interaction
  October 2018
  687 pages
  ISBN: 9781450356923
  DOI: 10.1145/3242969

              Copyright © 2018 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

              Publisher

              Association for Computing Machinery

              New York, NY, United States



              Qualifiers

              • research-article

              Acceptance Rates

  ICMI '18 Paper Acceptance Rate: 63 of 149 submissions, 42%
  Overall Acceptance Rate: 453 of 1,080 submissions, 42%
