DOI: 10.1145/1290128.1290138
Article

Local spatiotemporal descriptors for visual recognition of spoken phrases

Published: 28 September 2007

Abstract

Visual speech information plays an important role in speech recognition under noisy conditions or for listeners with hearing impairment. In this paper, we propose local spatiotemporal descriptors to represent and recognize spoken isolated phrases based solely on visual input. Eye positions determined by a robust face and eye detector are used to localize the mouth regions in face images. Spatiotemporal local binary patterns extracted from these regions are used to describe the phrase sequences. In our experiments with 817 sequences from ten phrases and 20 speakers, promising accuracies of 62% and 70% were obtained in speaker-independent and speaker-dependent recognition, respectively. In a comparison with other methods on the Tulips1 audio-visual database, our method's accuracy of 92.7% clearly outperforms the others. Advantages of our approach include local processing and robustness to monotonic gray-scale changes. Moreover, no error-prone segmentation of the moving lips is needed.
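The descriptors build on the local binary pattern (LBP) operator: each pixel's neighborhood is thresholded against the center pixel and the resulting bits form a code, and a histogram of codes describes a region. Below is a minimal sketch of the basic spatial 8-neighbor LBP only, for illustration; the paper's spatiotemporal descriptor extends this idea to the XY, XT, and YT planes of the mouth-region video volume, which this sketch does not implement. All function names here are our own, not from the paper's code.

```python
import numpy as np

def lbp_8neighbors(img):
    """Basic 8-neighbor LBP: compare each pixel's 3x3 neighbors
    with the center pixel and pack the comparison bits into a
    code in 0..255. Borders are skipped, so the output is
    (H-2, W-2) for an (H, W) input."""
    img = np.asarray(img, dtype=np.int32)
    c = img[1:-1, 1:-1]  # center pixels
    # neighbor offsets, clockwise from the top-left corner
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(offsets):
        n = img[1 + dy:img.shape[0] - 1 + dy,
                1 + dx:img.shape[1] - 1 + dx]
        code |= (n >= c).astype(np.int32) << bit
    return code

def lbp_histogram(img, bins=256):
    """Normalized histogram of LBP codes -- the region descriptor
    that would be computed per block of the mouth region."""
    codes = lbp_8neighbors(img)
    hist, _ = np.histogram(codes, bins=bins, range=(0, bins))
    return hist / hist.sum()
```

Because only the ordering of gray values relative to the center matters, the codes are unchanged by any monotonic gray-scale transformation, which is the robustness property the abstract refers to.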




Published In

HCM '07: Proceedings of the international workshop on Human-centered multimedia
September 2007
112 pages
ISBN:9781595937810
DOI:10.1145/1290128
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. face and eye detection
  2. local spatiotemporal descriptors
  3. mouth region localization
  4. visual speech recognition

Conference

MM07: The 15th ACM International Conference on Multimedia, 2007
September 28, 2007
Augsburg, Bavaria, Germany

Cited By

  • (2023) Significance of Convolutional Neural Network in View of Lip Reading for Speech-Impaired People. 2023 3rd International Conference on Emerging Frontiers in Electrical and Electronic Technologies (ICEFEET), pp. 1-6. DOI: 10.1109/ICEFEET59656.2023.10452196. Online publication date: 21-Dec-2023.
  • (2022) A Novel Machine Lip Reading Model. Procedia Computer Science, 199, pp. 1432-1437. DOI: 10.1016/j.procs.2022.01.181. Online publication date: 2022.
  • (2021) CNN Based Feature Extraction for Visual Speech Recognition in Malayalam. Proceedings of Data Analytics and Management, pp. 1-8. DOI: 10.1007/978-981-16-6285-0_1. Online publication date: 22-Nov-2021.
  • (2020) A Survey of Research on Lipreading Technology. IEEE Access, 8, pp. 204518-204544. DOI: 10.1109/ACCESS.2020.3036865. Online publication date: 2020.
  • (2019) Evaluating dynamic texture descriptors to recognize human iris in video image sequence. Pattern Analysis and Applications. DOI: 10.1007/s10044-019-00836-w. Online publication date: 26-Jul-2019.
  • (2018) Audiovisual Synchrony Detection with Optimized Audio Features. 2018 IEEE 3rd International Conference on Signal and Image Processing (ICSIP), pp. 377-381. DOI: 10.1109/SIPROCESS.2018.8600424. Online publication date: Jul-2018.
  • (2018) LCANet: End-to-End Lipreading with Cascaded Attention-CTC. 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 548-555. DOI: 10.1109/FG.2018.00088. Online publication date: May-2018.
  • (2018) Rate-Invariant Analysis of Covariance Trajectories. Journal of Mathematical Imaging and Vision, 60(8), pp. 1306-1323. DOI: 10.1007/s10851-018-0814-0. Online publication date: 1-Oct-2018.
  • (2016) On the robustness of audiovisual liveness detection to visual speech animation. 2016 IEEE 8th International Conference on Biometrics Theory, Applications and Systems (BTAS), pp. 1-8. DOI: 10.1109/BTAS.2016.7791161. Online publication date: Sep-2016.
  • (2014) Rate-Invariant Analysis of Trajectories on Riemannian Manifolds with Application in Visual Speech Recognition. Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 620-627. DOI: 10.1109/CVPR.2014.86. Online publication date: 23-Jun-2014.
