DOI: 10.1145/1290128.1290138
Article

Local spatiotemporal descriptors for visual recognition of spoken phrases

Published: 28 September 2007

Abstract

Visual speech information plays an important role in speech recognition under noisy conditions or for listeners with hearing impairment. In this paper, we propose local spatiotemporal descriptors to represent and recognize spoken isolated phrases based solely on visual input. Eye positions determined by a robust face and eye detector are used to localize the mouth regions in face images. Spatiotemporal local binary patterns extracted from these regions are used to describe the phrase sequences. In our experiments with 817 sequences from ten phrases and 20 speakers, promising accuracies of 62% and 70% were obtained in speaker-independent and speaker-dependent recognition, respectively. In a comparison with other methods on the Tulips1 audio-visual database, our method's accuracy of 92.7% clearly outperforms the others. Advantages of our approach include local processing and robustness to monotonic gray-scale changes. Moreover, no error-prone segmentation of the moving lips is needed.
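The descriptors build on the local binary pattern (LBP) operator: each pixel's neighborhood is thresholded against the center pixel and the resulting bits form a code, and a histogram of codes describes a region. Below is a minimal sketch of the basic spatial 8-neighbor LBP only, for illustration; the paper's spatiotemporal descriptor extends this idea to the XY, XT, and YT planes of the mouth-region video volume, which this sketch does not implement. All function names here are our own, not from the paper's code.

```python
import numpy as np

def lbp_8neighbors(img):
    """Basic 8-neighbor LBP: compare each pixel's 3x3 neighbors
    with the center pixel and pack the comparison bits into a
    code in 0..255. Borders are skipped, so the output is
    (H-2, W-2) for an (H, W) input."""
    img = np.asarray(img, dtype=np.int32)
    c = img[1:-1, 1:-1]  # center pixels
    # neighbor offsets, clockwise from the top-left corner
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(offsets):
        n = img[1 + dy:img.shape[0] - 1 + dy,
                1 + dx:img.shape[1] - 1 + dx]
        code |= (n >= c).astype(np.int32) << bit
    return code

def lbp_histogram(img, bins=256):
    """Normalized histogram of LBP codes -- the region descriptor
    that would be computed per block of the mouth region."""
    codes = lbp_8neighbors(img)
    hist, _ = np.histogram(codes, bins=bins, range=(0, bins))
    return hist / hist.sum()
```

Because only the ordering of gray values relative to the center matters, the codes are unchanged by any monotonic gray-scale transformation, which is the robustness property the abstract refers to.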




Published In

HCM '07: Proceedings of the international workshop on Human-centered multimedia
September 2007
112 pages
ISBN:9781595937810
DOI:10.1145/1290128
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. face and eye detection
  2. local spatiotemporal descriptors
  3. mouth region localization
  4. visual speech recognition

Conference

MM07: The 15th ACM International Conference on Multimedia, 2007
September 28, 2007
Augsburg, Bavaria, Germany

Cited By

  • (2023) Significance of Convolutional Neural Network in View of Lip Reading for Speech-Impaired People. 2023 3rd International Conference on Emerging Frontiers in Electrical and Electronic Technologies (ICEFEET), pp. 1-6. DOI: 10.1109/ICEFEET59656.2023.10452196. Online publication date: 21-Dec-2023.
  • (2022) A Novel Machine Lip Reading Model. Procedia Computer Science, 199, pp. 1432-1437. DOI: 10.1016/j.procs.2022.01.181. Online publication date: 2022.
  • (2021) CNN Based Feature Extraction for Visual Speech Recognition in Malayalam. Proceedings of Data Analytics and Management, pp. 1-8. DOI: 10.1007/978-981-16-6285-0_1. Online publication date: 22-Nov-2021.
  • (2020) A Survey of Research on Lipreading Technology. IEEE Access, 8, pp. 204518-204544. DOI: 10.1109/ACCESS.2020.3036865. Online publication date: 2020.
  • (2019) Evaluating dynamic texture descriptors to recognize human iris in video image sequence. Pattern Analysis and Applications. DOI: 10.1007/s10044-019-00836-w. Online publication date: 26-Jul-2019.
  • (2018) Audiovisual Synchrony Detection with Optimized Audio Features. 2018 IEEE 3rd International Conference on Signal and Image Processing (ICSIP), pp. 377-381. DOI: 10.1109/SIPROCESS.2018.8600424. Online publication date: Jul-2018.
  • (2018) LCANet: End-to-End Lipreading with Cascaded Attention-CTC. 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 548-555. DOI: 10.1109/FG.2018.00088. Online publication date: May-2018.
  • (2018) Rate-Invariant Analysis of Covariance Trajectories. Journal of Mathematical Imaging and Vision, 60(8), pp. 1306-1323. DOI: 10.1007/s10851-018-0814-0. Online publication date: 1-Oct-2018.
  • (2016) On the robustness of audiovisual liveness detection to visual speech animation. 2016 IEEE 8th International Conference on Biometrics Theory, Applications and Systems (BTAS), pp. 1-8. DOI: 10.1109/BTAS.2016.7791161. Online publication date: Sep-2016.
  • (2014) Rate-Invariant Analysis of Trajectories on Riemannian Manifolds with Application in Visual Speech Recognition. Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 620-627. DOI: 10.1109/CVPR.2014.86. Online publication date: 23-Jun-2014.
