DOI: 10.1145/1322192.1322231

Temporal filtering of visual speech for audio-visual speech recognition in acoustically and visually challenging environments

Published: 12 November 2007

Abstract

The use of visual information of speech has been shown to be effective in compensating for the performance degradation of acoustic speech recognition in noisy environments. However, most audio-visual speech recognition systems ignore visual noise, even though it can be introduced into visual speech signals during their acquisition or transmission. In this paper, we present a new temporal filtering technique for the extraction of noise-robust visual features. In the proposed method, a carefully designed band-pass filter is applied to the temporal pixel value sequences of lip region images in order to remove unwanted temporal variations due to visual noise, illumination conditions or speakers' appearances. We demonstrate that the method improves not only visual speech recognition performance for clean and noisy images but also audio-visual speech recognition performance in both acoustically and visually noisy conditions.
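The idea of band-pass filtering each pixel's trajectory across frames can be illustrated with a minimal sketch. The paper's actual filter design is not given here, so the example below stands in with a simple difference-of-moving-averages band-pass (a hypothetical choice, not the authors' filter): the long-window average removes slow drift such as illumination change, and the short-window average suppresses fast frame-to-frame noise.

```python
import numpy as np

def temporal_bandpass(frames, short_win=3, long_win=15):
    """Band-pass each pixel trajectory across time.

    frames: array of shape (T, H, W) -- grayscale lip-region images.
    A difference of moving averages (short minus long window) acts as
    a crude band-pass: the long average removes slow variations
    (illumination, speaker appearance), the short average suppresses
    fast variations (visual noise). Stand-in for the paper's filter.
    """
    T = frames.shape[0]
    x = frames.reshape(T, -1).astype(float)  # (T, H*W) pixel sequences

    def moving_avg(sig, win):
        kernel = np.ones(win) / win
        # filter each pixel's length-T sequence independently
        return np.apply_along_axis(
            lambda s: np.convolve(s, kernel, mode="same"), 0, sig)

    out = moving_avg(x, short_win) - moving_avg(x, long_win)
    return out.reshape(frames.shape)
```

For a static (constant) sequence, the interior of the filtered output is zero, since both averages agree there; only the lip-motion frequencies of interest survive the band-pass.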


Published In

ICMI '07: Proceedings of the 9th International Conference on Multimodal Interfaces
November 2007, 402 pages
ISBN: 9781595938176
DOI: 10.1145/1322192

Publisher

Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. audio-visual speech recognition
      2. feature extraction
      3. hidden Markov model
      4. late integration
      5. neural network
      6. noise-robustness
      7. temporal filtering

      Qualifiers

      • Poster

Conference

ICMI '07: International Conference on Multimodal Interfaces
November 12-15, 2007
Nagoya, Aichi, Japan

      Acceptance Rates

      Overall Acceptance Rate 453 of 1,080 submissions, 42%
