DOI: 10.1145/1452392.1452442

Dynamic modality weighting for multi-stream HMMs in audio-visual speech recognition

Published: 20 October 2008

Abstract

Merging decisions from different modalities is a central problem in audio-visual speech recognition (AVSR). State-synchronous multi-stream HMMs have been proposed to address it, with the important advantage that stream reliability can be incorporated directly into the fusion scheme. This paper focuses on stream weight adaptation based on modality confidence estimators. We assume varied and time-varying environmental noise, as encountered in realistic applications; adaptive methods are best suited to such conditions. Stream reliability is assessed directly from classifier outputs, since these are specific to neither noise type nor noise level. The influence of constraining the weights to sum to one is also discussed.
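For context, the fusion rule behind a state-synchronous multi-stream HMM is the standard weighted combination of per-stream emission likelihoods (a common formulation from the multi-stream ASR literature; the notation here is assumed, not taken from the paper):

b_j(\mathbf{o}_t) = \prod_{s \in \{A,V\}} b_{js}(\mathbf{o}_{s,t})^{\lambda_{s,t}}, \qquad \lambda_{A,t} + \lambda_{V,t} = 1,

where b_{js} is the emission likelihood of stream s (audio or video) in state j and \lambda_{s,t} is the weight of stream s at frame t; in the log domain the product becomes a weighted sum of per-stream log-likelihoods. As a minimal sketch of how such per-frame weights might be derived from classifier outputs, the snippet below uses inverse-entropy confidence, a common reliability measure in multi-stream ASR; the function names are hypothetical and this is not necessarily the estimator used in the paper.

    import numpy as np

    def inverse_entropy_weights(audio_post: np.ndarray, video_post: np.ndarray):
        """Hypothetical per-frame stream weights from classifier posteriors.

        A low-entropy (confident) posterior earns its stream a larger
        weight; the weights are normalized to sum to one, matching the
        constraint discussed in the abstract.
        """
        def entropy(p):
            p = np.clip(p, 1e-12, 1.0)
            return float(-np.sum(p * np.log(p)))

        conf_a = 1.0 / max(entropy(audio_post), 1e-12)
        conf_v = 1.0 / max(entropy(video_post), 1e-12)
        lam_a = conf_a / (conf_a + conf_v)
        return lam_a, 1.0 - lam_a

    def fused_log_likelihood(log_b_audio: float, log_b_video: float,
                             lam_a: float, lam_v: float) -> float:
        # Log-domain form of the multi-stream emission score:
        # a weighted sum of the per-stream log-likelihoods.
        return lam_a * log_b_audio + lam_v * log_b_video

    # Example: a peaked (confident) audio posterior versus a nearly flat
    # (uncertain) video posterior; the audio stream dominates this frame.
    lam_a, lam_v = inverse_entropy_weights(np.array([0.9, 0.05, 0.05]),
                                           np.array([0.34, 0.33, 0.33]))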




      Published In

      ICMI '08: Proceedings of the 10th international conference on Multimodal interfaces
      October 2008
      322 pages
      ISBN:9781605581989
      DOI:10.1145/1452392

Publisher

Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. audio-visual speech recognition
      2. multi-stream hmm
      3. multimodal fusion
      4. stream reliability

      Qualifiers

      • Research-article

      Conference

      ICMI '08
ICMI '08: International Conference on Multimodal Interfaces
October 20-22, 2008
Chania, Crete, Greece

      Acceptance Rates

Overall acceptance rate: 453 of 1,080 submissions (42%)


      Cited By

• (2023) Component attention network for multimodal dance improvisation recognition. Proceedings of the 25th International Conference on Multimodal Interaction, pages 114-118. DOI: 10.1145/3577190.3614114. Online publication date: 9-Oct-2023.
• (2022) Reliability-Based Large-Vocabulary Audio-Visual Speech Recognition. Sensors, 22(15):5501. DOI: 10.3390/s22155501. Online publication date: 23-Jul-2022.
• (2022) Understanding Political Polarization via Jointly Modeling Users, Connections and Multimodal Contents on Heterogeneous Graphs. Proceedings of the 30th ACM International Conference on Multimedia, pages 4072-4082. DOI: 10.1145/3503161.3547898. Online publication date: 10-Oct-2022.
• (2022) Multimodal Fusion Remote Sensing Image–Audio Retrieval. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 15:6220-6235. DOI: 10.1109/JSTARS.2022.3194076. Online publication date: 2022.
• (2022) A new multi-stream approach using acoustic and visual features for robust speech recognition system. Materials Today: Proceedings, 62:4916-4924. DOI: 10.1016/j.matpr.2022.03.537. Online publication date: 2022.
• (2021) Multimodal Integration for Large-Vocabulary Audio-Visual Speech Recognition. 2020 28th European Signal Processing Conference (EUSIPCO), pages 341-345. DOI: 10.23919/Eusipco47968.2020.9287841. Online publication date: 24-Jan-2021.
• (2021) Audio-Visual Transformer Based Crowd Counting. 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 2249-2259. DOI: 10.1109/ICCVW54120.2021.00254. Online publication date: Oct-2021.
• (2020) Emotions Don't Lie. Proceedings of the 28th ACM International Conference on Multimedia, pages 2823-2832. DOI: 10.1145/3394171.3413570. Online publication date: 12-Oct-2020.
• (2020) Multimedia Intelligence: When Multimedia Meets Artificial Intelligence. IEEE Transactions on Multimedia, 22(7):1823-1835. DOI: 10.1109/TMM.2020.2969791. Online publication date: Jul-2020.
• (2020) Emotional Analysis of Sentences Based on Machine Learning. Big Data Analytics for Cyber-Physical System in Smart City, pages 813-820. DOI: 10.1007/978-981-15-2568-1_111. Online publication date: 12-Jan-2020.
