DOI: 10.1145/1180995.1181013

Co-Adaptation of audio-visual speech and gesture classifiers

Published: 02 November 2006

Abstract

The construction of robust multimodal interfaces often requires large amounts of labeled training data to account for cross-user differences and variation in the environment. In this work, we investigate whether unlabeled training data can be leveraged to build more reliable audio-visual classifiers through co-training, a multi-view learning algorithm. Multimodal tasks are good candidates for multi-view learning, since each modality provides a potentially redundant view to the learning algorithm. We apply co-training to two problems: audio-visual speech unit classification, and user agreement recognition using spoken utterances and head gestures. We demonstrate that multimodal co-training can be used to learn from only a few labeled examples in one or both of the audio-visual modalities. We also propose a co-adaptation algorithm, which adapts existing audio-visual classifiers to a particular user or noise condition by leveraging the redundancy in the unlabeled data.




Information

Published In

ICMI '06: Proceedings of the 8th international conference on Multimodal interfaces
November 2006
404 pages
ISBN: 159593541X
DOI: 10.1145/1180995

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. adaptation
  2. audio-visual speech and gesture
  3. co-training
  4. human-computer interfaces
  5. semi-supervised learning

Qualifiers

  • Article

Conference

ICMI06

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%

Bibliometrics & Citations

Article Metrics

  • Downloads (Last 12 months): 6
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 09 Feb 2025

Cited By

  • (2024) MCLEMCD: multimodal collaborative learning encoder for enhanced music classification from dances. Multimedia Systems, 30:1. DOI: 10.1007/s00530-023-01207-6. Online publication date: 22-Jan-2024.
  • (2023) Around-device finger input on commodity smartwatches with learning guidance through discoverability. International Journal of Human-Computer Studies, 179:C. DOI: 10.1016/j.ijhcs.2023.103105. Online publication date: 1-Nov-2023.
  • (2020) TCGM: An Information-Theoretic Framework for Semi-supervised Multi-modality Learning. Computer Vision – ECCV 2020, 171-188. DOI: 10.1007/978-3-030-58580-8_11. Online publication date: 3-Dec-2020.
  • (2019) Multimodal Machine Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41:2, 423-443. DOI: 10.1109/TPAMI.2018.2798607. Online publication date: 10-Dec-2019.
  • (2018) Challenges and applications in multimodal machine learning. The Handbook of Multimodal-Multisensor Interfaces, 17-48. DOI: 10.1145/3107990.3107993. Online publication date: 1-Oct-2018.
  • (2018) User and context adaptive neural networks for emotion recognition. Neurocomputing, 71:13-15, 2553-2562. DOI: 10.1016/j.neucom.2007.11.043. Online publication date: 31-Dec-2018.
  • (2017) Unsupervised Cross-Modal Deep-Model Adaptation for Audio-Visual Re-identification with Wearable Cameras. 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), 438-445. DOI: 10.1109/ICCVW.2017.59. Online publication date: Oct-2017.
  • (2016) Online Cross-Modal Adaptation for Audio–Visual Person Identification With Wearable Cameras. IEEE Transactions on Human-Machine Systems, 1-12. DOI: 10.1109/THMS.2016.2620110. Online publication date: 2016.
  • (2015) Audiovisual Fusion: Challenges and New Approaches. Proceedings of the IEEE, 103:9, 1635-1653. DOI: 10.1109/JPROC.2015.2459017. Online publication date: Sep-2015.
  • (2012) Semantic kernel forests from multiple taxonomies. Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, 1718-1726. DOI: 10.5555/2999325.2999327. Online publication date: 3-Dec-2012.
