DOI: 10.1145/1180995.1181013

Co-Adaptation of audio-visual speech and gesture classifiers

Published: 02 November 2006

Abstract

The construction of robust multimodal interfaces often requires large amounts of labeled training data to account for cross-user differences and variation in the environment. In this work, we investigate whether unlabeled training data can be leveraged to build more reliable audio-visual classifiers through co-training, a multi-view learning algorithm. Multimodal tasks are good candidates for multi-view learning, since each modality provides a potentially redundant view to the learning algorithm. We apply co-training to two problems: audio-visual speech unit classification, and user agreement recognition using spoken utterances and head gestures. We demonstrate that multimodal co-training can be used to learn from only a few labeled examples in one or both of the audio-visual modalities. We also propose a co-adaptation algorithm, which adapts existing audio-visual classifiers to a particular user or noise condition by leveraging the redundancy in the unlabeled data.




Information

Published In

ICMI '06: Proceedings of the 8th international conference on Multimodal interfaces
November 2006
404 pages
ISBN: 159593541X
DOI: 10.1145/1180995

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. adaptation
  2. audio-visual speech and gesture
  3. co-training
  4. human-computer interfaces
  5. semi-supervised learning

Qualifiers

  • Article

Conference

ICMI06

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%

Bibliometrics & Citations

Article Metrics

  • Downloads (Last 12 months): 6
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 09 Feb 2025

Cited By

  • (2024) MCLEMCD: multimodal collaborative learning encoder for enhanced music classification from dances. Multimedia Systems, 30:1. DOI: 10.1007/s00530-023-01207-6. Online publication date: 22-Jan-2024.
  • (2023) Around-device finger input on commodity smartwatches with learning guidance through discoverability. International Journal of Human-Computer Studies, 179:C. DOI: 10.1016/j.ijhcs.2023.103105. Online publication date: 1-Nov-2023.
  • (2020) TCGM: An Information-Theoretic Framework for Semi-supervised Multi-modality Learning. Computer Vision – ECCV 2020, 171-188. DOI: 10.1007/978-3-030-58580-8_11. Online publication date: 3-Dec-2020.
  • (2019) Multimodal Machine Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41:2, 423-443. DOI: 10.1109/TPAMI.2018.2798607. Online publication date: 10-Dec-2019.
  • (2018) Challenges and applications in multimodal machine learning. The Handbook of Multimodal-Multisensor Interfaces, 17-48. DOI: 10.1145/3107990.3107993. Online publication date: 1-Oct-2018.
  • (2018) User and context adaptive neural networks for emotion recognition. Neurocomputing, 71:13-15, 2553-2562. DOI: 10.1016/j.neucom.2007.11.043. Online publication date: 31-Dec-2018.
  • (2017) Unsupervised Cross-Modal Deep-Model Adaptation for Audio-Visual Re-identification with Wearable Cameras. 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), 438-445. DOI: 10.1109/ICCVW.2017.59. Online publication date: Oct-2017.
  • (2016) Online Cross-Modal Adaptation for Audio–Visual Person Identification With Wearable Cameras. IEEE Transactions on Human-Machine Systems, 1-12. DOI: 10.1109/THMS.2016.2620110. Online publication date: 2016.
  • (2015) Audiovisual Fusion: Challenges and New Approaches. Proceedings of the IEEE, 103:9, 1635-1653. DOI: 10.1109/JPROC.2015.2459017. Online publication date: Sep-2015.
  • (2012) Semantic kernel forests from multiple taxonomies. Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, 1718-1726. DOI: 10.5555/2999325.2999327. Online publication date: 3-Dec-2012.
