ABSTRACT
Content analysis of clips containing people speaking involves processing informative cues from different modalities. These cues are typically the words extracted from the audio modality and the identities of the persons appearing in the video modality of the clip. To assign these cues efficiently to the person who created them, we propose a Bayesian network model that exploits the extracted feature characteristics, their relations, and their temporal patterns. We use the EM algorithm, in which the E-step estimates the expectation of the complete-data log-likelihood with respect to the hidden variables, namely the identities of the speakers and of the visible persons. In the M-step, the person models that maximize this expectation are computed. This framework produces excellent results and exhibits strong robustness when dealing with low-quality data.
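A minimal sketch of the E-step/M-step alternation described above, assuming diagonal-Gaussian person models over pooled audio-visual cue vectors. The function name em_assign, its parameters, and all variable names are illustrative assumptions, not the paper's exact Bayesian-network formulation.

```python
# Minimal EM sketch: assign N cue vectors (audio words / face features) to K person models.
# Hidden variable: which person produced each cue. Person models assumed diagonal Gaussian.
import numpy as np

def em_assign(cues, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N, D = cues.shape
    # Initialise means from randomly chosen cues, unit variances, uniform priors.
    means = cues[rng.choice(N, K, replace=False)]
    variances = np.ones((K, D))
    priors = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E-step: posterior probability that person k produced cue n.
        log_lik = -0.5 * (((cues[:, None, :] - means) ** 2) / variances
                          + np.log(2 * np.pi * variances)).sum(axis=2)
        log_post = np.log(priors) + log_lik
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

        # M-step: re-estimate the person models that maximise the expected
        # complete-data log-likelihood under the E-step posteriors.
        weights = post.sum(axis=0)                       # effective count per person
        means = (post.T @ cues) / weights[:, None]
        diff2 = (cues[:, None, :] - means) ** 2
        variances = (post[:, :, None] * diff2).sum(axis=0) / weights[:, None]
        variances = np.maximum(variances, 1e-6)          # numerical floor
        priors = weights / N

    # Hard assignment of each cue to its most likely person, plus the learned models.
    return post.argmax(axis=1), means
```

In this toy setting each cue is treated independently; the paper's model additionally exploits the relations and temporal patterns between cues, which would enter the E-step through the Bayesian network structure rather than a per-cue posterior.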
Index Terms
- EM detection of common origin of multi-modal cues