ABSTRACT
Content analysis of clips containing people speaking involves processing informative cues from different modalities. These cues are typically the words extracted from the audio modality and the identities of the persons appearing in the video modality of the clip. To assign these cues efficiently to the person who created them, we propose a Bayesian network model that exploits the extracted feature characteristics, their relations, and their temporal patterns. We use the EM algorithm, in which the E-step estimates the expectation of the complete-data log-likelihood with respect to the hidden variables, namely the identities of the speakers and of the visible persons. In the M-step, the person models that maximize this expectation are computed. This framework produces excellent results and exhibits strong robustness when dealing with low-quality data.
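A minimal sketch of the E-step/M-step alternation described above, assuming diagonal-Gaussian person models over pooled audio-visual cue vectors. The function name em_assign, its parameters, and all variable names are illustrative assumptions, not the paper's exact Bayesian-network formulation.

```python
# Minimal EM sketch: assign N cue vectors (audio words / face features) to K person models.
# Hidden variable: which person produced each cue. Person models assumed diagonal Gaussian.
import numpy as np

def em_assign(cues, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N, D = cues.shape
    # Initialise means from randomly chosen cues, unit variances, uniform priors.
    means = cues[rng.choice(N, K, replace=False)]
    variances = np.ones((K, D))
    priors = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E-step: posterior probability that person k produced cue n.
        log_lik = -0.5 * (((cues[:, None, :] - means) ** 2) / variances
                          + np.log(2 * np.pi * variances)).sum(axis=2)
        log_post = np.log(priors) + log_lik
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

        # M-step: re-estimate the person models that maximise the expected
        # complete-data log-likelihood under the E-step posteriors.
        weights = post.sum(axis=0)                       # effective count per person
        means = (post.T @ cues) / weights[:, None]
        diff2 = (cues[:, None, :] - means) ** 2
        variances = (post[:, :, None] * diff2).sum(axis=0) / weights[:, None]
        variances = np.maximum(variances, 1e-6)          # numerical floor
        priors = weights / N

    # Hard assignment of each cue to its most likely person, plus the learned models.
    return post.argmax(axis=1), means
```

In this toy setting each cue is treated independently; the paper's model additionally exploits the relations and temporal patterns between cues, which would enter the E-step through the Bayesian network structure rather than a per-cue posterior.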
Index Terms
- EM detection of common origin of multi-modal cues