ABSTRACT
Emotion recognition is a challenging task because of the emotional gap between subjective emotion and low-level audio-visual features. Inspired by the recent success of deep learning in bridging the semantic gap, this paper proposes to bridge the emotional gap with a multimodal Deep Convolutional Neural Network (DCNN) that fuses audio and visual cues in a deep model. This multimodal DCNN is trained in two stages. First, two DCNN models pre-trained on large-scale image data are fine-tuned to perform audio and visual emotion recognition, respectively, on the corresponding labeled speech and face data. Second, the outputs of these two DCNNs are integrated in a fusion network built from a number of fully-connected layers, which is trained to obtain a joint audio-visual feature representation for emotion recognition. Experimental results on the RML audio-visual database demonstrate the promising performance of the proposed method. To the best of our knowledge, this is an early work fusing audio and visual cues in a DCNN for emotion recognition, and its success warrants further research in this direction.
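The two-stage pipeline described above can be sketched as follows. This is a minimal, illustrative numpy sketch, not the authors' implementation: the branch functions stand in for the penultimate-layer outputs of the two fine-tuned DCNNs, the feature dimensions (128 per branch, 64 hidden units) are assumptions, and the randomly initialized weights would in practice be learned in the second training stage. The six emotion classes follow the basic-emotion labeling of the RML database.

```python
import numpy as np

rng = np.random.default_rng(0)

# Six basic emotion classes, as labeled in the RML audio-visual database.
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Stand-ins for the two fine-tuned DCNN branches (stage one): each maps its
# raw input to a fixed-length feature vector. Dimensions are assumptions.
def audio_branch(spectrogram):          # e.g. a speech spectrogram patch
    W = rng.standard_normal((spectrogram.size, 128)) * 0.01
    return relu(spectrogram.ravel() @ W)

def visual_branch(face_image):          # e.g. a cropped face frame
    W = rng.standard_normal((face_image.size, 128)) * 0.01
    return relu(face_image.ravel() @ W)

def fusion_network(audio_feat, visual_feat):
    # Stage two: concatenate the branch outputs and pass them through
    # fully-connected layers to form a joint audio-visual representation.
    joint = np.concatenate([audio_feat, visual_feat])   # 256-d joint feature
    W1 = rng.standard_normal((joint.size, 64)) * 0.01
    W2 = rng.standard_normal((64, len(EMOTIONS))) * 0.01
    hidden = relu(joint @ W1)
    return softmax(hidden @ W2)                         # class probabilities

probs = fusion_network(audio_branch(rng.standard_normal((64, 64))),
                       visual_branch(rng.standard_normal((96, 96))))
print(EMOTIONS[int(np.argmax(probs))])
```

In a trainable version, the branch and fusion weights would be learned jointly by backpropagating a cross-entropy loss through the fusion layers (and, in the fine-tuning stage, through the pre-trained branches).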