ABSTRACT
Emotion recognition is a challenging task because of the emotional gap between subjective emotion and low-level audio-visual features. Inspired by the recent success of deep learning in bridging the semantic gap, this paper proposes to bridge the emotional gap with a multimodal Deep Convolutional Neural Network (DCNN) that fuses audio and visual cues in a deep model. This multimodal DCNN is trained in two stages. First, two DCNN models pre-trained on large-scale image data are fine-tuned to perform audio and visual emotion recognition, respectively, on the corresponding labeled speech and face data. Second, the outputs of these two DCNNs are integrated in a fusion network built from a number of fully-connected layers, which is trained to obtain a joint audio-visual feature representation for emotion recognition. Experimental results on the RML audio-visual database demonstrate the promising performance of the proposed method. To the best of our knowledge, this is an early work fusing audio and visual cues in a DCNN for emotion recognition, and its success warrants further research in this direction.
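The two-stage pipeline described above can be sketched as follows. This is a minimal, illustrative numpy sketch, not the authors' implementation: the branch functions stand in for the penultimate-layer outputs of the two fine-tuned DCNNs, the feature dimensions (128 per branch, 64 hidden units) are assumptions, and the randomly initialized weights would in practice be learned in the second training stage. The six emotion classes follow the basic-emotion labeling of the RML database.

```python
import numpy as np

rng = np.random.default_rng(0)

# Six basic emotion classes, as labeled in the RML audio-visual database.
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Stand-ins for the two fine-tuned DCNN branches (stage one): each maps its
# raw input to a fixed-length feature vector. Dimensions are assumptions.
def audio_branch(spectrogram):          # e.g. a speech spectrogram patch
    W = rng.standard_normal((spectrogram.size, 128)) * 0.01
    return relu(spectrogram.ravel() @ W)

def visual_branch(face_image):          # e.g. a cropped face frame
    W = rng.standard_normal((face_image.size, 128)) * 0.01
    return relu(face_image.ravel() @ W)

def fusion_network(audio_feat, visual_feat):
    # Stage two: concatenate the branch outputs and pass them through
    # fully-connected layers to form a joint audio-visual representation.
    joint = np.concatenate([audio_feat, visual_feat])   # 256-d joint feature
    W1 = rng.standard_normal((joint.size, 64)) * 0.01
    W2 = rng.standard_normal((64, len(EMOTIONS))) * 0.01
    hidden = relu(joint @ W1)
    return softmax(hidden @ W2)                         # class probabilities

probs = fusion_network(audio_branch(rng.standard_normal((64, 64))),
                       visual_branch(rng.standard_normal((96, 96))))
print(EMOTIONS[int(np.argmax(probs))])
```

In a trainable version, the branch and fusion weights would be learned jointly by backpropagating a cross-entropy loss through the fusion layers (and, in the fine-tuning stage, through the pre-trained branches).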