DOI: 10.1145/2911996.2912051
short-paper

Multimodal Deep Convolutional Neural Network for Audio-Visual Emotion Recognition

Published: 06 June 2016

ABSTRACT

Emotion recognition is challenging because of the emotional gap between subjective emotions and low-level audio-visual features. Inspired by the recent success of deep learning in bridging the semantic gap, this paper proposes to bridge the emotional gap with a multimodal Deep Convolutional Neural Network (DCNN) that fuses audio and visual cues in a single deep model. The multimodal DCNN is trained in two stages. First, two DCNN models pre-trained on large-scale image data are fine-tuned to perform audio and visual emotion recognition, respectively, on the corresponding labeled speech and face data. Second, the outputs of these two DCNNs are integrated in a fusion network built from fully-connected layers, which is trained to learn a joint audio-visual feature representation for emotion recognition. Experimental results on the RML audio-visual database demonstrate the promising performance of the proposed method. To the best of our knowledge, this is an early work fusing audio and visual cues in a DCNN for emotion recognition, and its success warrants further research in this direction.
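The two-stage design in the abstract can be sketched in code: per-modality DCNNs (stage one) whose outputs are concatenated and fed through fully-connected fusion layers (stage two). The sketch below, in PyTorch, is a minimal illustration under assumptions not taken from the paper: the layer sizes, the feature dimension, the input resolution, and the six-class output (the six basic emotions in the RML database) are all placeholders, and the small `ModalityCNN` stands in for the large pre-trained, fine-tuned networks the paper actually uses.

```python
# Hedged sketch of the two-stage multimodal DCNN described in the abstract.
# All layer sizes and the 6-class output are assumptions for illustration.
import torch
import torch.nn as nn


class ModalityCNN(nn.Module):
    """Stand-in for one fine-tuned DCNN (speech or face branch, stage one)."""

    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),  # pool to a fixed 4x4 spatial grid
        )
        self.fc = nn.Linear(16 * 4 * 4, feat_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))


class FusionNet(nn.Module):
    """Fully-connected fusion of the two branch outputs (stage two)."""

    def __init__(self, feat_dim=128, num_classes=6):
        super().__init__()
        self.audio_net = ModalityCNN(feat_dim)
        self.visual_net = ModalityCNN(feat_dim)
        self.fusion = nn.Sequential(
            nn.Linear(2 * feat_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, audio_img, face_img):
        # Concatenate the two modality representations, then classify jointly.
        joint = torch.cat(
            [self.audio_net(audio_img), self.visual_net(face_img)], dim=1
        )
        return self.fusion(joint)


model = FusionNet()
# Dummy batch of 2: an image-like audio representation and a face crop.
logits = model(torch.randn(2, 3, 32, 32), torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 6])
```

In stage-two training, the fusion layers (and optionally both branches) would be optimized end-to-end with a standard cross-entropy loss on the emotion labels.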


    • Published in

      ICMR '16: Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval
June 2016, 452 pages
ISBN: 9781450343596
DOI: 10.1145/2911996

      Copyright © 2016 ACM


Publisher

Association for Computing Machinery, New York, NY, United States



      Acceptance Rates

ICMR '16 paper acceptance rate: 20 of 120 submissions (17%). Overall acceptance rate: 254 of 830 submissions (31%).

