ABSTRACT
Understanding nonverbal behaviors in human-machine interaction is a complex and challenging task. One key aspect is recognizing human emotional states accurately. This paper presents our contribution to the Audio/Visual Emotion Challenge (AVEC '14), whose goal is to predict the continuous values of the emotion dimensions arousal, valence and dominance at each moment in time. The proposed method uses deep belief network based models to recognize emotional states from the audio and visual modalities. First, we employ temporal pooling functions in the deep neural network to encode dynamic information in the features, which provides temporal modeling on the first time scale. Second, we combine the predicted results from the different modalities with emotion temporal context information simultaneously; this multimodal-temporal fusion provides temporal modeling of the emotion states on the second time scale. Experimental results demonstrate the effectiveness of each key component of the proposed method, and competitive results are obtained.
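The first-scale temporal modeling described above pools frame-level features over short windows before they enter the network. A minimal sketch of that idea is below; the choice of mean, max and standard-deviation statistics and the window/hop sizes are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def temporal_pool(frames, window, hop):
    """Pool frame-level features over sliding windows to encode
    short-term dynamics. `frames` is (T, D): T frames of D-dim
    features. Returns one vector per window, concatenating mean,
    max and standard-deviation statistics over the window."""
    pooled = []
    for start in range(0, len(frames) - window + 1, hop):
        seg = frames[start:start + window]
        pooled.append(np.concatenate([seg.mean(axis=0),
                                      seg.max(axis=0),
                                      seg.std(axis=0)]))
    return np.stack(pooled)

# Example: 100 frames of 20-dim features, 25-frame windows, hop of 5.
feats = np.random.randn(100, 20)
pooled = temporal_pool(feats, window=25, hop=5)
print(pooled.shape)  # (16, 60)
```

Each pooled vector summarizes the dynamics of one window, so the downstream network sees a sequence that is both shorter and richer in temporal information than the raw frames.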
Index Terms
- Multi-scale Temporal Modeling for Dimensional Emotion Recognition in Video