ABSTRACT
This paper presents a novel approach to continuous emotion prediction that characterizes dimensional emotion labels jointly with continuous and discretized representations. Continuous emotion labels can capture subtle emotion variations, but their inherent noise often harms model training. Recent work has found a performance gain from converting the continuous labels into a discrete set (e.g., using k-means clustering), despite the resulting label quantization error. To find the optimal trade-off between the continuous and discretized emotion representations, we investigate two joint modeling approaches: ensemble and end-to-end. The ensemble model combines the predictions of two separately trained models, one producing discretized predictions and the other continuous predictions. The end-to-end model, in contrast, is trained to simultaneously optimize both the discretized and continuous prediction tasks as well as the final combination of the two. Our experimental results using a state-of-the-art deep BLSTM network on the RECOLA dataset demonstrate that (i) the joint representation outperforms both individual-representation baselines and the state-of-the-art speech-based results on RECOLA, validating the assumption that combining continuous and discretized emotion representations yields better emotion prediction; and (ii) the joint representation can accelerate convergence, particularly for valence prediction. Our work provides insights into joint discrete and continuous emotion representation and its efficacy for describing dynamically changing affective behavior in valence and activation prediction.
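The ensemble idea described above can be illustrated with a minimal sketch. This is not the paper's implementation (which uses a deep BLSTM on RECOLA features); it only shows, under assumed toy data, the two generic ingredients the abstract names: k-means discretization of a continuous label trace, and late fusion of a continuous-branch prediction with a discretized-branch prediction (classes mapped back to cluster centres), scored with the concordance correlation coefficient (CCC) standard for RECOLA. The fusion weight `alpha` and the toy signal are illustrative assumptions.

```python
# Hedged sketch, NOT the paper's BLSTM system: k-means label discretization
# plus weighted late fusion of a continuous and a discretized branch.
import numpy as np

def kmeans_1d(values, k, iters=50, seed=0):
    """Tiny 1-D k-means; returns cluster centres and per-value assignments."""
    rng = np.random.default_rng(seed)
    centres = rng.choice(values, size=k, replace=False)
    for _ in range(iters):
        assign = np.argmin(np.abs(values[:, None] - centres[None, :]), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centres[j] = values[assign == j].mean()
    return centres, assign

def ccc(x, y):
    """Concordance correlation coefficient (Lin, 1989)."""
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return 2 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

# Toy continuous valence trace and a noisy "continuous branch" prediction.
rng = np.random.default_rng(1)
labels = np.sin(np.linspace(0, 6, 200))
cont_pred = labels + 0.3 * rng.normal(size=200)

# "Discretized branch": quantize labels to k cluster centres, then map the
# (here: oracle) class assignments back to centre values for regression.
centres, assign = kmeans_1d(labels, k=4)
disc_pred = centres[assign]

# Late fusion: weighted average of the two branches (alpha is an assumption).
alpha = 0.5
fused = alpha * cont_pred + (1 - alpha) * disc_pred
print(round(ccc(labels, fused), 3))
```

In the actual ensemble model the discretized branch would be a trained classifier rather than an oracle quantizer, and the combination weights would be tuned on development data; the sketch only shows why fusing the low-noise quantized trace with the fine-grained continuous trace can trade quantization error against label noise.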
Index Terms
- Joint Discrete and Continuous Emotion Prediction Using Ensemble and End-to-End Approaches