DOI: 10.1145/3242969.3242972
research-article

Joint Discrete and Continuous Emotion Prediction Using Ensemble and End-to-End Approaches

Published: 02 October 2018

ABSTRACT

This paper presents a novel approach to continuous emotion prediction that characterizes dimensional emotion labels jointly with continuous and discretized representations. Continuous emotion labels can capture subtle emotion variations, but their inherent noise often has negative effects on model training. Recent approaches have found a performance gain when converting the continuous labels into a discrete set (e.g., using k-means clustering), despite the resulting label quantization error. To find the optimal trade-off between the continuous and discretized emotion representations, we investigate two joint modeling approaches: ensemble and end-to-end. The ensemble model combines the predictions of two separately trained models, one producing discretized predictions and the other continuous predictions. The end-to-end model, in contrast, is trained to simultaneously optimize both the discretized and continuous prediction tasks as well as the final combination between them. Our experimental results using a state-of-the-art deep BLSTM network on the RECOLA dataset demonstrate that (i) the joint representation outperforms both individual-representation baselines and the state-of-the-art speech-based results on RECOLA, validating the assumption that combining continuous and discretized emotion representations yields better emotion prediction performance; and (ii) the joint representation helps to accelerate convergence, particularly for valence prediction. Our work provides insights into joint discrete and continuous emotion representation and its efficacy for describing dynamically changing affective behavior in valence and activation prediction.
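As a rough illustration of the modeling idea described above (not the authors' implementation), the sketch below shows how frame-level continuous labels could be discretized with k-means and how a single BLSTM trunk could feed a continuous regression head, a discretized classification head, and a learned fusion of the two streams, trained jointly end-to-end. The layer sizes, the number of clusters, the CCC-based loss, and all variable names are assumptions made for the example.

# Hedged sketch, not the authors' code: joint discretized + continuous
# emotion prediction with a shared BLSTM trunk in tf.keras.
# All sizes, names, and the CCC loss below are illustrative assumptions.
import numpy as np
import tensorflow as tf
from sklearn.cluster import KMeans

SEQ_LEN, FEAT_DIM, NUM_CLUSTERS = 100, 88, 8  # assumed toy dimensions

def ccc_loss(y_true, y_pred):
    # 1 - concordance correlation coefficient, computed over the batch.
    mu_t, mu_p = tf.reduce_mean(y_true), tf.reduce_mean(y_pred)
    var_t = tf.reduce_mean(tf.square(y_true - mu_t))
    var_p = tf.reduce_mean(tf.square(y_pred - mu_p))
    cov = tf.reduce_mean((y_true - mu_t) * (y_pred - mu_p))
    return 1.0 - 2.0 * cov / (var_t + var_p + tf.square(mu_t - mu_p) + 1e-8)

# Discretize the continuous labels with k-means: each frame-level label is
# mapped to a cluster id, and the cluster centers decode ids back to values.
toy_labels = np.random.uniform(-1.0, 1.0, size=(3200, 1)).astype("float32")
kmeans = KMeans(n_clusters=NUM_CLUSTERS, n_init=10).fit(toy_labels)
centers = kmeans.cluster_centers_.astype("float32")  # shape (NUM_CLUSTERS, 1)

# Shared BLSTM trunk with a continuous head, a discretized head, and a
# learned combination of the two prediction streams (end-to-end variant).
inputs = tf.keras.Input(shape=(SEQ_LEN, FEAT_DIM))
trunk = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True))(inputs)
cont_out = tf.keras.layers.Dense(1, name="continuous")(trunk)
disc_out = tf.keras.layers.Dense(NUM_CLUSTERS, activation="softmax",
                                 name="discrete")(trunk)
# Decode the discrete posterior into a value via the (frozen) cluster centers.
decoded = tf.keras.layers.Dense(1, use_bias=False, trainable=False,
                                name="decode")(disc_out)
fused = tf.keras.layers.Dense(1, name="fusion")(
    tf.keras.layers.Concatenate()([cont_out, decoded]))

model = tf.keras.Model(inputs, [cont_out, disc_out, fused])
model.get_layer("decode").set_weights([centers])
model.compile(optimizer="adam",
              loss={"continuous": ccc_loss,                         # continuous task
                    "discrete": "sparse_categorical_crossentropy",  # discretized task
                    "fusion": ccc_loss})                            # final combination

In the ensemble variant the abstract contrasts with this, the continuous and discretized models would instead be trained separately and their predictions combined after training, for example by averaging or by a simple regression over the two output streams.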


Published in

ICMI '18: Proceedings of the 20th ACM International Conference on Multimodal Interaction
October 2018, 687 pages
ISBN: 9781450356923
DOI: 10.1145/3242969

        Copyright © 2018 ACM


Publisher

Association for Computing Machinery, New York, NY, United States

        Acceptance Rates

ICMI '18 paper acceptance rate: 63 of 149 submissions (42%). Overall acceptance rate: 453 of 1,080 submissions (42%).

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader