ABSTRACT
This paper presents a novel approach to continuous emotion prediction that characterizes dimensional emotion labels jointly with continuous and discretized representations. Continuous emotion labels can capture subtle emotion variations, but their inherent noise often harms model training. Recent work has found a performance gain from converting the continuous labels into a discrete set (e.g., using k-means clustering), despite the resulting label quantization error. To find the optimal trade-off between the continuous and discretized emotion representations, we investigate two joint modeling approaches: ensemble and end-to-end. The ensemble model combines the predictions of two separately trained models, one producing discretized predictions and the other continuous predictions. The end-to-end model, in contrast, is trained to simultaneously optimize both the discretized and continuous prediction tasks as well as the final combination of the two. Our experimental results using a state-of-the-art deep BLSTM network on the RECOLA dataset demonstrate that (i) the joint representation outperforms both individual-representation baselines and the state-of-the-art speech-based results on RECOLA, validating the assumption that combining continuous and discretized emotion representations yields better emotion prediction; and (ii) the joint representation can accelerate convergence, particularly for valence prediction. Our work provides insights into joint discrete and continuous emotion representation and its efficacy for describing dynamically changing affective behavior in valence and activation prediction.
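The ensemble idea described above can be illustrated with a minimal sketch. This is not the paper's implementation (which uses a deep BLSTM on RECOLA features); it only shows, under assumed toy data, the two generic ingredients the abstract names: k-means discretization of a continuous label trace, and late fusion of a continuous-branch prediction with a discretized-branch prediction (classes mapped back to cluster centres), scored with the concordance correlation coefficient (CCC) standard for RECOLA. The fusion weight `alpha` and the toy signal are illustrative assumptions.

```python
# Hedged sketch, NOT the paper's BLSTM system: k-means label discretization
# plus weighted late fusion of a continuous and a discretized branch.
import numpy as np

def kmeans_1d(values, k, iters=50, seed=0):
    """Tiny 1-D k-means; returns cluster centres and per-value assignments."""
    rng = np.random.default_rng(seed)
    centres = rng.choice(values, size=k, replace=False)
    for _ in range(iters):
        assign = np.argmin(np.abs(values[:, None] - centres[None, :]), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centres[j] = values[assign == j].mean()
    return centres, assign

def ccc(x, y):
    """Concordance correlation coefficient (Lin, 1989)."""
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return 2 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

# Toy continuous valence trace and a noisy "continuous branch" prediction.
rng = np.random.default_rng(1)
labels = np.sin(np.linspace(0, 6, 200))
cont_pred = labels + 0.3 * rng.normal(size=200)

# "Discretized branch": quantize labels to k cluster centres, then map the
# (here: oracle) class assignments back to centre values for regression.
centres, assign = kmeans_1d(labels, k=4)
disc_pred = centres[assign]

# Late fusion: weighted average of the two branches (alpha is an assumption).
alpha = 0.5
fused = alpha * cont_pred + (1 - alpha) * disc_pred
print(round(ccc(labels, fused), 3))
```

In the actual ensemble model the discretized branch would be a trained classifier rather than an oracle quantizer, and the combination weights would be tuned on development data; the sketch only shows why fusing the low-noise quantized trace with the fine-grained continuous trace can trade quantization error against label noise.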
Index Terms
- Joint Discrete and Continuous Emotion Prediction Using Ensemble and End-to-End Approaches