ABSTRACT
Understanding nonverbal behaviors in human-machine interaction is a complex and challenging task. One key aspect is recognizing human emotional states accurately. This paper presents our contribution to the Audio/Visual Emotion Challenge (AVEC '14), whose goal is to predict the continuous values of the emotion dimensions arousal, valence and dominance at each moment in time. The proposed method uses deep belief network based models to recognize emotional states from the audio and visual modalities. First, we employ temporal pooling functions in the deep neural network to encode dynamic information in the features, which provides temporal modeling on the first time scale. Second, we combine the predicted results from the different modalities with emotion temporal context information simultaneously; this multimodal-temporal fusion provides temporal modeling of the emotion states on the second time scale. Experimental results demonstrate the effectiveness of each key component of the proposed method, and competitive results are obtained.
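The first-scale temporal modeling described above pools frame-level features over short windows before they enter the network. A minimal sketch of that idea is below; the choice of mean, max and standard-deviation statistics and the window/hop sizes are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def temporal_pool(frames, window, hop):
    """Pool frame-level features over sliding windows to encode
    short-term dynamics. `frames` is (T, D): T frames of D-dim
    features. Returns one vector per window, concatenating mean,
    max and standard-deviation statistics over the window."""
    pooled = []
    for start in range(0, len(frames) - window + 1, hop):
        seg = frames[start:start + window]
        pooled.append(np.concatenate([seg.mean(axis=0),
                                      seg.max(axis=0),
                                      seg.std(axis=0)]))
    return np.stack(pooled)

# Example: 100 frames of 20-dim features, 25-frame windows, hop of 5.
feats = np.random.randn(100, 20)
pooled = temporal_pool(feats, window=25, hop=5)
print(pooled.shape)  # (16, 60)
```

Each pooled vector summarizes the dynamics of one window, so the downstream network sees a sequence that is both shorter and richer in temporal information than the raw frames.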
Index Terms
- Multi-scale Temporal Modeling for Dimensional Emotion Recognition in Video