ABSTRACT
The goal of this paper is to analyze and model the variability in speaking styles in dyadic interactions and to build a predictive algorithm for listener responses that adapts to these different styles. The end result of this research is a virtual human able to automatically respond to a human speaker with appropriate listener responses (e.g., head nods). Our novel speaker-adaptive prediction model is built from a corpus of dyadic interactions in which speaker variability is analyzed to identify a subset of prototypical speaker styles. During a live interaction, our prediction model automatically identifies the closest prototypical speaker style and predicts listener responses based on this "communicative style". Central to our approach is the notion of a "speaker profile", which uniquely identifies each speaker and enables the matching between prototypical speakers and new speakers. The paper demonstrates the merits of our speaker-adaptive listener response prediction model by showing improvement over a state-of-the-art approach that does not adapt to the speaker. Beyond the merits of speaker adaptation, our experiments highlight the importance of using multimodal features when comparing speakers to select the closest prototypical speaker style.
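The matching step described above can be sketched as nearest-prototype selection over speaker profiles. This is a minimal illustration, not the paper's actual method: the feature names, values, and the Euclidean distance metric are assumptions for the sake of the example, and the paper's profiles are built from richer multimodal statistics.

```python
import math

# Hypothetical prototypical speaker styles, each summarized by a "speaker
# profile": a vector of multimodal features (e.g., mean pitch, speech rate,
# gaze ratio). These names and values are illustrative only.
PROTOTYPES = {
    "style_A": [0.8, 0.2, 0.6],
    "style_B": [0.3, 0.7, 0.4],
    "style_C": [0.5, 0.5, 0.9],
}

def closest_prototype(profile, prototypes=PROTOTYPES):
    """Return the prototypical style whose profile is nearest to `profile`.

    Uses Euclidean distance in feature space; the real system could use any
    similarity measure over multimodal speaker profiles.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(prototypes, key=lambda name: dist(profile, prototypes[name]))
```

Once the closest style is found, the listener-response model associated with that prototype would be used to generate feedback (e.g., head nods) for the new speaker.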