ABSTRACT
The goal of this paper is to analyze and model the variability in speaking styles in dyadic interactions and to build a predictive algorithm for listener responses that adapts to these different styles. The end result of this research is a virtual human able to automatically respond to a human speaker with appropriate listener responses (e.g., head nods). Our novel speaker-adaptive prediction model is built from a corpus of dyadic interactions in which speaker variability is analyzed to identify a subset of prototypical speaker styles. During a live interaction, our prediction model automatically identifies the closest prototypical speaker style and predicts listener responses based on this "communicative style". Central to our approach is the notion of a "speaker profile", which uniquely identifies each speaker and enables the matching between prototypical speakers and new speakers. The paper demonstrates the merits of our speaker-adaptive listener response prediction model by showing improvement over a state-of-the-art approach that does not adapt to the speaker. Beyond the merits of speaker adaptation, our experiments highlight the importance of using multimodal features when comparing speakers to select the closest prototypical speaker style.
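The matching step described above can be sketched as nearest-prototype selection over speaker profiles. This is a minimal illustration, not the paper's actual method: the feature names, values, and the Euclidean distance metric are assumptions for the sake of the example, and the paper's profiles are built from richer multimodal statistics.

```python
import math

# Hypothetical prototypical speaker styles, each summarized by a "speaker
# profile": a vector of multimodal features (e.g., mean pitch, speech rate,
# gaze ratio). These names and values are illustrative only.
PROTOTYPES = {
    "style_A": [0.8, 0.2, 0.6],
    "style_B": [0.3, 0.7, 0.4],
    "style_C": [0.5, 0.5, 0.9],
}

def closest_prototype(profile, prototypes=PROTOTYPES):
    """Return the prototypical style whose profile is nearest to `profile`.

    Uses Euclidean distance in feature space; the real system could use any
    similarity measure over multimodal speaker profiles.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(prototypes, key=lambda name: dist(profile, prototypes[name]))
```

Once the closest style is found, the listener-response model associated with that prototype would be used to generate feedback (e.g., head nods) for the new speaker.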