Eye Gaze for Spoken Language Understanding in Multi-modal Conversational Interactions

ABSTRACT
When humans converse with each other, they naturally combine information from multiple modalities (e.g., speech, gestures, speech prosody, facial expressions, and eye gaze). This paper focuses on eye gaze and its combination with speech. We develop a model that resolves references to visual (screen) elements in a conversational web browsing system. The system tracks eye gaze, recognizes speech, and then interprets the user's browsing intent (e.g., clicking on a specific element) through a combination of spoken language understanding and eye gaze tracking. We experiment with multi-turn interactions collected in a Wizard-of-Oz scenario in which users are asked to perform several web-browsing tasks. We compare several gaze features and evaluate their effectiveness when combined with speech-based lexical features. The resulting multi-modal system not only increases user intent (turn) accuracy by 17%, but also improves resolution of the referring-expression ambiguity commonly observed in dialog systems, with a 10% increase in F-measure.
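The abstract describes fusing gaze evidence (where the user is fixating on screen) with lexical evidence from the recognized utterance to decide which screen element a referring expression points to. The sketch below illustrates that general idea only; the ScreenElement and Fixation structures, the specific features, and the hand-set gaze_weight mix are illustrative assumptions, not the paper's actual features or model.

```python
from dataclasses import dataclass

@dataclass
class ScreenElement:
    """A clickable element on the rendered page (hypothetical structure)."""
    element_id: str
    text: str    # visible text of the element
    box: tuple   # (x_min, y_min, x_max, y_max) in screen pixels

@dataclass
class Fixation:
    """A single eye-gaze fixation (hypothetical structure)."""
    x: float
    y: float
    duration_ms: float

def gaze_score(element: ScreenElement, fixations: list) -> float:
    """Total fixation time (ms) landing inside the element's bounding box."""
    x0, y0, x1, y1 = element.box
    return sum(f.duration_ms for f in fixations
               if x0 <= f.x <= x1 and y0 <= f.y <= y1)

def lexical_score(element: ScreenElement, utterance: str) -> float:
    """Fraction of the element's words that also appear in the utterance."""
    elem_words = set(element.text.lower().split())
    utt_words = set(utterance.lower().split())
    return len(elem_words & utt_words) / max(len(elem_words), 1)

def resolve_reference(utterance, fixations, elements, gaze_weight=0.5):
    """Rank candidate elements by a weighted mix of gaze and lexical evidence."""
    # Normalize gaze evidence so it is comparable to the [0, 1] lexical score.
    total_gaze = sum(gaze_score(e, fixations) for e in elements) or 1.0
    scored = []
    for e in elements:
        g = gaze_score(e, fixations) / total_gaze
        l = lexical_score(e, utterance)
        scored.append((gaze_weight * g + (1.0 - gaze_weight) * l, e))
    return max(scored, key=lambda pair: pair[0])[1]

# Example: the user says "click on the sign in button" while fixating near it.
elements = [
    ScreenElement("btn_signin", "Sign in", (800, 40, 900, 80)),
    ScreenElement("lnk_help", "Help and support", (800, 100, 950, 130)),
]
fixations = [Fixation(850, 60, 320.0), Fixation(845, 55, 210.0), Fixation(820, 110, 90.0)]
print(resolve_reference("click on the sign in button", fixations, elements).element_id)
```

In the paper's setting, a trained statistical model over gaze and lexical features would take the place of the fixed linear combination used here for illustration.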