Abstract
Gesture and speech combine to form a rich basis for human conversational interaction. To exploit these modalities in HCI, we need to understand the interplay between them and the ways in which they support communication. We propose a framework for the gesture research done to date, and present our work on cross-modal cues for discourse segmentation in free-form gesticulation accompanying speech in natural conversation as a new paradigm for such multimodal interaction. The basis for this integration is the psycholinguistic concept of the coequal generation of gesture and speech from the same semantic intent. We present a detailed case study of a gesture and speech elicitation experiment in which a subject describes her living space to an interlocutor. We perform two independent sets of analyses on the data: automatic analysis of the video and audio to extract segmentation cues, and expert transcription of the speech and gesture, microanalyzing the videotape with a frame-accurate video player to correlate the speech with the gestural entities. We compare the results of both analyses to identify the cues accessible in the gestural and audio data that correlate well with the expert psycholinguistic analysis. We show that "handedness" and the kind of symmetry in two-handed gestures provide effective suprasegmental discourse cues.
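The handedness and symmetry cues described above can be illustrated with a minimal sketch. Assuming per-frame 2D hand positions (e.g., from a hand tracker), each analysis window can be labeled as rest, one-handed, or two-handed, and two-handed motion split into mirror-symmetric versus parallel by the sign of the correlation between the hands' lateral velocities. The function name, threshold, and classification rule here are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def classify_gesture_window(left, right, move_thresh=1.0):
    """Coarse handedness/symmetry label for a window of hand trajectories.

    left, right: (N, 2) arrays of per-frame (x, y) hand positions.
    Returns one of: 'rest', 'left-hand', 'right-hand', 'mirror', 'parallel'.
    Threshold and rule are illustrative, not taken from the paper.
    """
    lv = np.diff(left, axis=0)   # per-frame displacement of the left hand
    rv = np.diff(right, axis=0)
    l_energy = np.linalg.norm(lv, axis=1).mean()
    r_energy = np.linalg.norm(rv, axis=1).mean()

    l_moving, r_moving = l_energy > move_thresh, r_energy > move_thresh
    if not l_moving and not r_moving:
        return "rest"
    if l_moving != r_moving:
        return "left-hand" if l_moving else "right-hand"

    # Both hands move: compare lateral (x) velocities. Mirror-symmetric
    # gestures move the hands toward/away from the body midline, so their
    # x-velocities anti-correlate; parallel gestures correlate positively.
    corr = np.corrcoef(lv[:, 0], rv[:, 0])[0, 1]
    return "mirror" if corr < 0 else "parallel"

# Synthetic usage: hands oscillating in opposite lateral directions.
t = np.linspace(0, 1, 50)
osc = 50 * np.sin(2 * np.pi * t)
mirror_l = np.column_stack([-osc, np.zeros(50)])
mirror_r = np.column_stack([osc, np.zeros(50)])
print(classify_gesture_window(mirror_l, mirror_r))  # mirror
```

A windowed sequence of such labels gives the kind of symbol stream over which discourse-segment boundaries could be hypothesized wherever the label changes.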