Multimodal human discourse: gesture and speech

Published: 01 September 2002

Abstract

Gesture and speech combine to form a rich basis for human conversational interaction. To exploit these modalities in HCI, we need to understand the interplay between them and the way in which they support communication. We propose a framework for the gesture research done to date, and present our work on cross-modal cues for discourse segmentation in free-form gesticulation accompanying speech in natural conversation as a new paradigm for such multimodal interaction. The basis for this integration is the psycholinguistic concept of the coequal generation of gesture and speech from the same semantic intent. We present a detailed case study of a gesture and speech elicitation experiment in which a subject describes her living space to an interlocutor. We perform two independent analyses of the data: extraction of segmentation cues from the video and audio signals, and expert transcription of the speech and gesture, produced by microanalyzing the videotape with a frame-accurate video player to correlate the speech with the gestural entities. We compare the results of the two analyses to identify the cues in the gestural and audio data that correlate well with the expert psycholinguistic analysis. We show that "handedness" and the kind of symmetry exhibited in two-handed gestures provide effective suprasegmental discourse cues.
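
As an illustration of the kind of cue extraction involved, the sketch below labels windows of tracked hand motion by handedness and two-handed symmetry. It is a minimal sketch, assuming 2-D hand-position tracks from a vision-based tracker; the thresholds, window size, correlation test, and label set are illustrative assumptions, not the analysis pipeline reported here.

import numpy as np

# Illustrative thresholds; real values would be tuned to the tracker and frame rate.
HOLD_THRESH = 2.0   # mean speed (pixels/frame) below which a hand counts as at rest
SYM_THRESH = 0.8    # |correlation| of vertical motion above which hands count as symmetric

def label_windows(left, right, win=15):
    """left, right: (T, 2) arrays of hand positions; returns one label per win-frame window."""
    # Per-frame speed of each hand from frame-to-frame displacement.
    lv = np.linalg.norm(np.diff(left, axis=0), axis=1)
    rv = np.linalg.norm(np.diff(right, axis=0), axis=1)
    labels = []
    for t in range(0, len(lv) - win, win):
        l_active = lv[t:t + win].mean() > HOLD_THRESH
        r_active = rv[t:t + win].mean() > HOLD_THRESH
        if l_active and r_active:
            # Grade symmetry by correlating the hands' vertical displacements.
            dl = np.diff(left[t:t + win + 1, 1])
            dr = np.diff(right[t:t + win + 1, 1])
            c = np.nan_to_num(np.corrcoef(dl, dr)[0, 1])
            if c > SYM_THRESH:
                labels.append("2H-symmetric")      # hands moving together
            elif c < -SYM_THRESH:
                labels.append("2H-antisymmetric")  # hands moving in opposition
            else:
                labels.append("2H-asymmetric")
        elif l_active:
            labels.append("1H-left")
        elif r_active:
            labels.append("1H-right")
        else:
            labels.append("rest")
    return labels

Runs of a stable label, say a stretch of one-handed gesticulation giving way to symmetric two-handed motion, would then mark candidate discourse-unit boundaries to be compared against an expert transcription.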

