Multimodal human discourse: gesture and speech

Published: 01 September 2002

Abstract

Gesture and speech combine to form a rich basis for human conversational interaction. To exploit these modalities in HCI, we need to understand the interplay between them and the way in which they support communication. We propose a framework for the gesture research done to date, and present our work on cross-modal cues for discourse segmentation in free-form gesticulation accompanying speech in natural conversation as a new paradigm for such multimodal interaction. The basis for this integration is the psycholinguistic concept of the coequal generation of gesture and speech from the same semantic intent. We present a detailed case study of a gesture and speech elicitation experiment in which a subject describes her living space to an interlocutor. We perform two independent analyses of the data: extraction of segmentation cues from the video and audio signals, and expert transcription of the speech and gesture, produced by microanalyzing the videotape with a frame-accurate video player to correlate the speech with the gestural entities. We compare the results of the two analyses to identify the cues in the gestural and audio data that correlate well with the expert psycholinguistic analysis. We show that "handedness" and the kind of symmetry exhibited in two-handed gestures provide effective suprasegmental discourse cues.
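
As an illustration of the kind of cue extraction involved, the sketch below labels windows of tracked hand motion by handedness and two-handed symmetry. It is a minimal sketch, assuming 2-D hand-position tracks from a vision-based tracker; the thresholds, window size, correlation test, and label set are illustrative assumptions, not the analysis pipeline reported here.

import numpy as np

# Illustrative thresholds; real values would be tuned to the tracker and frame rate.
HOLD_THRESH = 2.0   # mean speed (pixels/frame) below which a hand counts as at rest
SYM_THRESH = 0.8    # |correlation| of vertical motion above which hands count as symmetric

def label_windows(left, right, win=15):
    """left, right: (T, 2) arrays of hand positions; returns one label per win-frame window."""
    # Per-frame speed of each hand from frame-to-frame displacement.
    lv = np.linalg.norm(np.diff(left, axis=0), axis=1)
    rv = np.linalg.norm(np.diff(right, axis=0), axis=1)
    labels = []
    for t in range(0, len(lv) - win, win):
        l_active = lv[t:t + win].mean() > HOLD_THRESH
        r_active = rv[t:t + win].mean() > HOLD_THRESH
        if l_active and r_active:
            # Grade symmetry by correlating the hands' vertical displacements.
            dl = np.diff(left[t:t + win + 1, 1])
            dr = np.diff(right[t:t + win + 1, 1])
            c = np.nan_to_num(np.corrcoef(dl, dr)[0, 1])
            if c > SYM_THRESH:
                labels.append("2H-symmetric")      # hands moving together
            elif c < -SYM_THRESH:
                labels.append("2H-antisymmetric")  # hands moving in opposition
            else:
                labels.append("2H-asymmetric")
        elif l_active:
            labels.append("1H-left")
        elif r_active:
            labels.append("1H-right")
        else:
            labels.append("rest")
    return labels

Runs of a stable label, say a stretch of one-handed gesticulation giving way to symmetric two-handed motion, would then mark candidate discourse-unit boundaries to be compared against an expert transcription.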

