
A framework for evaluating multimodal integration by humans and a role for embodied conversational agents

Published: 13 October 2004

Abstract

One of the implicit assumptions of multimodal interfaces is that human-computer interaction is significantly facilitated by providing multiple input and output modalities. Surprisingly, however, there is very little theoretical and empirical research testing this assumption in terms of the presentation of multimodal displays to the user. The goal of this paper is to provide both a theoretical and an empirical framework for addressing this important issue. Two contrasting classes of models of human information processing are formulated and tested experimentally. According to integration models, multiple sensory influences are continuously combined during categorization, leading to perceptual experience and action. The Fuzzy Logical Model of Perception (FLMP) assumes that processing occurs in three successive but overlapping stages: evaluation, integration, and decision (Massaro, 1998). According to nonintegration models, any perceptual experience and action results from only a single sensory influence. These models are tested in expanded factorial designs, in which two input modalities are varied independently of one another and each modality is also presented alone. Results from a variety of experiments on speech, emotion, and gesture support the predictions of the FLMP. Baldi, an embodied conversational agent, is described, and implications for applications of multimodal interfaces are discussed.
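The three FLMP stages described in the abstract can be illustrated with a minimal sketch. In the FLMP, evaluation assigns each sensory source a fuzzy truth value (degree of support in [0, 1]) for each response alternative, integration multiplies the supports, and decision normalizes across alternatives (the relative goodness rule). The support values below are illustrative, not data from the paper.

```python
def flmp_probabilities(auditory, visual):
    """Predict response probabilities from per-alternative support values.

    auditory, visual: lists of fuzzy truth values, one per response
    alternative. Integration multiplies the two sources; decision
    normalizes the products across alternatives.
    """
    combined = [a * v for a, v in zip(auditory, visual)]
    total = sum(combined)
    return [c / total for c in combined]

# Hypothetical /ba/ vs. /da/ trial: the audio weakly favors /ba/,
# the face strongly favors /ba/.
probs = flmp_probabilities(auditory=[0.6, 0.4], visual=[0.9, 0.1])
print(probs)
```

Because the sources are multiplied before normalization, a strong source dominates an ambiguous one, which is the signature pattern the expanded factorial experiments test against single-source (nonintegration) predictions.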

References

[1]
Anastasio, T. J., & Patton, P. E. (2004). Analysis and modeling of multisensory enhancement in the deep superior colliculus. In G. Calvert, C. Spence & B. E. Stein (Eds.), Handbook of Multisensory Processes (pp. 265-283). Cambridge, MA: MIT Press.
[2]
Andre, E. (2004). Lessons Learned from Evaluating Animated Presentation Agents. Workshop on Evaluating Embodied Conversational Agents, Schloß Dagstuhl, Germany.
[3]
Bauckhage, C., Fritsch, J., Rohlfing, K., Wachsmuth, S. & Sagerer, G. (2002). Evaluating Integrated Speech- and Image Understanding. Proceedings of the 4th IEEE international conference on Multimodal interfaces (pp. 9--14). Pittsburgh, Pennsylvania. October 14--16.
[4]
Bosseler, A. & Massaro, D.W. (2003). Development and Evaluation of a Computer-Animated Tutor for Vocabulary and Language Learning for Children with Autism. Journal of Autism and Developmental Disorders, 33, 653--672.
[5]
Campbell, C. S.; Schwarzer, G.; Massaro, D. W. (2001). Face perception: An information processing perspective. In M.J. Wenger, & J.T. Townsend (Eds.), Computational, geometric, and process perspectives on facial cognition: Contexts and challenges (pp. 285--345). Lawrence Erlbaum Associates, Inc., Publishers: Mahwah, NJ.
[6]
Chai, J., Pan, S., Zhou, M. & Houck, K. (2002). Context-Based Multimodal Input Understanding in Conversational Systems. Proceedings of the 4th IEEE international conference on Multimodal interfaces (pp. 87--92). Pittsburgh, Pennsylvania. October 14--16.
[7]
Cohen, M.M., Beskow, J. & Massaro, D.W. (1998). Recent developments in facial animation: An inside view. AVSP '98 (Dec 4-6, 1998, Sydney, Australia). http://mambo.ucsc.edu/psl/avsp98/11.doc
[8]
Cohen, M.M., Massaro, D.W. & Clark, R. (2002). Training a talking head. In Proceedings of ICMI'02, IEEE Fourth International Conference on Multimodal Interfaces. October 14--16, Pittsburgh, Pennsylvania.
[9]
Corradini, A., Wesson, R. & Cohen, P. (2003). A Map-Based System Using Speech and 3D Gestures for Pervasive Computing. Proceedings of the 4th IEEE international conference on Multimodal interfaces (pp. 191--196). Pittsburgh, Pennsylvania. October 14--16.
[10]
de Gelder, B. & Vroomen, J. (2000). Perceiving Emotions by Ear and by Eye. Cognition & Emotion, 14, 289--311.
[11]
Erber, N. P. (1972). Auditory, visual, and auditory-visual recognition of consonants by children with normal and impaired hearing. Journal of Speech and Hearing Research, 15, 413--422.
[12]
Harrison, M. & Thimbleby, H. (1990). Formal methods in human-computer interaction. Cambridge, UK: Cambridge University Press.
[13]
Horvitz, E., Kadie, C.M., Paek, T. & Hovel, D. (2003). Models of Attention in Computing and Communications: From Principles to Applications. Communications of the ACM, 46(3), 52--59.
[14]
Jesse, A., Vrignaud, N. & Massaro, D.W. (2000/01). The processing of information from multiple sources in simultaneous interpreting. Interpreting, 5, 95--115.
[15]
Lederman, S. J. & Klatzky, R. L. (2004). Multisensory texture perception. In G. Calvert, C. Spence & B. E. Stein (Eds.), Handbook of Multisensory Processes. (pp. 107--122). Cambridge, MA: MIT Press.
[16]
Lewkowicz, D. J. & Kraebel, K. S. (2004). The value of multisensory redundancy in the development of intersensory perception. In G. Calvert, C. Spence & B. E. Stein (Eds.), Handbook of Multisensory Processes (pp. 655--678). Cambridge, MA: MIT Press.
[17]
Massaro, D.W. (1984). Children's perception of visual and auditory speech. Child Development, 55, 1777--1788.
[18]
Massaro, D.W. (1987). Speech perception by ear and eye: A Paradigm for psychological inquiry. Hillsdale, NJ: Erlbaum.
[19]
Massaro, D.W. (1988). Ambiguity in perception and experimentation. Journal of Experimental Psychology: General, 117, 417--421.
[20]
Massaro, D.W. (1989). Testing between the TRACE model and the Fuzzy Logical Model of speech perception. Cognitive Psychology 21, 398--421.
[21]
Massaro, D.W. (1998). Perceiving talking faces: From speech perception to a behavioral principle. Cambridge, MA: MIT Press.
[22]
Massaro, D.W. (1999). From theory to practice: Rewards and challenges. In Proceedings of the International Conference of Phonetic Sciences (pp. 1289--1292). San Francisco, CA.
[23]
Massaro, D.W. (2000). From "Speech is Special" to Talking Heads in Language Learning. In Proceedings of Integrating speech technology in the (language) learning and assistive interface, (InSTIL 2000) (pp.153--161). University of Abertay Dundee, Scotland.
[24]
Massaro, D.W. (2002). Multimodal Speech Perception: A Paradigm for Speech Science. In B. Granstrom, D. House & I. Karlsson (Eds.), Multimodality in language and speech systems (pp. 45--71). The Netherlands: Kluwer Academic Publishers.
[25]
Massaro, D.W. (2003). A computer-animated tutor for spoken and written language learning. Proceedings of the 5th international conference on Multimodal interfaces (pp. 172--175). Vancouver, British Columbia, Canada.
[26]
Massaro, D.W. & Bosseler, A. (2003). Perceiving Speech by Ear and Eye: Multimodal Integration by Children with Autism. Journal of Developmental and Learning Disorders, 7, 111--144.
[27]
Massaro, D.W. & Cohen, M.M. (1993). Perceiving Asynchronous Bimodal Speech in Consonant-Vowel and Vowel Syllables. Speech Communication, 13, 127--134.
[28]
Massaro, D.W. & Cohen, M.M. (1999). Speech perception in hearing-impaired perceivers: Synergy of multiple modalities. Journal of Speech, Language & Hearing Science, 42, 21--41.
[29]
Massaro, D.W. & Cohen, M.M. (2000). Fuzzy logical model of bimodal emotion perception: Comment on "The perception of emotions by ear and by eye" by de Gelder and Vroomen. Cognition and Emotion, 14(3), 313--320.
[30]
Massaro, D.W., Cohen, M.M., Tabain, M., Beskow, J. & Clark, R. (in press). Animated speech: Research progress and applications. In E. Vatikiotis-Bateson, G. Bailly & P. Perrier (Eds.), Audiovisual Speech Processing. Cambridge, MA: MIT Press.
[31]
Massaro, D.W. & Friedman, D. (1990). Models of integration given multiple sources of information. Psychological Review, 97(2), 225--252.
[32]
Massaro, D.W. & Light, J. (2003). Read My Tongue Movements: Bimodal Learning To Perceive And Produce Non-Native Speech /r/ and /l/. In Proceedings of Eurospeech '03-Switzerland (Interspeech). 8th European Conference on Speech Communication and Technology. Geneva, Switzerland.
[33]
Massaro, D.W. & Light, J. (in press). Using Visible Speech for Training Perception and Production of Speech for Hard of Hearing Individuals. Volta Review.
[34]
Massaro, D.W. & Stork, D. G. (1998). Sensory integration and speechreading by humans and machines. American Scientist, 86, 236--244.
[35]
McNeill, D. (1985). So you think gestures are nonverbal? Psychological Review, 92, 350--371.
[36]
Mesulam, M.M. (1998). From sensation to cognition. Brain, 121, 1013--1052.
[37]
Moore, M. & Calvert, S. (2000). Brief Report: Vocabulary acquisition for children with autism: Teacher or computer instruction. Journal of Autism and Developmental Disorders, 30, 359--362.
[38]
Movellan, J. R. & McClelland, J. L. (2001). The Morton-Massaro law of information integration: Implications for models of perception. Psychological Review, 108, 113--148.
[39]
Munhall, K., & Vatikiotis-Bateson, E. (2004). Spatial and Temporal Constraints on Audiovisual Speech Perception. In G. Calvert, C. Spence & B. E. Stein (Eds.), Handbook of Multisensory Processes (pp. 177--188). Cambridge, MA: MIT Press.
[40]
Ouni, S., Massaro, D.W., Cohen, M.M. & Young, K. (2003). Internationalization of a talking head. 15th International Congress of Phonetic Sciences. Barcelona, Spain.
[41]
Oviatt, S., Coulston, R., Tomko, S., Xiao, B., Lunsford, R., Wesson, M. & Carmichael, L. (2003). Toward a theory of organized multimodal integration patterns during human-computer interaction. Proceedings of the 5th international conference on Multimodal interfaces (pp. 44--51). Vancouver, British Columbia, Canada.
[42]
Pashler, H. E. (1998). The psychology of attention. Cambridge, MA: MIT Press.
[43]
Potamianos, G., Neti, C., Gravier, G. & Garg, A. (2003). Automatic recognition of audio-visual speech: Recent progress and challenges. Proceedings of the IEEE, 91(9), 1306--1326.
[44]
Stein, B. E., & Meredith, M. A. (1993). The merging of the senses. Cambridge, MA: MIT Press.
[45]
Thompson, L.A. & Massaro, D.W. (1994). Children's Integration of Speech and Pointing Gestures in Comprehension. Journal of Experimental Child Psychology, 57, 327--354.
[46]
Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338--353.


    Published In

    ICMI '04: Proceedings of the 6th international conference on Multimodal interfaces
    October 2004
    368 pages
    ISBN:1581139950
    DOI:10.1145/1027933
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. emotion
    2. gesture
    3. multisensory integration
    4. speech


    Acceptance Rates

Overall Acceptance Rate: 453 of 1,080 submissions, 42%

Cited By
• (2024) Beyond Text and Speech in Conversational Agents: Mapping the Design Space of Avatars. Proceedings of the 2024 ACM Designing Interactive Systems Conference, pp. 1875--1894. DOI: 10.1145/3643834.3661563. Online publication date: 1-Jul-2024.
• (2023) Enhancing Conversational Troubleshooting with Multi-modality: Design and Implementation. Chatbot Research and Design, pp. 103--117. DOI: 10.1007/978-3-031-25581-6_7. Online publication date: 2-Feb-2023.
• (2022) MCTK: a Multi-modal Conversational Troubleshooting Kit for supporting users in web applications. Proceedings of the 2022 International Conference on Advanced Visual Interfaces, pp. 1--3. DOI: 10.1145/3531073.3534480. Online publication date: 6-Jun-2022.
• (2012) Supporting Usability Evaluation of Multimodal Man-Machine Interfaces for Space Ground Segment Applications Using Petri nets Based Formal Specification. SpaceOps 2006 Conference. DOI: 10.2514/6.2006-5657. Online publication date: 18-Jun-2012.
• (2011) Goal orientated conversational agents. Proceedings of the 5th KES international conference on Agent and multi-agent systems: technologies and applications, pp. 16--25. DOI: 10.5555/2023144.2023149. Online publication date: 29-Jun-2011.
• (2011) Corrected Human Vision System and the McGurk Effect. HCI International 2011 – Posters' Extended Abstracts, pp. 345--349. DOI: 10.1007/978-3-642-22095-1_70. Online publication date: 2011.
• (2011) Goal Orientated Conversational Agents: Applications to Benefit Society. Agent and Multi-Agent Systems: Technologies and Applications, pp. 16--25. DOI: 10.1007/978-3-642-22000-5_3. Online publication date: 2011.
• (2009) Perceiving emotion: towards a realistic understanding of the task. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1535), 3515--3525. DOI: 10.1098/rstb.2009.0139. Online publication date: 12-Dec-2009.
• (2009) Towards Computational Modelling of Neural Multimodal Integration Based on the Superior Colliculus Concept. Innovations in Neural Information Paradigms and Applications, pp. 269--291. DOI: 10.1007/978-3-642-04003-0_11. Online publication date: 2009.
