DOI: 10.1145/1027933.1027964
Article

Implementation and evaluation of a constraint-based multimodal fusion system for speech and 3D pointing gestures

Published: 13 October 2004

Abstract

This paper presents an architecture for the fusion of multimodal input streams for natural interaction with a humanoid robot, together with results from a user study of our system. The fusion architecture consists of an application-independent parser of input events and application-specific rules. In the user study, participants interacted with the robot in a kitchen scenario using speech and gesture input. We observed that our fusion approach is highly tolerant of falsely detected pointing gestures, because speech serves as the main modality and pointing gestures are used mainly to disambiguate objects. We also report on the temporal correlation of speech and gesture events as observed in the study.
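The abstract describes a speech-driven fusion strategy: an application-independent parser of input events combined with application-specific rules, in which pointing gestures serve only to disambiguate object references in the utterance. The sketch below illustrates that idea under our own assumptions; it is not the authors' implementation (their system is constraint-based, whereas the sketch substitutes a plain temporal-window heuristic), and all identifiers (SpeechEvent, PointingEvent, fuse, FUSION_WINDOW) are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical event types; field names are illustrative, not from the paper.
@dataclass
class SpeechEvent:
    text: str          # recognized utterance, e.g. "bring me that cup"
    deictic: bool      # True if the utterance contains a deictic reference
    t_start: float     # utterance start time in seconds
    t_end: float       # utterance end time in seconds

@dataclass
class PointingEvent:
    target: str        # id of the object the 3D pointing direction selects
    t: float           # time of the gesture in seconds
    confidence: float  # detector confidence in [0, 1]

# Assumed temporal window around the utterance, in seconds.
FUSION_WINDOW = 1.5

def fuse(speech: SpeechEvent, gestures: list[PointingEvent]) -> dict:
    """Speech-driven fusion: gestures are consulted only to resolve a
    deictic reference, so spurious detections are otherwise ignored."""
    interpretation = {"utterance": speech.text, "object": None}
    if not speech.deictic:
        # Speech alone determines the interpretation; a falsely detected
        # pointing gesture cannot corrupt it.
        return interpretation
    # Consider only gestures that fall within the temporal window.
    candidates = [
        g for g in gestures
        if speech.t_start - FUSION_WINDOW <= g.t <= speech.t_end + FUSION_WINDOW
    ]
    if candidates:
        best = max(candidates, key=lambda g: g.confidence)
        interpretation["object"] = best.target
    return interpretation
```

For example, fuse(SpeechEvent("bring me that cup", True, 2.0, 3.4), [PointingEvent("cup_1", 2.8, 0.9)]) would resolve the reference to cup_1, while a non-deictic utterance never consults the gesture stream at all, which mirrors the tolerance to false pointing detections reported in the study.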




    Published In

    ICMI '04: Proceedings of the 6th international conference on Multimodal interfaces
    October 2004
    368 pages
    ISBN: 1581139950
    DOI: 10.1145/1027933

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. gesture
    2. multimodal architectures
    3. multimodal fusion and multisensory integration
    4. natural language
    5. speech
    6. vision


    Acceptance Rates

    Overall Acceptance Rate 453 of 1,080 submissions, 42%

    Cited By

    • (2023) A Parallel Multimodal Integration Framework and Application for Cake Shopping. Applied Sciences 14(1), 299. DOI: 10.3390/app14010299. Online publication date: 29-Dec-2023.
    • (2023) MFIRA: Multimodal Fusion Intent Recognition Algorithm for AR Chemistry Experiments. Applied Sciences 13(14), 8200. DOI: 10.3390/app13148200. Online publication date: 14-Jul-2023.
    • (2023) Multimodal Error Correction with Natural Language and Pointing Gestures. 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 1968-1978. DOI: 10.1109/ICCVW60793.2023.00212. Online publication date: 2-Oct-2023.
    • (2023) Interactive Multimodal Robot Dialog Using Pointing Gesture Recognition. Computer Vision – ECCV 2022 Workshops, 640-657. DOI: 10.1007/978-3-031-25075-0_43. Online publication date: 19-Feb-2023.
    • (2021) An intention understanding algorithm based on multimodal fusion. Second IYSF Academic Symposium on Artificial Intelligence and Computer Engineering, 82. DOI: 10.1117/12.2623101. Online publication date: 1-Dec-2021.
    • (2020) Research on Multimodal Perceptual Navigational Virtual and Real Fusion Intelligent Experiment Equipment and Algorithm. IEEE Access 8, 43375-43390. DOI: 10.1109/ACCESS.2020.2978089. Online publication date: 2020.
    • (2019) Introducing NarRob, a Robotic Storyteller. Digital Forensics and Watermarking, 387-396. DOI: 10.1007/978-3-030-11548-7_36. Online publication date: 22-Jan-2019.
    • (2018) Semantic Fusion for Natural Multimodal Interfaces using Concurrent Augmented Transition Networks. Multimodal Technologies and Interaction 2(4), 81. DOI: 10.3390/mti2040081. Online publication date: 6-Dec-2018.
    • (2018) Up to the Finger Tip. Proceedings of the 2018 Annual Symposium on Computer-Human Interaction in Play, 477-488. DOI: 10.1145/3242671.3242675. Online publication date: 23-Oct-2018.
    • (2018) A robust user interface for IoT using context-aware Bayesian fusion. 2018 IEEE 15th International Conference on Wearable and Implantable Body Sensor Networks (BSN), 126-131. DOI: 10.1109/BSN.2018.8329675. Online publication date: Mar-2018.
