DOI: 10.1145/1027933.1027964
Article

Implementation and evaluation of a constraint-based multimodal fusion system for speech and 3D pointing gestures

Published: 13 October 2004

Abstract

This paper presents an architecture for the fusion of multimodal input streams for natural interaction with a humanoid robot, together with results from a user study of our system. The fusion architecture consists of an application-independent parser of input events and application-specific rules. In the user study, participants interacted with the robot in a kitchen scenario using speech and gesture input. We observed that our fusion approach is highly tolerant of falsely detected pointing gestures, because speech serves as the main modality and pointing gestures are used mainly to disambiguate objects. We also report on the temporal correlation of speech and gesture events as observed in the study.
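The abstract describes a speech-driven fusion strategy: an application-independent parser of input events combined with application-specific rules, in which pointing gestures serve only to disambiguate object references in the utterance. The sketch below illustrates that idea under our own assumptions; it is not the authors' implementation (their system is constraint-based, whereas the sketch substitutes a plain temporal-window heuristic), and all identifiers (SpeechEvent, PointingEvent, fuse, FUSION_WINDOW) are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical event types; field names are illustrative, not from the paper.
@dataclass
class SpeechEvent:
    text: str          # recognized utterance, e.g. "bring me that cup"
    deictic: bool      # True if the utterance contains a deictic reference
    t_start: float     # utterance start time in seconds
    t_end: float       # utterance end time in seconds

@dataclass
class PointingEvent:
    target: str        # id of the object the 3D pointing direction selects
    t: float           # time of the gesture in seconds
    confidence: float  # detector confidence in [0, 1]

# Assumed temporal window around the utterance, in seconds.
FUSION_WINDOW = 1.5

def fuse(speech: SpeechEvent, gestures: list[PointingEvent]) -> dict:
    """Speech-driven fusion: gestures are consulted only to resolve a
    deictic reference, so spurious detections are otherwise ignored."""
    interpretation = {"utterance": speech.text, "object": None}
    if not speech.deictic:
        # Speech alone determines the interpretation; a falsely detected
        # pointing gesture cannot corrupt it.
        return interpretation
    # Consider only gestures that fall within the temporal window.
    candidates = [
        g for g in gestures
        if speech.t_start - FUSION_WINDOW <= g.t <= speech.t_end + FUSION_WINDOW
    ]
    if candidates:
        best = max(candidates, key=lambda g: g.confidence)
        interpretation["object"] = best.target
    return interpretation
```

For example, fuse(SpeechEvent("bring me that cup", True, 2.0, 3.4), [PointingEvent("cup_1", 2.8, 0.9)]) would resolve the reference to cup_1, while a non-deictic utterance never consults the gesture stream at all, which mirrors the tolerance to false pointing detections reported in the study.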




    Published In

    ICMI '04: Proceedings of the 6th international conference on Multimodal interfaces
    October 2004
    368 pages
    ISBN: 1581139950
    DOI: 10.1145/1027933

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. gesture
    2. multimodal architectures
    3. multimodal fusion and multisensory integration
    4. natural language
    5. speech
    6. vision


    Acceptance Rates

    Overall Acceptance Rate 453 of 1,080 submissions, 42%

    Cited By

    • (2023) A Parallel Multimodal Integration Framework and Application for Cake Shopping. Applied Sciences 14(1), 299. DOI: 10.3390/app14010299. Online publication date: 29-Dec-2023.
    • (2023) MFIRA: Multimodal Fusion Intent Recognition Algorithm for AR Chemistry Experiments. Applied Sciences 13(14), 8200. DOI: 10.3390/app13148200. Online publication date: 14-Jul-2023.
    • (2023) Multimodal Error Correction with Natural Language and Pointing Gestures. 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 1968-1978. DOI: 10.1109/ICCVW60793.2023.00212. Online publication date: 2-Oct-2023.
    • (2023) Interactive Multimodal Robot Dialog Using Pointing Gesture Recognition. Computer Vision – ECCV 2022 Workshops, 640-657. DOI: 10.1007/978-3-031-25075-0_43. Online publication date: 19-Feb-2023.
    • (2021) An intention understanding algorithm based on multimodal fusion. Second IYSF Academic Symposium on Artificial Intelligence and Computer Engineering, 82. DOI: 10.1117/12.2623101. Online publication date: 1-Dec-2021.
    • (2020) Research on Multimodal Perceptual Navigational Virtual and Real Fusion Intelligent Experiment Equipment and Algorithm. IEEE Access 8, 43375-43390. DOI: 10.1109/ACCESS.2020.2978089. Online publication date: 2020.
    • (2019) Introducing NarRob, a Robotic Storyteller. Digital Forensics and Watermarking, 387-396. DOI: 10.1007/978-3-030-11548-7_36. Online publication date: 22-Jan-2019.
    • (2018) Semantic Fusion for Natural Multimodal Interfaces using Concurrent Augmented Transition Networks. Multimodal Technologies and Interaction 2(4), 81. DOI: 10.3390/mti2040081. Online publication date: 6-Dec-2018.
    • (2018) Up to the Finger Tip. Proceedings of the 2018 Annual Symposium on Computer-Human Interaction in Play, 477-488. DOI: 10.1145/3242671.3242675. Online publication date: 23-Oct-2018.
    • (2018) A robust user interface for IoT using context-aware Bayesian fusion. 2018 IEEE 15th International Conference on Wearable and Implantable Body Sensor Networks (BSN), 126-131. DOI: 10.1109/BSN.2018.8329675. Online publication date: Mar-2018.
