
Salience modeling based on non-verbal modalities for spoken language understanding

Published: 02 November 2006

Abstract

Previous studies have shown that, in multimodal conversational systems, fusing information from multiple modalities can improve overall input interpretation through mutual disambiguation. Inspired by these findings, this paper investigates the use of non-verbal modalities, in particular deictic gesture, in spoken language processing. Our assumption is that during multimodal conversation, a user's deictic gestures on the graphical display signal the part of the domain model that is salient at that point in the interaction. This salient domain model can then be used to constrain hypotheses during spoken language processing. Based on this assumption, the paper examines different configurations of salience-driven language models (e.g., n-gram and probabilistic context-free grammar) applied at different stages of spoken language processing. Our empirical results demonstrate the potential of integrating salience models based on non-verbal modalities into spoken language understanding.
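
As a rough illustration of the idea described in the abstract (not the paper's actual implementation, which is not reproduced on this page), the sketch below shows one plausible way a gesture-derived salience model could rescore an n-gram language model: words associated with the on-screen objects a user has just pointed at receive extra probability mass via linear interpolation. All names, the data layout, and the interpolation weight are assumptions made for illustration.

from collections import defaultdict

LAMBDA = 0.7  # interpolation weight; an assumed value, not taken from the paper

def salience_distribution(gestured_ids, domain_model):
    """Turn a deictic gesture into a unigram distribution over domain words.

    domain_model maps each on-screen object id to the words that describe it;
    objects the user just pointed at make their associated words salient.
    """
    counts = defaultdict(float)
    for obj_id in gestured_ids:
        for word in domain_model.get(obj_id, []):
            counts[word] += 1.0
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

def rescore(p_ngram, word, salience):
    """Salience-driven rescoring by linear interpolation:
    P'(w | h) = lambda * P_ngram(w | h) + (1 - lambda) * P_salience(w)."""
    return LAMBDA * p_ngram + (1.0 - LAMBDA) * salience.get(word, 0.0)

# Example: the user points at object "house_12" while saying "this house";
# the base n-gram probability of "house" is boosted by the salience model.
domain_model = {"house_12": ["house", "brick", "colonial"]}
salience = salience_distribution(["house_12"], domain_model)
print(rescore(0.01, "house", salience))

The same interpolated distribution could, in principle, be applied at different stages, e.g., to rescore recognizer n-best lists or to reweight grammar rules, which mirrors the configurations the abstract says the paper compares.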




Published In

ICMI '06: Proceedings of the 8th International Conference on Multimodal Interfaces
November 2006
404 pages
ISBN: 159593541X
DOI: 10.1145/1180995


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. language modeling
  2. multimodal interfaces
  3. salience modeling
  4. spoken language understanding

Qualifiers

  • Article

Conference

ICMI '06

Acceptance Rates

Overall Acceptance Rate: 453 of 1,080 submissions, 42%


Cited By

  • Empowering Adolescent Emergent Readers in Government Schools: An Exploration of Multimodal Texts as Pathways to Comprehension. Literacy Research and Instruction (2024), 1-20. DOI: 10.1080/19388071.2024.2357101
  • Improvement of multimodal gesture and speech recognition performance using time intervals between gestures and accompanying speech. EURASIP Journal on Audio, Speech, and Music Processing, 2014:1 (2014). DOI: 10.1186/1687-4722-2014-2
  • Latent Semantic Analysis for Multimodal User Input With Speech and Gestures. IEEE/ACM Transactions on Audio, Speech and Language Processing, 22(2):417-429 (2014). DOI: 10.1109/TASLP.2013.2294586
  • Utilizing gestures to improve sentence boundary detection. Multimedia Tools and Applications, 51(3):1035-1067 (2011). DOI: 10.1007/s11042-009-0436-z
  • Usage patterns and latent semantic analyses for task goal inference of multimodal user interactions. Proceedings of the 15th International Conference on Intelligent User Interfaces, 129-138 (2010). DOI: 10.1145/1719970.1719989
  • The role of interactivity in human-machine conversation for automatic word acquisition. Proceedings of the SIGDIAL 2009 Conference, 188-195 (2009). DOI: 10.5555/1708376.1708404
  • Cross-Modality Semantic Integration With Hypothesis Rescoring for Robust Interpretation of Multimodal User Interactions. IEEE Transactions on Audio, Speech, and Language Processing, 17(3):486-500 (2009). DOI: 10.1109/TASL.2008.2011509
  • Knowledge and data flow architecture for reference processing in multimodal dialog systems. Proceedings of the 10th International Conference on Multimodal Interfaces, 105-108 (2008). DOI: 10.1145/1452392.1452413
  • An integrative recognition method for speech and gestures. Proceedings of the 10th International Conference on Multimodal Interfaces, 93-96 (2008). DOI: 10.1145/1452392.1452411
  • Beyond attention. Proceedings of the 13th International Conference on Intelligent User Interfaces, 237-246 (2008). DOI: 10.1145/1378773.1378805
