ABSTRACT
A number of recent studies have demonstrated that groups benefit considerably from access to shared visual information. This benefit is due, in part, to the communicative efficiencies provided by the shared visual context. However, a large gap exists between our current theoretical understanding and our existing computational models. We address this gap by developing a computational model that integrates linguistic cues with visual cues in a way that effectively models reference during tightly coupled, task-oriented interactions. The results demonstrate that the integrated model significantly outperforms existing language-only and visual-only models. The findings can be used to inform and augment the development of conversational agents, applications that dynamically track discourse and collaborative interactions, and dialogue managers for natural language interfaces.