
Grounding spatial prepositions for video search

Published: 02 November 2009

Abstract

Spatial language video retrieval is an important real-world problem that forms a test bed for evaluating semantic structures for natural language descriptions of motion on naturalistic data. Video search by natural language query requires that linguistic input be converted into structures that operate on video in order to find clips that match a query. This paper describes a framework for grounding the meaning of spatial prepositions in video. We present a library of features that can be used to automatically classify a video clip based on whether it matches a natural language query. To evaluate these features, we collected a corpus of natural language descriptions about the motion of people in video clips. We characterize the language used in the corpus, and use it to train and test models for the meanings of the spatial prepositions "to," "across," "through," "out," "along," "towards," and "around." The classifiers can be used to build a spatial language video retrieval system that finds clips matching queries such as "across the kitchen."
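
As a rough illustration of how track-based features and preposition classifiers could be wired together, the sketch below scores a single person track against an "across the kitchen"-style query. It is a hypothetical example, not the paper's feature library or trained models: the region box, the feature names, and the 0.7 directness threshold are invented for illustration.

```python
# Hypothetical sketch only: a feature-and-threshold check of whether a person
# track matches an "across <region>" query. The region box, feature names, and
# the 0.7 threshold are invented for illustration; the paper trains classifiers
# over its own feature library rather than using hand-set rules.
import math


def track_features(track, region):
    """Compute simple geometric features of a track relative to a ground region.

    track  : ordered list of (x, y) positions of the tracked person.
    region : (cx, cy, half_w, half_h) axis-aligned box standing in for the
             ground object named in the query (e.g. "the kitchen").
    """
    cx, cy, hw, hh = region
    inside = [abs(x - cx) <= hw and abs(y - cy) <= hh for x, y in track]
    start, end = track[0], track[-1]
    path_len = sum(math.hypot(x2 - x1, y2 - y1)
                   for (x1, y1), (x2, y2) in zip(track, track[1:]))
    displacement = math.hypot(end[0] - start[0], end[1] - start[1])
    return {
        # fraction of frames spent inside the ground region
        "frac_inside": sum(inside) / len(inside),
        # figure starts outside, passes through the region, and ends outside
        "enters_and_exits": (not inside[0]) and (not inside[-1]) and any(inside),
        # how direct the motion is (1.0 = perfectly straight path)
        "directness": displacement / max(path_len, 1e-9),
    }


def matches_across(track, region, directness_threshold=0.7):
    """Toy decision rule for "across": cross the region along a fairly direct path."""
    f = track_features(track, region)
    return f["enters_and_exits"] and f["directness"] > directness_threshold


if __name__ == "__main__":
    kitchen = (5.0, 5.0, 2.0, 2.0)                        # made-up ground region
    walk = [(1.0, 5.0), (3.0, 5.0), (5.0, 5.0), (7.0, 5.0), (9.0, 5.0)]
    print(matches_across(walk, kitchen))                  # True: a straight pass through
```

In the setting the paper describes, the track would come from a person tracker run over the video, the ground region would be derived from the query's noun phrase, and the decision would be made by a classifier trained on the annotated corpus rather than by a hand-set rule.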




    Published In

    ICMI-MLMI '09: Proceedings of the 2009 international conference on Multimodal interfaces
    November 2009
    374 pages
    ISBN:9781605587721
    DOI:10.1145/1647314
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 02 November 2009


    Author Tags

    1. spatial language
    2. video retrieval

    Qualifiers

    • Research-article

    Conference

    ICMI-MLMI '09

    Acceptance Rates

    Overall Acceptance Rate 453 of 1,080 submissions, 42%


    Cited By

    • (2022) Disambiguating spatial prepositions: The case of geo-spatial sense detection. Transactions in GIS, 26(6), 2621-2650. DOI: 10.1111/tgis.12976. Online publication date: 6-Sep-2022.
    • (2022) A comprehensive review of task understanding of command-triggered execution of tasks for service robots. Artificial Intelligence Review, 56(7), 7137-7193. DOI: 10.1007/s10462-022-10347-6. Online publication date: 6-Dec-2022.
    • (2020) Robots That Use Language. Annual Review of Control, Robotics, and Autonomous Systems, 3(1), 25-55. DOI: 10.1146/annurev-control-101119-071628. Online publication date: 3-May-2020.
    • (2018) Revisiting active perception. Autonomous Robots, 42(2), 177-196. DOI: 10.1007/s10514-017-9615-3. Online publication date: 1-Feb-2018.
    • (2010) A game-theoretic approach to generating spatial descriptions. Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 410-419. DOI: 10.5555/1870658.1870698. Online publication date: 9-Oct-2010.
    • (2010) Toward understanding natural language directions. Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction, 259-266. DOI: 10.5555/1734454.1734553. Online publication date: 2-Mar-2010.
    • (2010) Natural language command of an autonomous micro-air vehicle. 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2663-2669. DOI: 10.1109/IROS.2010.5650910. Online publication date: Oct-2010.
    • (2010) Toward understanding natural language directions. 2010 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 259-266. DOI: 10.1109/HRI.2010.5453186. Online publication date: Mar-2010.
