
Grounding spatial prepositions for video search

Published: 02 November 2009

Abstract

Spatial language video retrieval is an important real-world problem that forms a test bed for evaluating semantic structures for natural language descriptions of motion on naturalistic data. Video search by natural language query requires that linguistic input be converted into structures that operate on video in order to find clips that match a query. This paper describes a framework for grounding the meaning of spatial prepositions in video. We present a library of features that can be used to automatically classify a video clip based on whether it matches a natural language query. To evaluate these features, we collected a corpus of natural language descriptions about the motion of people in video clips. We characterize the language used in the corpus, and use it to train and test models for the meanings of the spatial prepositions "to," "across," "through," "out," "along," "towards," and "around." The classifiers can be used to build a spatial language video retrieval system that finds clips matching queries such as "across the kitchen."
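
As a rough illustration of how track-based features and preposition classifiers could be wired together, the sketch below scores a single person track against an "across the kitchen"-style query. It is a hypothetical example, not the paper's feature library or trained models: the region box, the feature names, and the 0.7 directness threshold are invented for illustration.

```python
# Hypothetical sketch only: a feature-and-threshold check of whether a person
# track matches an "across <region>" query. The region box, feature names, and
# the 0.7 threshold are invented for illustration; the paper trains classifiers
# over its own feature library rather than using hand-set rules.
import math


def track_features(track, region):
    """Compute simple geometric features of a track relative to a ground region.

    track  : ordered list of (x, y) positions of the tracked person.
    region : (cx, cy, half_w, half_h) axis-aligned box standing in for the
             ground object named in the query (e.g. "the kitchen").
    """
    cx, cy, hw, hh = region
    inside = [abs(x - cx) <= hw and abs(y - cy) <= hh for x, y in track]
    start, end = track[0], track[-1]
    path_len = sum(math.hypot(x2 - x1, y2 - y1)
                   for (x1, y1), (x2, y2) in zip(track, track[1:]))
    displacement = math.hypot(end[0] - start[0], end[1] - start[1])
    return {
        # fraction of frames spent inside the ground region
        "frac_inside": sum(inside) / len(inside),
        # figure starts outside, passes through the region, and ends outside
        "enters_and_exits": (not inside[0]) and (not inside[-1]) and any(inside),
        # how direct the motion is (1.0 = perfectly straight path)
        "directness": displacement / max(path_len, 1e-9),
    }


def matches_across(track, region, directness_threshold=0.7):
    """Toy decision rule for "across": cross the region along a fairly direct path."""
    f = track_features(track, region)
    return f["enters_and_exits"] and f["directness"] > directness_threshold


if __name__ == "__main__":
    kitchen = (5.0, 5.0, 2.0, 2.0)                        # made-up ground region
    walk = [(1.0, 5.0), (3.0, 5.0), (5.0, 5.0), (7.0, 5.0), (9.0, 5.0)]
    print(matches_across(walk, kitchen))                  # True: a straight pass through
```

In the setting the paper describes, the track would come from a person tracker run over the video, the ground region would be derived from the query's noun phrase, and the decision would be made by a classifier trained on the annotated corpus rather than by a hand-set rule.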




    Published In

    ICMI-MLMI '09: Proceedings of the 2009 international conference on Multimodal interfaces
    November 2009
    374 pages
    ISBN:9781605587721
    DOI:10.1145/1647314
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 02 November 2009


    Author Tags

    1. spatial language
    2. video retrieval

    Qualifiers

    • Research-article

    Conference

    ICMI-MLMI '09

    Acceptance Rates

    Overall Acceptance Rate 453 of 1,080 submissions, 42%


    Cited By

    • (2022) Disambiguating spatial prepositions: The case of geo-spatial sense detection. Transactions in GIS, 26(6), 2621-2650. DOI: 10.1111/tgis.12976. Online publication date: 6-Sep-2022.
    • (2022) A comprehensive review of task understanding of command-triggered execution of tasks for service robots. Artificial Intelligence Review, 56(7), 7137-7193. DOI: 10.1007/s10462-022-10347-6. Online publication date: 6-Dec-2022.
    • (2020) Robots That Use Language. Annual Review of Control, Robotics, and Autonomous Systems, 3(1), 25-55. DOI: 10.1146/annurev-control-101119-071628. Online publication date: 3-May-2020.
    • (2018) Revisiting active perception. Autonomous Robots, 42(2), 177-196. DOI: 10.1007/s10514-017-9615-3. Online publication date: 1-Feb-2018.
    • (2010) A game-theoretic approach to generating spatial descriptions. Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 410-419. DOI: 10.5555/1870658.1870698. Online publication date: 9-Oct-2010.
    • (2010) Toward understanding natural language directions. Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction, 259-266. DOI: 10.5555/1734454.1734553. Online publication date: 2-Mar-2010.
    • (2010) Natural language command of an autonomous micro-air vehicle. 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2663-2669. DOI: 10.1109/IROS.2010.5650910. Online publication date: Oct-2010.
    • (2010) Toward understanding natural language directions. 2010 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 259-266. DOI: 10.1109/HRI.2010.5453186. Online publication date: Mar-2010.
