ABSTRACT
This paper presents a novel motion localization approach for recognizing actions and events in real videos, such as StandUp and Kiss in Hollywood movies. The challenge stems from the large visual and motion variations introduced by realistic action poses. Previous work mainly learns from descriptors of cuboids around space-time interest points (STIPs) to characterize actions. The size, shape, and space-time position of these cuboids are fixed without considering the underlying motion dynamics, which often yields a large set of fragmented cuboids that fail to capture the long-term dynamic properties of realistic actions. This paper proposes detecting spatio-temporal motion volumes (namely Volumes of Interest, VOIs) that are adaptive in scale and position to localize actions. First, motions are described as bags of point trajectories obtained by tracking keypoints along the time dimension. VOIs are then adaptively extracted by clustering trajectories on the motion manifold. The resulting VOIs, of varying scales and centered at arbitrary positions depending on the motion dynamics, are finally described by SIFT and 3D gradient features for action recognition. Compared with fixed-size cuboids, VOIs allow comprehensive modeling of long-term motion and better capture the contextual information associated with motion dynamics. Experiments on a realistic Hollywood movie dataset show that the proposed approach achieves a 20\% relative improvement over the state-of-the-art STIP-based algorithm.
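The core extraction step, clustering keypoint trajectories and taking each cluster's space-time extent as a VOI, can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes trajectories of a fixed length represented as flattened (x, y) feature vectors, and uses a plain flat-kernel mean-shift (the paper cites mean-shift mode seeking) in place of clustering on the actual motion manifold. The function names and the bandwidth parameter are hypothetical choices for the sketch.

```python
import numpy as np


def mean_shift(X, bandwidth, n_iter=50):
    """Flat-kernel mean-shift mode seeking on the row vectors of X."""
    modes = X.copy()
    for _ in range(n_iter):
        for i, m in enumerate(modes):
            d = np.linalg.norm(X - m, axis=1)
            modes[i] = X[d <= bandwidth].mean(axis=0)
    return modes


def cluster_trajectories(trajs, bandwidth):
    """trajs: (N, T, 2) array of N keypoint trajectories over T frames.

    Flattens each trajectory to a feature vector, runs mean-shift, and
    merges modes that converged to (nearly) the same point. Returns one
    cluster label per trajectory.
    """
    X = trajs.reshape(len(trajs), -1).astype(float)
    modes = mean_shift(X, bandwidth)
    labels = -np.ones(len(trajs), dtype=int)
    centers = []
    for i, m in enumerate(modes):
        for k, c in enumerate(centers):
            if np.linalg.norm(m - c) < bandwidth / 2:
                labels[i] = k
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return labels


def voi_from_cluster(trajs, labels, k):
    """Axis-aligned spatial bounds (x_min, y_min, x_max, y_max) of the
    trajectories in cluster k; the temporal extent is their frame span."""
    pts = trajs[labels == k].reshape(-1, 2)
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    return (lo[0], lo[1], hi[0], hi[1])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T = 10
    # two synthetic motion groups: one drifting right, one drifting down
    a = np.cumsum(rng.normal([1.0, 0.0], 0.1, (20, T, 2)), axis=1)
    b = 50 + np.cumsum(rng.normal([0.0, 1.0], 0.1, (20, T, 2)), axis=1)
    trajs = np.concatenate([a, b])
    labels = cluster_trajectories(trajs, bandwidth=20.0)
    print(sorted(set(labels.tolist())))      # two motion groups recovered
    print(voi_from_cluster(trajs, labels, 0))
```

Because each VOI is the bounding volume of a whole trajectory cluster, its scale and position follow the motion rather than a fixed grid, which is the property the abstract contrasts with fixed-size STIP cuboids.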