ABSTRACT
Temporal action detection is an important yet challenging problem, since videos in real applications are usually long, untrimmed, and contain multiple action instances. The task requires not only recognizing action categories but also detecting the start and end time of each action instance. Many state-of-the-art methods adopt a "detection by classification" framework: first generate proposals, then classify them. The main drawback of this framework is that the boundaries of the action instance proposals are already fixed by the time the classification step runs. To address this issue, we propose a novel Single Shot Action Detector (SSAD) network based on 1D temporal convolutional layers, which skips the proposal generation step and directly detects action instances in untrimmed video. Because no existing model can be directly adopted for this task, we empirically search for the SSAD network architecture that works best for temporal action detection. Moreover, we investigate input feature types and fusion strategies to further improve detection accuracy. We conduct extensive experiments on two challenging datasets: THUMOS 2014 and MEXaction2. With the Intersection-over-Union threshold set to 0.5 during evaluation, SSAD significantly outperforms other state-of-the-art systems, increasing mAP from 19.0% to 24.6% on THUMOS 2014 and from 7.4% to 11.0% on MEXaction2.
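The evaluation described above matches a predicted temporal segment to a ground-truth instance when their temporal Intersection-over-Union exceeds a threshold (0.5 here). A minimal sketch of that overlap measure, with segments represented as hypothetical `(start, end)` tuples in seconds (the function name and representation are illustrative, not from the paper's released code):

```python
def temporal_iou(pred, gt):
    """Temporal Intersection-over-Union of two segments given as (start, end).

    Intersection is the overlap length (clamped at zero when the segments
    are disjoint); union is the total covered length.
    """
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction spanning [2s, 6s] against a ground truth of [4s, 8s]:
# overlap is 2s, union is 6s, so IoU = 1/3 -- below the 0.5 threshold,
# this detection would count as a false positive.
```

Under this protocol, mAP is then computed over the detections ranked by confidence, exactly as in object detection but with 1D temporal segments in place of 2D boxes.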
Index Terms
- Single Shot Temporal Action Detection