ABSTRACT
Temporal action detection is an important yet challenging problem, since videos in real applications are usually long, untrimmed, and contain multiple action instances. The task requires not only recognizing action categories but also detecting the start and end time of each action instance. Many state-of-the-art methods adopt a "detection by classification" framework: first generate proposals, then classify them. The main drawback of this framework is that the boundaries of the action instance proposals are already fixed by the time the classification step runs. To address this issue, we propose a novel Single Shot Action Detector (SSAD) network based on 1D temporal convolutional layers, which skips the proposal generation step and directly detects action instances in untrimmed video. Because no existing model can be directly adopted for this task, we empirically search for the SSAD network architecture that works best for temporal action detection. Moreover, we investigate input feature types and fusion strategies to further improve detection accuracy. We conduct extensive experiments on two challenging datasets: THUMOS 2014 and MEXaction2. With the Intersection-over-Union threshold set to 0.5 during evaluation, SSAD significantly outperforms other state-of-the-art systems, increasing mAP from 19.0% to 24.6% on THUMOS 2014 and from 7.4% to 11.0% on MEXaction2.
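The evaluation described above matches a predicted temporal segment to a ground-truth instance when their temporal Intersection-over-Union exceeds a threshold (0.5 here). A minimal sketch of that overlap measure, with segments represented as hypothetical `(start, end)` tuples in seconds (the function name and representation are illustrative, not from the paper's released code):

```python
def temporal_iou(pred, gt):
    """Temporal Intersection-over-Union of two segments given as (start, end).

    Intersection is the overlap length (clamped at zero when the segments
    are disjoint); union is the total covered length.
    """
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction spanning [2s, 6s] against a ground truth of [4s, 8s]:
# overlap is 2s, union is 6s, so IoU = 1/3 -- below the 0.5 threshold,
# this detection would count as a false positive.
```

Under this protocol, mAP is then computed over the detections ranked by confidence, exactly as in object detection but with 1D temporal segments in place of 2D boxes.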
Index Terms
- Single Shot Temporal Action Detection