DOI: 10.1145/3123266.3123343
research-article

Single Shot Temporal Action Detection

Published: 19 October 2017

ABSTRACT

Temporal action detection is an important yet challenging problem, since videos in real applications are usually long, untrimmed, and contain multiple action instances. The task requires not only recognizing action categories but also detecting the start and end time of each action instance. Many state-of-the-art methods adopt the "detection by classification" framework: first generate proposals, then classify them. The main drawback of this framework is that the temporal boundaries of proposals are already fixed before the classification step. To address this issue, we propose a novel Single Shot Action Detector (SSAD) network based on 1D temporal convolutional layers, which skips the proposal generation step and directly detects action instances in untrimmed video. Because no existing model can be directly adopted for temporal action detection, we empirically search for the SSAD network architecture that works best for this task. Moreover, we investigate input feature types and fusion strategies to further improve detection accuracy. We conduct extensive experiments on two challenging datasets: THUMOS 2014 and MEXaction2. With the Intersection-over-Union threshold set to 0.5 during evaluation, SSAD significantly outperforms other state-of-the-art systems, improving mAP from 19.0% to 24.6% on THUMOS 2014 and from 7.4% to 11.0% on MEXaction2.
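To make the single-shot formulation concrete, the sketch below shows (a) a 1D temporal convolutional prediction layer that scores every temporal position of a snippet-level feature sequence, and (b) the temporal Intersection-over-Union used by the mAP evaluation mentioned above. This is a minimal PyTorch illustration under assumed shapes and an assumed output parameterization (per-position class scores, an overlap score, and two boundary offsets); it is not the paper's exact SSAD architecture.

```python
# Minimal sketch (illustrative, not the paper's exact SSAD design):
# a 1D temporal convolutional head over a snippet-level feature sequence.
import torch
import torch.nn as nn

class TemporalDetectionHead(nn.Module):
    def __init__(self, in_channels=512, num_classes=20):
        super().__init__()
        # Each temporal position predicts: num_classes class scores,
        # 1 overlap/confidence score, and 2 boundary offsets.
        self.conv = nn.Conv1d(in_channels, num_classes + 3,
                              kernel_size=3, padding=1)

    def forward(self, feats):
        # feats: (batch, in_channels, T) snippet-level feature sequence
        out = self.conv(feats)                  # (batch, num_classes + 3, T)
        cls_scores = out[:, :-3, :]             # per-class scores
        overlap = torch.sigmoid(out[:, -3, :])  # predicted overlap with ground truth
        offsets = out[:, -2:, :]                # (center, width) offsets
        return cls_scores, overlap, offsets

def temporal_iou(seg_a, seg_b):
    """Temporal IoU of two (start, end) segments in seconds."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0
```

At inference time, positions with high class and overlap scores would be decoded into candidate segments and filtered with temporal non-maximum suppression before computing mAP at the 0.5 IoU threshold quoted in the abstract.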


Published in

MM '17: Proceedings of the 25th ACM International Conference on Multimedia
October 2017
2028 pages
ISBN: 9781450349062
DOI: 10.1145/3123266

Copyright © 2017 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

      Acceptance Rates

MM '17 paper acceptance rate: 189 of 684 submissions, 28%. Overall acceptance rate: 995 of 4,171 submissions, 24%.
