DOI: 10.1145/3240508.3240713

Deep Adaptive Temporal Pooling for Activity Recognition

Published: 15 October 2018

ABSTRACT

Deep neural networks have recently achieved competitive accuracy for human activity recognition. However, there is room for improvement, especially in modeling long-term temporal importance and in determining the activity relevance of different temporal segments in a video. To address this problem, we propose a learnable and differentiable module: Deep Adaptive Temporal Pooling (DATP). DATP applies a self-attention mechanism to adaptively pool the classification scores of different video segments. Specifically, using frame-level features, DATP regresses the importance of different temporal segments and generates weights for them. Remarkably, DATP is trained using only the video-level label; no supervision beyond the video-level activity class label is required. We conduct extensive experiments to investigate various input features and different weight models. Experimental results show that DATP can learn to assign large weights to key video segments. More importantly, DATP can improve the training of the frame-level feature extractor, because relevant temporal segments receive large weights during back-propagation. Overall, we achieve state-of-the-art performance on the UCF101, HMDB51 and Kinetics datasets.
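
The abstract describes DATP as a module that regresses per-segment importance from frame-level features and uses the resulting weights to pool per-segment classification scores into a video-level prediction, supervised only by the video-level label. Below is a minimal, illustrative PyTorch sketch of such an adaptive temporal pooling layer. The module name, the single linear layer used as the weight model, and the softmax normalization are assumptions made for illustration; they are not the paper's exact architecture.

    # Minimal sketch of a DATP-style adaptive temporal pooling layer (PyTorch).
    # Assumption: a linear regressor over segment features produces importance
    # scores, which are softmax-normalized across segments and used to pool
    # per-segment classification scores into one video-level prediction.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AdaptiveTemporalPooling(nn.Module):
        def __init__(self, feat_dim: int):
            super().__init__()
            # Regress an unnormalized importance score per temporal segment
            # from its frame-level feature vector (hypothetical weight model).
            self.importance = nn.Linear(feat_dim, 1)

        def forward(self, seg_feats: torch.Tensor, seg_scores: torch.Tensor) -> torch.Tensor:
            # seg_feats:  (B, T, D) frame/segment-level features
            # seg_scores: (B, T, C) per-segment classification scores
            # returns:    (B, C)    video-level scores (weighted pool over segments)
            alpha = F.softmax(self.importance(seg_feats).squeeze(-1), dim=1)  # (B, T)
            return torch.einsum('bt,btc->bc', alpha, seg_scores)

    # Usage: only the video-level label supervises the whole pipeline.
    if __name__ == "__main__":
        B, T, D, C = 2, 7, 256, 101
        pool = AdaptiveTemporalPooling(D)
        feats, scores = torch.randn(B, T, D), torch.randn(B, T, C, requires_grad=True)
        labels = torch.randint(0, C, (B,))
        loss = F.cross_entropy(pool(feats, scores), labels)
        loss.backward()  # gradients reach both the weight model and the segment scores

Because the pooling weights multiply the per-segment scores, back-propagation scales the gradient of each segment by its learned importance, which is one way to read the abstract's claim that DATP improves training of the frame-level feature extractor.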


    • Published in

      MM '18: Proceedings of the 26th ACM international conference on Multimedia
      October 2018
      2167 pages
      ISBN:9781450356657
      DOI:10.1145/3240508

      Copyright © 2018 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Qualifiers

      • research-article

      Acceptance Rates

      MM '18 Paper Acceptance Rate: 209 of 757 submissions (28%). Overall Acceptance Rate: 995 of 4,171 submissions (24%).

