ABSTRACT
Deep neural networks have recently achieved competitive accuracy for human activity recognition. However, there is still room for improvement, especially in modeling long-term temporal importance and in determining how relevant different temporal segments of a video are to the activity. To address this problem, we propose a learnable and differentiable module: Deep Adaptive Temporal Pooling (DATP). DATP applies a self-attention mechanism to adaptively pool the classification scores of different video segments. Specifically, using frame-level features, DATP regresses the importance of each temporal segment and generates a corresponding weight. Remarkably, DATP is trained using only the video-level activity label; no additional supervision is required. We conduct extensive experiments to investigate various input features and different weight models. The experimental results show that DATP learns to assign large weights to key video segments. More importantly, DATP improves the training of the frame-level feature extractor, because relevant temporal segments receive large weights during back-propagation. Overall, we achieve state-of-the-art performance on the UCF101, HMDB51, and Kinetics datasets.
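The core mechanism the abstract describes (regressing per-segment importance from frame-level features, then pooling segment-level classification scores by those weights) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `adaptive_temporal_pool`, the single linear projection `w` standing in for the learned weight model, and the use of NumPy are all assumptions for exposition.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def adaptive_temporal_pool(segment_scores, segment_features, w):
    """Attention-weighted pooling of per-segment classification scores.

    segment_scores:   (T, C) class scores for each of T video segments
    segment_features: (T, D) frame-level features for each segment
    w:                (D,)   projection regressing segment importance
                      (a hypothetical stand-in for the learned weight model)
    Returns a (C,) video-level score vector.
    """
    importance = segment_features @ w   # (T,) raw importance per segment
    alpha = softmax(importance)         # attention weights, summing to 1
    return alpha @ segment_scores       # weighted average of segment scores
```

Because the pooling is a differentiable weighted average, gradients of the video-level loss flow back through `alpha` into both the weight model and the frame-level features, which is what allows training from the video-level label alone.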
- Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems. 2654--2662.
- Moez Baccouche, Franck Mamalet, Christian Wolf, Christophe Garcia, and Atilla Baskurt. 2011. Sequential deep learning for human action recognition. In International Workshop on Human Behavior Understanding. Springer, 29--39.
- Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 4724--4733.
- Anoop Cherian, Suvrit Sra, Stephen Gould, and Richard Hartley. 2018. Non-Linear Temporal Subspace Representations for Activity Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2197--2206.
- Navneet Dalal, Bill Triggs, and Cordelia Schmid. 2006. Human detection using oriented histograms of flow and appearance. In European Conference on Computer Vision. Springer, 428--441.
- Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 248--255.
- Ali Diba, Vivek Sharma, and Luc Van Gool. 2017. Deep temporal linear encoding networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2625--2634.
- Ionut Cosmin Duta, Bogdan Ionescu, Kiyoharu Aizawa, and Nicu Sebe. 2017. Spatio-Temporal Vector of Locally Max Pooled Features for Action Recognition in Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Christoph Feichtenhofer, Axel Pinz, and Richard Wildes. 2016a. Spatiotemporal Residual Networks for Video Action Recognition. In Advances in Neural Information Processing Systems. 3468--3476.
- Christoph Feichtenhofer, Axel Pinz, and Richard P Wildes. 2017. Spatiotemporal multiplier networks for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. 2016b. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1933--1941.
- Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, and Bryan Russell. 2017. ActionVLAD: Learning spatio-temporal aggregation for action classification. arXiv preprint arXiv:1704.02895 (2017).
- Ross Girshick. 2015. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision. 1440--1448.
- Yiluan Guo and Ngai-Man Cheung. 2018. Efficient and Deep Person Re-Identification Using Multi-Level Similarity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2335--2344.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770--778.
- Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
- Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. 2015. Spatial transformer networks. In Advances in Neural Information Processing Systems. 2017--2025.
- Amlan Kar, Nishant Rai, Karan Sikka, and Gaurav Sharma. 2016. AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos. arXiv preprint arXiv:1611.08240 (2016).
- Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1725--1732.
- Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. 2017. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017).
- Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. 2011. HMDB: a large video database for human motion recognition. In 2011 IEEE International Conference on Computer Vision (ICCV). IEEE, 2556--2563.
- Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld. 2008. Learning realistic human actions from movies. In 2008 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 1--8.
- Multimedia-Laboratory-CUHK. 2016. TSN Pretrained Models on Kinetics Dataset. http://yjxiong.me/others/kinetics_action/. (2016).
- Wenjie Pei, Tadas Baltrušaitis, David MJ Tax, and Louis-Philippe Morency. 2017. Temporal Attention-Gated Model for Robust Sequence Classification. (2017).
- Zhaofan Qiu, Ting Yao, and Tao Mei. 2017. Learning spatio-temporal representation with pseudo-3D residual networks. In 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 5534--5542.
- Jorge Sánchez, Florent Perronnin, Thomas Mensink, and Jakob Verbeek. 2013. Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision 105, 3 (2013), 222--245.
- Paul Scovanner, Saad Ali, and Mubarak Shah. 2007. A 3-dimensional SIFT descriptor and its application to action recognition. In Proceedings of the 15th ACM International Conference on Multimedia. ACM, 357--360.
- Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems. 568--576.
- Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012).
- Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 4489--4497.
- Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. 2018. A Closer Look at Spatiotemporal Convolutions for Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6450--6459.
- Gul Varol, Ivan Laptev, and Cordelia Schmid. 2017. Long-term temporal convolutions for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 6000--6010.
- Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2015. Sequence to sequence - video to text. In Proceedings of the IEEE International Conference on Computer Vision. 4534--4542.
- Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. 2013. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision 103, 1 (2013), 60--79.
- Heng Wang and Cordelia Schmid. 2013. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision. 3551--3558.
- Limin Wang, Wei Li, Wen Li, and Luc Van Gool. 2017a. Appearance-and-Relation Networks for Video Classification. arXiv preprint arXiv:1711.09125 (2017).
- Limin Wang, Yu Qiao, and Xiaoou Tang. 2015. Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4305--4314.
- Limin Wang, Yuanjun Xiong, Dahua Lin, and Luc Van Gool. 2017c. UntrimmedNets for weakly supervised action recognition and detection. In Proc. CVPR.
- Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2016c. Temporal segment networks: towards good practices for deep action recognition. In European Conference on Computer Vision. Springer, 20--36.
- Xiaolong Wang, Ali Farhadi, and Abhinav Gupta. 2016a. Actions transformations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2658--2667.
- Yunbo Wang, Mingsheng Long, Jianmin Wang, and Philip S Yu. 2017b. Spatiotemporal Pyramid Network for Video Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Yilin Wang, Suhang Wang, Jiliang Tang, Neil O'Hare, Yi Chang, and Baoxin Li. 2016b. Hierarchical attention network for action recognition in videos. arXiv preprint arXiv:1607.06416 (2016).
- Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. 2015. Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision. 4507--4515.
- Christopher Zach, Thomas Pock, and Horst Bischof. 2007. A duality based approach for realtime TV-L1 optical flow. In Joint Pattern Recognition Symposium. Springer, 214--223.
- Yizhou Zhou, Xiaoyan Sun, Dong Liu, Zhengjun Zha, and Wenjun Zeng. 2017. Adaptive Pooling in Multi-Instance Learning for Web Video Annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 318--327.
- Yizhou Zhou, Xiaoyan Sun, Zheng-Jun Zha, and Wenjun Zeng. 2018. MiCT: Mixed 3D/2D Convolutional Tube for Human Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 449--458.
- Wangjiang Zhu, Jie Hu, Gang Sun, Xudong Cao, and Yu Qiao. 2016. A key volume mining deep framework for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1991--1999.
Index Terms
- Deep Adaptive Temporal Pooling for Activity Recognition