ABSTRACT
Deep neural networks have recently achieved competitive accuracy for human activity recognition. However, there is still room for improvement, especially in modeling long-term temporal importance and in determining how relevant different temporal segments of a video are to the activity. To address this problem, we propose a learnable and differentiable module: Deep Adaptive Temporal Pooling (DATP). DATP applies a self-attention mechanism to adaptively pool the classification scores of different video segments. Specifically, using frame-level features, DATP regresses the importance of each temporal segment and generates a corresponding weight. Remarkably, DATP is trained using only the video-level activity label; no additional supervision is required. We conduct extensive experiments to investigate various input features and different weight models. The experimental results show that DATP learns to assign large weights to key video segments. More importantly, DATP improves the training of the frame-level feature extractor, because relevant temporal segments receive large weights during back-propagation. Overall, we achieve state-of-the-art performance on the UCF101, HMDB51, and Kinetics datasets.
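The core mechanism the abstract describes (regressing per-segment importance from frame-level features, then pooling segment-level classification scores by those weights) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `adaptive_temporal_pool`, the single linear projection `w` standing in for the learned weight model, and the use of NumPy are all assumptions for exposition.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def adaptive_temporal_pool(segment_scores, segment_features, w):
    """Attention-weighted pooling of per-segment classification scores.

    segment_scores:   (T, C) class scores for each of T video segments
    segment_features: (T, D) frame-level features for each segment
    w:                (D,)   projection regressing segment importance
                      (a hypothetical stand-in for the learned weight model)
    Returns a (C,) video-level score vector.
    """
    importance = segment_features @ w   # (T,) raw importance per segment
    alpha = softmax(importance)         # attention weights, summing to 1
    return alpha @ segment_scores       # weighted average of segment scores
```

Because the pooling is a differentiable weighted average, gradients of the video-level loss flow back through `alpha` into both the weight model and the frame-level features, which is what allows training from the video-level label alone.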
- Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems. 2654--2662.
- Moez Baccouche, Franck Mamalet, Christian Wolf, Christophe Garcia, and Atilla Baskurt. 2011. Sequential deep learning for human action recognition. In International Workshop on Human Behavior Understanding. Springer, 29--39.
- Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 4724--4733.
- Anoop Cherian, Suvrit Sra, Stephen Gould, and Richard Hartley. 2018. Non-Linear Temporal Subspace Representations for Activity Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2197--2206.
- Navneet Dalal, Bill Triggs, and Cordelia Schmid. 2006. Human detection using oriented histograms of flow and appearance. In European Conference on Computer Vision. Springer, 428--441.
- Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 248--255.
- Ali Diba, Vivek Sharma, and Luc Van Gool. 2017. Deep temporal linear encoding networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2625--2634.
- Ionut Cosmin Duta, Bogdan Ionescu, Kiyoharu Aizawa, and Nicu Sebe. 2017. Spatio-Temporal Vector of Locally Max Pooled Features for Action Recognition in Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Christoph Feichtenhofer, Axel Pinz, and Richard Wildes. 2016a. Spatiotemporal Residual Networks for Video Action Recognition. In Advances in Neural Information Processing Systems. 3468--3476.
- Christoph Feichtenhofer, Axel Pinz, and Richard P Wildes. 2017. Spatiotemporal multiplier networks for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. 2016b. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1933--1941.
- Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, and Bryan Russell. 2017. ActionVLAD: Learning spatio-temporal aggregation for action classification. arXiv preprint arXiv:1704.02895 (2017).
- Ross Girshick. 2015. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision. 1440--1448.
- Yiluan Guo and Ngai-Man Cheung. 2018. Efficient and Deep Person Re-Identification Using Multi-Level Similarity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2335--2344.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770--778.
- Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
- Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. 2015. Spatial transformer networks. In Advances in Neural Information Processing Systems. 2017--2025.
- Amlan Kar, Nishant Rai, Karan Sikka, and Gaurav Sharma. 2016. AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos. arXiv preprint arXiv:1611.08240 (2016).
- Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1725--1732.
- Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. 2017. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017).
- Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. 2011. HMDB: a large video database for human motion recognition. In 2011 IEEE International Conference on Computer Vision (ICCV). IEEE, 2556--2563.
- Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld. 2008. Learning realistic human actions from movies. In 2008 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 1--8.
- Multimedia-Laboratory-CUHK. 2016. TSN Pretrained Models on Kinetics Dataset. http://yjxiong.me/others/kinetics_action/. (2016).
- Wenjie Pei, Tadas Baltrušaitis, David MJ Tax, and Louis-Philippe Morency. 2017. Temporal Attention-Gated Model for Robust Sequence Classification. (2017).
- Zhaofan Qiu, Ting Yao, and Tao Mei. 2017. Learning spatio-temporal representation with pseudo-3D residual networks. In 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 5534--5542.
- Jorge Sánchez, Florent Perronnin, Thomas Mensink, and Jakob Verbeek. 2013. Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision 105, 3 (2013), 222--245.
- Paul Scovanner, Saad Ali, and Mubarak Shah. 2007. A 3-dimensional SIFT descriptor and its application to action recognition. In Proceedings of the 15th ACM International Conference on Multimedia. ACM, 357--360.
- Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems. 568--576.
- Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012).
- Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 4489--4497.
- Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. 2018. A Closer Look at Spatiotemporal Convolutions for Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6450--6459.
- Gul Varol, Ivan Laptev, and Cordelia Schmid. 2017. Long-term temporal convolutions for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 6000--6010.
- Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2015. Sequence to sequence - video to text. In Proceedings of the IEEE International Conference on Computer Vision. 4534--4542.
- Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. 2013. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision 103, 1 (2013), 60--79.
- Heng Wang and Cordelia Schmid. 2013. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision. 3551--3558.
- Limin Wang, Wei Li, Wen Li, and Luc Van Gool. 2017a. Appearance-and-Relation Networks for Video Classification. arXiv preprint arXiv:1711.09125 (2017).
- Limin Wang, Yu Qiao, and Xiaoou Tang. 2015. Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4305--4314.
- Limin Wang, Yuanjun Xiong, Dahua Lin, and Luc Van Gool. 2017c. UntrimmedNets for weakly supervised action recognition and detection. In Proc. CVPR.
- Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2016c. Temporal segment networks: towards good practices for deep action recognition. In European Conference on Computer Vision. Springer, 20--36.
- Xiaolong Wang, Ali Farhadi, and Abhinav Gupta. 2016a. Actions transformations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2658--2667.
- Yunbo Wang, Mingsheng Long, Jianmin Wang, and Philip S Yu. 2017b. Spatiotemporal Pyramid Network for Video Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Yilin Wang, Suhang Wang, Jiliang Tang, Neil O'Hare, Yi Chang, and Baoxin Li. 2016b. Hierarchical attention network for action recognition in videos. arXiv preprint arXiv:1607.06416 (2016).
- Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. 2015. Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision. 4507--4515.
- Christopher Zach, Thomas Pock, and Horst Bischof. 2007. A duality based approach for realtime TV-L1 optical flow. In Joint Pattern Recognition Symposium. Springer, 214--223.
- Yizhou Zhou, Xiaoyan Sun, Dong Liu, Zhengjun Zha, and Wenjun Zeng. 2017. Adaptive Pooling in Multi-Instance Learning for Web Video Annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 318--327.
- Yizhou Zhou, Xiaoyan Sun, Zheng-Jun Zha, and Wenjun Zeng. 2018. MiCT: Mixed 3D/2D Convolutional Tube for Human Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 449--458.
- Wangjiang Zhu, Jie Hu, Gang Sun, Xudong Cao, and Yu Qiao. 2016. A key volume mining deep framework for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1991--1999.
Index Terms
- Deep Adaptive Temporal Pooling for Activity Recognition