ABSTRACT
In this paper, we propose a Semantics-Preserving Teacher-Student (SPTS) model for group activity recognition in videos, which mines semantics-preserving attention to automatically attend to the key people and suppress the misleading ones. Conventional methods usually aggregate the features extracted from individual persons by pooling operations, which cannot fully explore the contextual information for group activity recognition. To address this, our SPTS network first learns a Teacher Network in the semantic domain, which predicts the group activity label from the words of the individual actions. We then design a Student Network in the vision domain, which recognizes the group activity from the input videos, and enforce it to mimic the Teacher Network during training. In this way, semantics-preserving attention is allocated to different people, which adequately exploits the contextual information of different people and requires no extra labelled data. Experimental results on two widely used benchmarks for group activity recognition clearly show the superior performance of our method in comparison with the state-of-the-art.
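To make the teacher-student idea concrete, the following is a minimal sketch of one plausible instantiation. It assumes per-person action labels feed a semantic teacher, per-person visual features feed a vision student, and the student mimics both the teacher's predictions and its attention weights; the module names, dimensions, loss terms, and weights (lam, mu) are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TeacherNet(nn.Module):
    """Semantic-domain teacher: predicts the group activity from word
    embeddings of the individual action labels, with attention over people."""
    def __init__(self, embed_dim, num_actions, num_activities):
        super().__init__()
        self.embed = nn.Embedding(num_actions, embed_dim)
        self.att = nn.Linear(embed_dim, 1)           # per-person attention score
        self.cls = nn.Linear(embed_dim, num_activities)

    def forward(self, action_ids):                   # (B, N) individual action labels
        x = self.embed(action_ids)                   # (B, N, D)
        a = torch.softmax(self.att(x).squeeze(-1), dim=1)   # (B, N) attention
        pooled = (a.unsqueeze(-1) * x).sum(dim=1)    # attention-weighted pooling
        return self.cls(pooled), a

class StudentNet(nn.Module):
    """Vision-domain student: predicts the group activity from per-person
    visual features (e.g. CNN crops) and is trained to mimic the teacher."""
    def __init__(self, feat_dim, num_activities):
        super().__init__()
        self.att = nn.Linear(feat_dim, 1)
        self.cls = nn.Linear(feat_dim, num_activities)

    def forward(self, person_feats):                 # (B, N, F) per-person features
        a = torch.softmax(self.att(person_feats).squeeze(-1), dim=1)
        pooled = (a.unsqueeze(-1) * person_feats).sum(dim=1)
        return self.cls(pooled), a

def spts_loss(s_logits, s_att, t_logits, t_att, labels, lam=1.0, mu=1.0):
    """Cross-entropy on ground truth plus mimicry terms toward the (frozen)
    teacher: KL over activity predictions and MSE over attention weights."""
    ce = F.cross_entropy(s_logits, labels)
    kd = F.kl_div(F.log_softmax(s_logits, dim=1), F.softmax(t_logits, dim=1),
                  reduction="batchmean")
    att = F.mse_loss(s_att, t_att)
    return ce + lam * kd + mu * att
```

In this reading, the teacher is trained first on the individual-action words, then frozen; the student's attention over people is shaped by the mimicry terms, so no extra supervision beyond the existing action and activity labels is required.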