ABSTRACT
In this paper, we propose a Semantics-Preserving Teacher-Student (SPTS) model for group activity recognition in videos, which mines semantics-preserving attention to automatically attend to the key people and suppress the misleading ones. Conventional methods usually aggregate the features extracted from individual persons by pooling operations, which cannot fully explore the contextual information for group activity recognition. To address this, our SPTS network first learns a Teacher Network in the semantic domain, which predicts the group activity label from the words of the individual actions. We then design a Student Network in the vision domain, which recognizes the group activity from the input videos, and enforce it to mimic the Teacher Network during training. In this way, semantics-preserving attention is allocated to different people, which adequately exploits the contextual information of different people and requires no extra labelled data. Experimental results on two widely used benchmarks for group activity recognition clearly show the superior performance of our method in comparison with the state-of-the-art.
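To make the teacher-student idea concrete, the following is a minimal sketch of one plausible instantiation. It assumes per-person action labels feed a semantic teacher, per-person visual features feed a vision student, and the student mimics both the teacher's predictions and its attention weights; the module names, dimensions, loss terms, and weights (lam, mu) are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TeacherNet(nn.Module):
    """Semantic-domain teacher: predicts the group activity from word
    embeddings of the individual action labels, with attention over people."""
    def __init__(self, embed_dim, num_actions, num_activities):
        super().__init__()
        self.embed = nn.Embedding(num_actions, embed_dim)
        self.att = nn.Linear(embed_dim, 1)           # per-person attention score
        self.cls = nn.Linear(embed_dim, num_activities)

    def forward(self, action_ids):                   # (B, N) individual action labels
        x = self.embed(action_ids)                   # (B, N, D)
        a = torch.softmax(self.att(x).squeeze(-1), dim=1)   # (B, N) attention
        pooled = (a.unsqueeze(-1) * x).sum(dim=1)    # attention-weighted pooling
        return self.cls(pooled), a

class StudentNet(nn.Module):
    """Vision-domain student: predicts the group activity from per-person
    visual features (e.g. CNN crops) and is trained to mimic the teacher."""
    def __init__(self, feat_dim, num_activities):
        super().__init__()
        self.att = nn.Linear(feat_dim, 1)
        self.cls = nn.Linear(feat_dim, num_activities)

    def forward(self, person_feats):                 # (B, N, F) per-person features
        a = torch.softmax(self.att(person_feats).squeeze(-1), dim=1)
        pooled = (a.unsqueeze(-1) * person_feats).sum(dim=1)
        return self.cls(pooled), a

def spts_loss(s_logits, s_att, t_logits, t_att, labels, lam=1.0, mu=1.0):
    """Cross-entropy on ground truth plus mimicry terms toward the (frozen)
    teacher: KL over activity predictions and MSE over attention weights."""
    ce = F.cross_entropy(s_logits, labels)
    kd = F.kl_div(F.log_softmax(s_logits, dim=1), F.softmax(t_logits, dim=1),
                  reduction="batchmean")
    att = F.mse_loss(s_att, t_att)
    return ce + lam * kd + mu * att
```

In this reading, the teacher is trained first on the individual-action words, then frozen; the student's attention over people is shaped by the mimicry terms, so no extra supervision beyond the existing action and activity labels is required.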