DOI: 10.1145/3240508.3240576
Research Article

Mining Semantics-Preserving Attention for Group Activity Recognition

Published: 15 October 2018

ABSTRACT

In this paper, we propose a Semantics-Preserving Teacher-Student (SPTS) model for group activity recognition in videos, which mines semantics-preserving attention to automatically identify the key people and discard the misleading ones. Conventional methods usually aggregate the features extracted from individual persons by pooling operations, which cannot fully exploit the contextual information needed for group activity recognition. To address this, our SPTS networks first learn a Teacher Network in the semantic domain, which classifies the group activity word from the words describing individual actions. We then carefully design a Student Network in the vision domain, which recognizes the group activity from the input videos, and enforce the Student Network to mimic the Teacher Network during learning. In this way, we allocate semantics-preserving attention to different people, which adequately exploits the contextual information among them and requires no extra labelled data. Experimental results on two widely used benchmarks for group activity recognition clearly show the superior performance of our method in comparison with the state of the art.
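The abstract describes the teacher-student design only at a high level. Below is a minimal PyTorch-style sketch of how such a pair could be wired, assuming the teacher reads per-person action words, the student reads per-person visual features, and the student is trained to mimic both the teacher's predictions and its attention over people. All names (TeacherNet, StudentNet, spts_loss) and the specific loss terms are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TeacherNet(nn.Module):
    """Semantic-domain teacher: predicts the group-activity word from individual action words."""

    def __init__(self, vocab_size, embed_dim, num_activities):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = nn.Linear(embed_dim, 1)              # per-person attention score
        self.classifier = nn.Linear(embed_dim, num_activities)

    def forward(self, action_words):                     # (batch, num_people) word indices
        e = self.embed(action_words)                     # (batch, num_people, embed_dim)
        a = F.softmax(self.attn(e).squeeze(-1), dim=-1)  # (batch, num_people) attention weights
        pooled = torch.einsum('bp,bpd->bd', a, e)        # attention-weighted aggregation
        return self.classifier(pooled), a


class StudentNet(nn.Module):
    """Vision-domain student: predicts the group activity from per-person visual features."""

    def __init__(self, feat_dim, num_activities):
        super().__init__()
        self.attn = nn.Linear(feat_dim, 1)
        self.classifier = nn.Linear(feat_dim, num_activities)

    def forward(self, person_feats):                     # (batch, num_people, feat_dim)
        a = F.softmax(self.attn(person_feats).squeeze(-1), dim=-1)
        pooled = torch.einsum('bp,bpd->bd', a, person_feats)
        return self.classifier(pooled), a


def spts_loss(student_logits, student_attn, teacher_logits, teacher_attn, labels, alpha=0.5):
    """Classification loss plus mimicry terms that transfer the teacher's attention to the student."""
    cls = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1), reduction='batchmean')
    attn_mimic = F.mse_loss(student_attn, teacher_attn)  # semantics-preserving attention transfer
    return cls + alpha * (kd + attn_mimic)
```

In this sketch the teacher would be trained first on the ground-truth action words; its attention over people then serves as the semantics-preserving target toward which the student's visual attention is pulled, so no extra annotation beyond the existing action and activity labels is needed.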


Published in
MM '18: Proceedings of the 26th ACM International Conference on Multimedia
October 2018, 2167 pages
ISBN: 9781450356657
DOI: 10.1145/3240508
      Copyright © 2018 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States


Acceptance Rates
MM '18 Paper Acceptance Rate: 209 of 757 submissions, 28%
Overall Acceptance Rate: 995 of 4,171 submissions, 24%

