Abstract
In this article, we focus on isolated gesture recognition and explore multiple modalities, namely the RGB, depth, and saliency streams. Our goal is to push the boundary of this field even further by proposing a unified framework that exploits the advantages of multi-modality fusion. Specifically, a spatial-temporal network architecture based on consensus voting is proposed to explicitly model the long-term structure of the video sequence and to reduce estimation variance in the face of substantial inter-class variations. In addition, a three-dimensional depth-saliency convolutional network is aggregated in parallel to capture subtle motion characteristics. Extensive experiments are conducted to analyze the performance of each component, and our proposed approach achieves the best results on two public benchmarks, ChaLearn IsoGD and RGBD-HuDaAct, outperforming the closest competitors by margins of over 10% and 15%, respectively. Our project and code will be released at https://davidsonic.github.io/index/acm_tomm_2017.html.
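To make the two ideas summarized above concrete, the sketch below illustrates consensus voting over temporally sampled segments and late fusion of per-modality scores. It is a minimal illustration under assumed details not given in the abstract, not the authors' implementation: the backbone network is replaced by a placeholder, and the class count, segment count, fusion weights, and all function names (segment_scores, consensus_vote, fuse_modalities) are assumptions for this example only.

```python
import numpy as np

NUM_CLASSES = 249     # assumed class count (ChaLearn IsoGD defines 249 gestures)
NUM_SEGMENTS = 5      # assumed number of temporally sampled snippets per video

def segment_scores(video_frames, num_segments=NUM_SEGMENTS):
    """Split a video into equal-length segments, sample one snippet from each,
    and return a (num_segments, NUM_CLASSES) array of class scores.
    The per-snippet network forward pass is a placeholder."""
    bounds = np.linspace(0, len(video_frames), num_segments + 1, dtype=int)
    scores = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        snippet = video_frames[start:end]
        # placeholder for a CNN forward pass on this snippet
        scores.append(np.random.randn(NUM_CLASSES))
    return np.stack(scores)

def consensus_vote(scores):
    """Aggregate per-segment scores into a video-level prediction by averaging,
    a simple consensus that smooths out noisy individual segments."""
    return scores.mean(axis=0)

def fuse_modalities(modality_scores, weights):
    """Late fusion: weighted average of video-level scores from each modality
    (e.g., RGB, depth, saliency)."""
    stacked = np.stack(list(modality_scores.values()))
    w = np.asarray([weights[m] for m in modality_scores])[:, None]
    return (w * stacked).sum(axis=0) / w.sum()

if __name__ == "__main__":
    video = np.zeros((120, 224, 224, 3))  # dummy 120-frame clip
    per_modality = {
        m: consensus_vote(segment_scores(video))
        for m in ("rgb", "depth", "saliency")
    }
    # illustrative fusion weights only; the actual weighting would be tuned
    fused = fuse_modalities(per_modality, {"rgb": 1.0, "depth": 1.0, "saliency": 0.5})
    print("predicted class:", int(np.argmax(fused)))
```

The key design point the sketch conveys is that each modality first reaches a stable video-level prediction through segment-wise consensus, and only then are modalities combined, so a weak or noisy stream cannot dominate individual snippets.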
Supplemental Material
Supplemental movie, appendix, image, and software files for "A Unified Framework for Multi-Modal Isolated Gesture Recognition" are available for download.