Abstract
In this article, we focus on isolated gesture recognition and explore multiple modalities, namely the RGB, depth, and saliency streams. Our goal is to push the boundary of this field even further by proposing a unified framework that exploits the advantages of multi-modality fusion. Specifically, a spatial-temporal network architecture based on consensus voting is proposed to explicitly model the long-term structure of the video sequence and to reduce estimation variance in the face of substantial inter-class variations. In addition, a three-dimensional depth-saliency convolutional network is aggregated in parallel to capture subtle motion characteristics. Extensive experiments are conducted to analyze the performance of each component, and our proposed approach achieves the best results on two public benchmarks, ChaLearn IsoGD and RGBD-HuDaAct, outperforming the closest competitors by margins of over 10% and 15%, respectively. Our project and code will be released at https://davidsonic.github.io/index/acm_tomm_2017.html.
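To make the two ideas summarized above concrete, the sketch below illustrates consensus voting over temporally sampled segments and late fusion of per-modality scores. It is a minimal illustration under assumed details not given in the abstract, not the authors' implementation: the backbone network is replaced by a placeholder, and the class count, segment count, fusion weights, and all function names (segment_scores, consensus_vote, fuse_modalities) are assumptions for this example only.

```python
import numpy as np

NUM_CLASSES = 249     # assumed class count (ChaLearn IsoGD defines 249 gestures)
NUM_SEGMENTS = 5      # assumed number of temporally sampled snippets per video

def segment_scores(video_frames, num_segments=NUM_SEGMENTS):
    """Split a video into equal-length segments, sample one snippet from each,
    and return a (num_segments, NUM_CLASSES) array of class scores.
    The per-snippet network forward pass is a placeholder."""
    bounds = np.linspace(0, len(video_frames), num_segments + 1, dtype=int)
    scores = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        snippet = video_frames[start:end]
        # placeholder for a CNN forward pass on this snippet
        scores.append(np.random.randn(NUM_CLASSES))
    return np.stack(scores)

def consensus_vote(scores):
    """Aggregate per-segment scores into a video-level prediction by averaging,
    a simple consensus that smooths out noisy individual segments."""
    return scores.mean(axis=0)

def fuse_modalities(modality_scores, weights):
    """Late fusion: weighted average of video-level scores from each modality
    (e.g., RGB, depth, saliency)."""
    stacked = np.stack(list(modality_scores.values()))
    w = np.asarray([weights[m] for m in modality_scores])[:, None]
    return (w * stacked).sum(axis=0) / w.sum()

if __name__ == "__main__":
    video = np.zeros((120, 224, 224, 3))  # dummy 120-frame clip
    per_modality = {
        m: consensus_vote(segment_scores(video))
        for m in ("rgb", "depth", "saliency")
    }
    # illustrative fusion weights only; the actual weighting would be tuned
    fused = fuse_modalities(per_modality, {"rgb": 1.0, "depth": 1.0, "saliency": 0.5})
    print("predicted class:", int(np.argmax(fused)))
```

The key design point the sketch conveys is that each modality first reaches a stable video-level prediction through segment-wise consensus, and only then are modalities combined, so a weak or noisy stream cannot dominate individual snippets.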
Supplemental Material
Supplemental movie, appendix, image, and software files for "A Unified Framework for Multi-Modal Isolated Gesture Recognition" are available for download.