
A Unified Framework for Multi-Modal Isolated Gesture Recognition

Published: 21 February 2018

Abstract

In this article, we focus on isolated gesture recognition and explore different modalities by examining RGB, depth, and saliency streams. Our goal is to push the state of the art further by proposing a unified framework that exploits the advantages of multi-modal fusion. Specifically, we propose a spatial-temporal network architecture based on consensus voting to explicitly model the long-term structure of the video sequence and to reduce estimation variance in the face of large inter-class variations. In addition, a three-dimensional depth-saliency convolutional network is aggregated in parallel to capture subtle motion characteristics. Extensive experiments are conducted to analyze the contribution of each component, and the proposed approach achieves the best results on two public benchmarks, ChaLearn IsoGD and RGBD-HuDaAct, outperforming the closest competitors by margins of over 10% and 15%, respectively. Our project and code will be released at https://davidsonic.github.io/index/acm_tomm_2017.html.
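
The consensus-voting and multi-modal fusion ideas can be pictured as simple score averaging over temporally sampled snippets, followed by a weighted combination across modalities. The minimal NumPy sketch below is an illustration only: the function names, the equal-weight late fusion, and the use of plain averaging as the consensus function are assumptions made for exposition, not the exact procedure or hyper-parameters of the paper.

    import numpy as np

    def consensus_vote(snippet_scores):
        """Average per-snippet class scores into one video-level score vector.

        snippet_scores: (num_snippets, num_classes) array, e.g. softmax outputs
        of a CNN applied to frames or clips sampled along the video. Averaging
        is one simple consensus function; it lowers the variance of any single
        snippet's estimate.
        """
        return snippet_scores.mean(axis=0)

    def fuse_modalities(modality_scores, weights=None):
        """Late fusion: weighted average of video-level score vectors from each
        modality (e.g. RGB, depth, saliency)."""
        scores = np.stack(modality_scores)          # (num_modalities, num_classes)
        if weights is None:
            weights = np.ones(scores.shape[0])
        weights = np.asarray(weights, dtype=float)
        weights = weights / weights.sum()
        return (weights[:, None] * scores).sum(axis=0)

    # Toy usage with random scores: 5 snippets, 249 classes (the ChaLearn IsoGD
    # label count), three modalities fused with equal weights.
    rng = np.random.default_rng(0)
    rgb      = consensus_vote(rng.random((5, 249)))
    depth    = consensus_vote(rng.random((5, 249)))
    saliency = consensus_vote(rng.random((5, 249)))
    print(int(np.argmax(fuse_modalities([rgb, depth, saliency]))))
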




      • Published in

        ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 14, Issue 1s
        Special Section on Representation, Analysis and Recognition of 3D Humans and Special Section on Multimedia Computing and Applications of Socio-Affective Behaviors in the Wild
        March 2018
        234 pages
        ISSN:1551-6857
        EISSN:1551-6865
        DOI:10.1145/3190503
        Issue’s Table of Contents

        Copyright © 2018 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 21 February 2018
        • Accepted: 1 August 2017
        • Revised: 1 June 2017
        • Received: 1 January 2017
Published in TOMM Volume 14, Issue 1s


        Qualifiers

        • research-article
        • Research
        • Refereed
