ABSTRACT
This paper addresses audio-visual command recognition in the framework of the D-META Grand Challenge. Temporal and non-temporal learning models are trained on visual and auditory descriptors. To set a proper baseline, the methods are evaluated on the "Robot Gestures" scenario of the publicly available RAVEL data set, following a leave-one-out cross-validation strategy. Fusing the two modalities at the classification level compensates for the errors of the unimodal (audio-only or vision-only) classifiers. The obtained results (an average audio-visual recognition rate of almost 80%) encourage us to further develop and improve the methodology described in this paper.
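The classification-level (late) fusion mentioned above can be illustrated with a minimal sketch: each unimodal classifier produces per-class posterior scores, and the fused decision is taken on a weighted combination of the two. The scores and the fusion weight `alpha` below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical per-class posteriors from independent audio and visual
# classifiers for one test clip (5 command classes). These numbers are
# made up for illustration only.
audio_scores = np.array([0.10, 0.55, 0.15, 0.10, 0.10])
visual_scores = np.array([0.30, 0.20, 0.35, 0.10, 0.05])

def late_fusion(p_audio, p_visual, alpha=0.5):
    """Classification-level fusion: weighted sum of the two unimodal
    posteriors, renormalised to sum to one. alpha is an assumed weight."""
    fused = alpha * p_audio + (1.0 - alpha) * p_visual
    return fused / fused.sum()

fused = late_fusion(audio_scores, visual_scores)
predicted_class = int(np.argmax(fused))  # class index with highest fused score
```

Here the visual classifier alone would pick class 2 and the audio classifier class 1; the fused scores let the more confident audio evidence prevail, which is how late fusion can correct unimodal errors.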
Index Terms
- Audio-visual robot command recognition: D-META'12 grand challenge