ABSTRACT
This paper addresses audio-visual command recognition in the framework of the D-META Grand Challenge. Temporal and non-temporal learning models are trained on visual and auditory descriptors. To set a proper baseline, the methods are evaluated on the "Robot Gestures" scenario of the publicly available RAVEL data set, following a leave-one-out cross-validation strategy. Fusing the two modalities at the classification level compensates for the errors of the unimodal (audio-only or vision-only) classifiers. The obtained results (an average audio-visual recognition rate of almost 80%) encourage us to further develop and improve the methodology described in this paper.
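The classification-level (late) fusion mentioned above can be illustrated with a minimal sketch: each unimodal classifier produces per-class posterior scores, and the fused decision is taken on a weighted combination of the two. The scores and the fusion weight `alpha` below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical per-class posteriors from independent audio and visual
# classifiers for one test clip (5 command classes). These numbers are
# made up for illustration only.
audio_scores = np.array([0.10, 0.55, 0.15, 0.10, 0.10])
visual_scores = np.array([0.30, 0.20, 0.35, 0.10, 0.05])

def late_fusion(p_audio, p_visual, alpha=0.5):
    """Classification-level fusion: weighted sum of the two unimodal
    posteriors, renormalised to sum to one. alpha is an assumed weight."""
    fused = alpha * p_audio + (1.0 - alpha) * p_visual
    return fused / fused.sum()

fused = late_fusion(audio_scores, visual_scores)
predicted_class = int(np.argmax(fused))  # class index with highest fused score
```

Here the visual classifier alone would pick class 2 and the audio classifier class 1; the fused scores let the more confident audio evidence prevail, which is how late fusion can correct unimodal errors.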
Index Terms
- Audio-visual robot command recognition: D-META'12 grand challenge