Audio-visual robot command recognition: D-META'12 grand challenge

Research article
Published: 22 October 2012
DOI: 10.1145/2388676.2388760

ABSTRACT

This paper addresses the problem of audio-visual command recognition in the framework of the D-META Grand Challenge. Temporal and non-temporal learning models are trained on visual and auditory descriptors. To set a proper baseline, the methods are tested on the "Robot Gestures" scenario of the publicly available RAVEL data set, following a leave-one-out cross-validation strategy. Fusing the audio and visual modalities at the classification level compensates for the errors of the unimodal (audio-only or vision-only) classifiers. The results obtained, an average audio-visual recognition rate of almost 80%, encourage further development and improvement of the methodology described in this paper.
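As an illustration of the two ideas named in the abstract, the sketch below shows one common way to fuse classifier outputs at the classification level and to generate leave-one-out splits. It is not the authors' implementation: the abstract does not state the fusion rule or the unit that is held out, so the weighted log-linear rule, the `audio_weight` parameter, the command labels, and the function names are all assumptions made for this example.

```python
from math import log


def fuse_posteriors(audio_probs, visual_probs, audio_weight=0.5):
    """Return the label with the highest weighted log-linear fused score.

    Both arguments map class labels to posterior probabilities produced by
    a unimodal classifier. The fusion rule is an assumption of this sketch,
    not the rule used in the paper.
    """
    eps = 1e-12  # avoids log(0) when a classifier outputs a zero probability
    scores = {}
    for label in audio_probs:
        scores[label] = (
            audio_weight * log(audio_probs[label] + eps)
            + (1.0 - audio_weight) * log(visual_probs[label] + eps)
        )
    return max(scores, key=scores.get)


def leave_one_out_splits(items):
    """Yield (training_items, held_out_item) pairs, holding out one item at a time."""
    for held_out in items:
        training = [item for item in items if item != held_out]
        yield training, held_out


# Hypothetical posteriors for three made-up command labels.
audio = {"stop": 0.40, "come": 0.45, "bye": 0.15}
visual = {"stop": 0.70, "come": 0.20, "bye": 0.10}

audio_only = max(audio, key=audio.get)
fused = fuse_posteriors(audio, visual)
print(audio_only, fused)
```

With these made-up numbers the audio classifier alone prefers one label, while the fused score prefers the label that the visual classifier supports strongly. This is the kind of error compensation the abstract attributes to classification-level fusion.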


Published in

ICMI '12: Proceedings of the 14th ACM international conference on Multimodal interaction
October 2012, 636 pages
ISBN: 9781450314671
DOI: 10.1145/2388676

Copyright © 2012 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher: Association for Computing Machinery, New York, NY, United States


Overall acceptance rate: 453 of 1,080 submissions (42%)