Detection and localization of 3d audio-visual objects using unsupervised clustering

Authors: Vasil Khalidov, Florence Forbes, Radu Horaud

ICMI '08: Proceedings of the 10th international conference on Multimodal interfaces

Pages 217 - 224

https://doi.org/10.1145/1452392.1452438

Published: 20 October 2008

Abstract

This paper addresses the issues of detecting and localizing objects in a scene that are both seen and heard. We explain the benefits of a human-like configuration of sensors (binaural and binocular) for gathering auditory and visual observations. It is shown that the detection and localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between audio and visual observations. This model maps the data into a common audio-visual 3D representation via a pair of mixture models. Inference is performed by a version of the expectation-maximization algorithm, which is formally derived, and which provides cooperative estimates of both the auditory activity and the 3D position of each object. We describe several experiments with single- and multiple-speaker detection and localization, in the presence of other audio sources.
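The abstract's central idea, recasting detection and localization as clustering with a mixture model fitted by expectation-maximization, can be illustrated with a minimal sketch. This is not the authors' audio-visual model: it is a plain isotropic Gaussian mixture fitted to synthetic 3D points, and the function name `em_gmm`, the farthest-point initialisation, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def em_gmm(points, n_clusters, n_iters=50):
    """Fit an isotropic Gaussian mixture to `points` (N x D) with plain EM.

    Illustrative stand-in only: the paper couples audio and visual
    observations, whereas this sketch clusters one set of 3D points.
    """
    n, d = points.shape
    # Deterministic farthest-point initialisation of the cluster means.
    means = [points[0]]
    for _ in range(1, n_clusters):
        d2 = np.min([((points - m) ** 2).sum(axis=1) for m in means], axis=0)
        means.append(points[np.argmax(d2)])
    means = np.array(means)
    variances = np.full(n_clusters, points.var())
    weights = np.full(n_clusters, 1.0 / n_clusters)
    for _ in range(n_iters):
        # E-step: posterior responsibility of each cluster for each point.
        sq_dist = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        log_prob = (np.log(weights)
                    - 0.5 * d * np.log(2.0 * np.pi * variances)
                    - 0.5 * sq_dist / variances)
        log_prob -= log_prob.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_prob)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights, means, and variances.
        mass = resp.sum(axis=0) + 1e-12
        weights = mass / n
        means = (resp.T @ points) / mass[:, None]
        sq_dist = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        variances = (resp * sq_dist).sum(axis=0) / (d * mass) + 1e-6
    return weights, means, variances, resp

# Synthetic scene (assumed): two "objects" at known 3D positions.
rng = np.random.default_rng(0)
true_centres = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 2.0]])
observations = np.vstack(
    [c + 0.05 * rng.standard_normal((100, 3)) for c in true_centres]
)
weights, means, variances, resp = em_gmm(observations, n_clusters=2)
labels = resp.argmax(axis=1)  # hard assignment of each observation
```

In this sketch the estimated `means` play the role of object positions and `resp` gives the soft assignment of each observation to an object. The model in the paper additionally ties auditory and visual observations to the same 3D positions and estimates each object's auditory activity, which this example does not attempt.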

Index Terms

Detection and localization of 3d audio-visual objects using unsupervised clustering
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks
        Vision for robotics
  2. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis

Published In

ICMI '08: Proceedings of the 10th international conference on Multimodal interfaces

October 2008, 322 pages

ISBN: 9781605581989

DOI: 10.1145/1452392

General Chairs: Vassilis Digalakis (TU Crete, Greece), Alex Potamianos (TU Crete, Greece), Matthew Turk (UC Santa Barbara, USA)

Program Chairs: Roberto Pieraccini (SpeechCycle, USA), Yuri Ivanov (MERL Research, USA)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ICMI '08: International Conference on Multimodal Interfaces

October 20-22, 2008

Chania, Crete, Greece

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%

Cited By

Jensen J, Christensen M (2015). A joint audio-visual approach to audio localization. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 454--458. Online publication date: Apr-2015.
https://doi.org/10.1109/ICASSP.2015.7178010

Alameda-Pineda X, Horaud R (2014). Vision-guided robot hearing. The International Journal of Robotics Research, 34(4-5):437--456. Online publication date: 27-Oct-2014.
https://doi.org/10.1177/0278364914548050

Gebru I, Alameda-Pineda X, Horaud R, Forbes F (2014). Audio-visual speaker localization via weighted clustering. 2014 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 1--6. Online publication date: Sep-2014.
https://doi.org/10.1109/MLSP.2014.6958874

Alameda-Pineda X, Khalidov V, Horaud R, Forbes F, Bourlard H, Huang T, Vidal E, Gatica-Perez D, Morency L, Sebe N (2011). Finding audio-visual events in informal social gatherings. Proceedings of the 13th international conference on multimodal interfaces, pages 247--254. Online publication date: 14-Nov-2011.
https://dl.acm.org/doi/10.1145/2070481.2070527

Arnaud E, Christensen H, Lu Y, Barker J, Khalidov V, Hansard M, Holveck B, Mathieu H, Narasimha R, Taillant E, Forbes F, Horaud R, Digalakis V, Potamianos A, Turk M, Pieraccini R, Ivanov Y (2008). The CAVA corpus. Proceedings of the 10th international conference on Multimodal interfaces, pages 109--116. Online publication date: 20-Oct-2008.
https://dl.acm.org/doi/10.1145/1452392.1452414
