DOI: 10.1145/1452392.1452438
research-article

Detection and localization of 3D audio-visual objects using unsupervised clustering

Published: 20 October 2008

Abstract

This paper addresses the issues of detecting and localizing objects in a scene that are both seen and heard. We explain the benefits of a human-like configuration of sensors (binaural and binocular) for gathering auditory and visual observations. It is shown that the detection and localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between audio and visual observations. This model maps the data into a common audio-visual 3D representation via a pair of mixture models. Inference is performed by a version of the expectation-maximization algorithm, which is formally derived, and which provides cooperative estimates of both the auditory activity and the 3D position of each object. We describe several experiments with single- and multiple-speaker detection and localization, in the presence of other audio sources.
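
To make the clustering idea concrete, here is a minimal sketch of expectation-maximization for a plain isotropic Gaussian mixture over 3D points. It is not the paper's model: the abstract describes a pair of mixture models tied to a common audio-visual 3D representation, together with estimates of auditory activity, and none of that is reproduced here. The function name `em_gmm`, the farthest-point initialisation, and the synthetic two-cluster data are all assumptions made for illustration.

```python
import numpy as np


def em_gmm(points, k, n_iter=50):
    """Fit a k-component isotropic Gaussian mixture to (n, d) points with EM.

    Illustrative only: a generic mixture model, not the audio-visual model
    described in the abstract.
    """
    n, d = points.shape
    # Deterministic farthest-point initialisation of the component means.
    means = [points[0]]
    for _ in range(1, k):
        d2 = np.min([((points - m) ** 2).sum(axis=1) for m in means], axis=0)
        means.append(points[d2.argmax()])
    means = np.array(means, dtype=float)
    variances = np.full(k, points.var())  # one shared starting variance
    weights = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        sq_dist = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        log_p = (np.log(weights)
                 - 0.5 * d * np.log(2.0 * np.pi * variances)
                 - 0.5 * sq_dist / variances)
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate weights, means and variances from soft counts.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ points) / nk[:, None]
        sq_dist = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        variances = (resp * sq_dist).sum(axis=0) / (d * nk) + 1e-9

    return weights, means, variances, resp


# Synthetic stand-in for 3D observations of two objects (hypothetical data).
rng = np.random.default_rng(1)
true_centres = np.array([[-1.0, 0.0, 2.0], [1.0, 0.0, 3.0]])
points = np.vstack(
    [c + 0.1 * rng.standard_normal((200, 3)) for c in true_centres]
)

weights, means, variances, resp = em_gmm(points, k=2)
labels = resp.argmax(axis=1)  # hard cluster assignment per observation
print(means)
```

In this toy setting each recovered mean plays the role of an object's estimated 3D position, and the responsibilities in `resp` say which observations belong to which object. Choosing the number of components `k` is left out of the sketch.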

Published In

ICMI '08: Proceedings of the 10th international conference on Multimodal interfaces
October 2008
322 pages
ISBN:9781605581989
DOI:10.1145/1452392
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. audio-visual clustering
  2. binaural hearing
  3. mixture models
  4. stereo vision

Conference

ICMI '08: International Conference on Multimodal Interfaces
October 20 - 22, 2008
Chania, Crete, Greece

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%

Cited By

  • (2015) A joint audio-visual approach to audio localization. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 454-458. DOI: 10.1109/ICASSP.2015.7178010. Online publication date: Apr-2015
  • (2014) Vision-guided robot hearing. The International Journal of Robotics Research, 34:4-5, pages 437-456. DOI: 10.1177/0278364914548050. Online publication date: 27-Oct-2014
  • (2014) Audio-visual speaker localization via weighted clustering. 2014 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 1-6. DOI: 10.1109/MLSP.2014.6958874. Online publication date: Sep-2014
  • (2011) Finding audio-visual events in informal social gatherings. Proceedings of the 13th international conference on multimodal interfaces, pages 247-254. DOI: 10.1145/2070481.2070527. Online publication date: 14-Nov-2011
  • (2008) The CAVA corpus. Proceedings of the 10th international conference on Multimodal interfaces, pages 109-116. DOI: 10.1145/1452392.1452414. Online publication date: 20-Oct-2008
