
A joint particle filter for audio-visual speaker tracking

Published: 04 October 2005

Abstract

In this paper, we present a novel approach for tracking a lecturer during his lecture. We use features from multiple cameras and microphones and process them in a joint particle filter framework. The filter performs sampled projections of 3D location hypotheses and scores them using features from both audio and video. On the video side, the features are based on foreground segmentation, multi-view face detection and upper-body detection. On the audio side, the time delays of arrival between pairs of microphones are estimated with a generalized cross correlation function. Computationally expensive features are evaluated only at the particles' projected positions in the respective camera images, so the complexity of the proposed algorithm stays low. We evaluated the system on data recorded during actual lectures. The average tracking error was 36 cm for video-only tracking, 46 cm for audio-only tracking, and 31 cm for the combined audio-visual system.
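The abstract outlines two pieces that can be sketched in code. First, the audio side: time delays of arrival (TDOAs) between microphone pairs are estimated with a generalized cross correlation. The sketch below uses the common PHAT weighting; this is an illustrative assumption, not necessarily the exact weighting the authors used, and the function name `gcc_phat_tdoa` and its parameters are hypothetical.

```python
import numpy as np

def gcc_phat_tdoa(sig, ref, fs, max_tau=None, interp=16):
    """Estimate the time delay of arrival (seconds) between two microphone
    signals via generalized cross correlation with PHAT weighting."""
    n = sig.size + ref.size
    # Cross-power spectrum, whitened by its magnitude (the PHAT weighting).
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    # Inverse transform with zero-padding interpolates the correlation peak.
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=interp * n)
    max_shift = interp * n // 2
    if max_tau is not None:
        max_shift = min(int(interp * fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    # The correlation peak location gives the delay in interpolated samples.
    return (np.argmax(np.abs(cc)) - max_shift) / float(interp * fs)
```

Second, the fusion itself: each particle is a 3D location hypothesis that is projected into every camera view and scored by both modalities. Below is a minimal one-cycle sketch under stated assumptions: `cameras` maps camera IDs to 3x4 projection matrices, `video_score` is a hypothetical callback standing in for the paper's foreground-segmentation and face/upper-body detector scores at the projected position, and the Gaussian TDOA error model is a simplification of the audio likelihood, not the authors' exact formulation.

```python
def project(P, x):
    """Project a 3D point into image coordinates with a 3x4 camera matrix P."""
    u = P @ np.append(x, 1.0)
    return u[:2] / u[2]

def audio_likelihood(x, mic_pairs, tdoas, c=343.0, sigma=1e-4):
    """Score a 3D hypothesis against measured TDOAs (Gaussian error model)."""
    w = 1.0
    for (m1, m2), tau in zip(mic_pairs, tdoas):
        tau_hyp = (np.linalg.norm(x - m1) - np.linalg.norm(x - m2)) / c
        w *= np.exp(-0.5 * ((tau - tau_hyp) / sigma) ** 2)
    return w

def joint_filter_step(particles, cameras, video_score, mic_pairs, tdoas,
                      motion_noise=0.05):
    """One predict-update-resample cycle of a joint audio-visual particle filter."""
    n = len(particles)
    # Predict: diffuse the 3D hypotheses with a random-walk motion model.
    particles = particles + np.random.normal(0.0, motion_noise, particles.shape)
    # Update: weight each hypothesis by video evidence at its projected image
    # positions, multiplied by the audio (TDOA) likelihood.
    weights = np.empty(n)
    for i, x in enumerate(particles):
        w_video = np.prod([video_score(cam_id, project(P, x))
                           for cam_id, P in cameras.items()])
        weights[i] = w_video * audio_likelihood(x, mic_pairs, tdoas)
    weights += 1e-12                      # guard against all-zero weights
    weights /= weights.sum()
    # Resample: draw particles in proportion to their weights.
    return particles[np.random.choice(n, size=n, p=weights)]
```

A tracker would call `gcc_phat_tdoa` once per microphone pair and frame, feed the resulting delays into `joint_filter_step`, and report the weighted particle mean as the speaker position. Evaluating image features only at each particle's projected position, as in the update loop above, is what keeps the per-frame cost proportional to the number of particles rather than to image size, which is the property the abstract highlights. All numeric values (noise levels, `sigma`) are placeholders.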



Published In

ICMI '05: Proceedings of the 7th International Conference on Multimodal Interfaces. Association for Computing Machinery, New York, NY, United States, October 2005. 344 pages. ISBN 1595930280. Proceedings DOI: 10.1145/1088463. Article DOI: 10.1145/1088463.1088477.


Author Tags

  1. multimodal systems
  2. particle filters
  3. speaker tracking

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%


