
A joint particle filter for audio-visual speaker tracking

Published: 04 October 2005

Abstract

In this paper, we present a novel approach for tracking a lecturer during his lecture. We use features from multiple cameras and microphones and process them in a joint particle filter framework. The filter performs sampled projections of 3D location hypotheses and scores them using features from both audio and video. On the video side, the features are based on foreground segmentation, multi-view face detection and upper-body detection. On the audio side, the time delays of arrival between pairs of microphones are estimated with a generalized cross correlation function. Computationally expensive features are evaluated only at the particles' projected positions in the respective camera images, so the complexity of the proposed algorithm stays low. We evaluated the system on data recorded during actual lectures. The average tracking error was 36 cm for video-only tracking, 46 cm for audio-only tracking, and 31 cm for the combined audio-visual system.
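The abstract outlines two pieces that can be sketched in code. First, the audio side: time delays of arrival (TDOAs) between microphone pairs are estimated with a generalized cross correlation. The sketch below uses the common PHAT weighting; this is an illustrative assumption, not necessarily the exact weighting the authors used, and the function name `gcc_phat_tdoa` and its parameters are hypothetical.

```python
import numpy as np

def gcc_phat_tdoa(sig, ref, fs, max_tau=None, interp=16):
    """Estimate the time delay of arrival (seconds) between two microphone
    signals via generalized cross correlation with PHAT weighting."""
    n = sig.size + ref.size
    # Cross-power spectrum, whitened by its magnitude (the PHAT weighting).
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    # Inverse transform with zero-padding interpolates the correlation peak.
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=interp * n)
    max_shift = interp * n // 2
    if max_tau is not None:
        max_shift = min(int(interp * fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    # The correlation peak location gives the delay in interpolated samples.
    return (np.argmax(np.abs(cc)) - max_shift) / float(interp * fs)
```

Second, the fusion itself: each particle is a 3D location hypothesis that is projected into every camera view and scored by both modalities. Below is a minimal one-cycle sketch under stated assumptions: `cameras` maps camera IDs to 3x4 projection matrices, `video_score` is a hypothetical callback standing in for the paper's foreground-segmentation and face/upper-body detector scores at the projected position, and the Gaussian TDOA error model is a simplification of the audio likelihood, not the authors' exact formulation.

```python
def project(P, x):
    """Project a 3D point into image coordinates with a 3x4 camera matrix P."""
    u = P @ np.append(x, 1.0)
    return u[:2] / u[2]

def audio_likelihood(x, mic_pairs, tdoas, c=343.0, sigma=1e-4):
    """Score a 3D hypothesis against measured TDOAs (Gaussian error model)."""
    w = 1.0
    for (m1, m2), tau in zip(mic_pairs, tdoas):
        tau_hyp = (np.linalg.norm(x - m1) - np.linalg.norm(x - m2)) / c
        w *= np.exp(-0.5 * ((tau - tau_hyp) / sigma) ** 2)
    return w

def joint_filter_step(particles, cameras, video_score, mic_pairs, tdoas,
                      motion_noise=0.05):
    """One predict-update-resample cycle of a joint audio-visual particle filter."""
    n = len(particles)
    # Predict: diffuse the 3D hypotheses with a random-walk motion model.
    particles = particles + np.random.normal(0.0, motion_noise, particles.shape)
    # Update: weight each hypothesis by video evidence at its projected image
    # positions, multiplied by the audio (TDOA) likelihood.
    weights = np.empty(n)
    for i, x in enumerate(particles):
        w_video = np.prod([video_score(cam_id, project(P, x))
                           for cam_id, P in cameras.items()])
        weights[i] = w_video * audio_likelihood(x, mic_pairs, tdoas)
    weights += 1e-12                      # guard against all-zero weights
    weights /= weights.sum()
    # Resample: draw particles in proportion to their weights.
    return particles[np.random.choice(n, size=n, p=weights)]
```

A tracker would call `gcc_phat_tdoa` once per microphone pair and frame, feed the resulting delays into `joint_filter_step`, and report the weighted particle mean as the speaker position. Evaluating image features only at each particle's projected position, as in the update loop above, is what keeps the per-frame cost proportional to the number of particles rather than to image size, which is the property the abstract highlights. All numeric values (noise levels, `sigma`) are placeholders.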



Published In

ICMI '05: Proceedings of the 7th International Conference on Multimodal Interfaces. Association for Computing Machinery, New York, NY, United States, October 2005. 344 pages. ISBN 1595930280. Proceedings DOI: 10.1145/1088463. Article DOI: 10.1145/1088463.1088477.


Author Tags

  1. multimodal systems
  2. particle filters
  3. speaker tracking

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%


