ABSTRACT
Tracking speakers in multiparty conversations is a fundamental task in automatic meeting analysis. In this paper, we present a probabilistic approach to jointly tracking the location and speaking activity of multiple speakers in a multisensor meeting room equipped with a small microphone array and multiple uncalibrated cameras. Our framework is based on a mixed-state dynamic graphical model defined on a multiperson state-space, which includes an explicit proximity-based interaction model. The model integrates audio-visual (AV) data through a novel observation model. Audio observations are derived from a source localization algorithm. Visual observations are based on models of the shape and spatial structure of human heads. Approximate inference, which the complexity of the model makes necessary, is performed with a Markov chain Monte Carlo particle filter (MCMC-PF), which results in high sampling efficiency. We present results, based on an objective evaluation procedure, showing that our framework (1) locates and tracks the position and speaking activity of multiple meeting participants engaged in real conversations with good accuracy; (2) can deal with visual clutter and partial occlusion; and (3) significantly outperforms a traditional sampling-based approach.
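To make the sampling idea more concrete, the following is a minimal toy sketch of a Metropolis-Hastings sampler over a joint multi-person state with a proximity-based interaction penalty. It is not the paper's implementation: the Gaussian observation model, the linear pairwise penalty, the one-person-at-a-time random-walk proposal, and every name and parameter (`interaction_log_prior`, `log_likelihood`, `mcmc_samples`, `min_dist`, `strength`, `sigma`, `step`) are assumptions made for illustration only. The temporal dynamics, the speaking-activity variable, and the audio-visual observation model described above are all omitted.

```python
import math
import random


def interaction_log_prior(state, min_dist=0.5, strength=5.0):
    """Assumed pairwise penalty: discourages two people closer than min_dist."""
    total = 0.0
    for i in range(len(state)):
        for j in range(i + 1, len(state)):
            d = math.dist(state[i], state[j])
            if d < min_dist:
                total -= strength * (min_dist - d)
    return total


def log_likelihood(state, observations, sigma=0.2):
    """Toy isotropic Gaussian observation model, one 2-D observation per person."""
    total = 0.0
    for (x, y), (ox, oy) in zip(state, observations):
        total -= ((x - ox) ** 2 + (y - oy) ** 2) / (2.0 * sigma ** 2)
    return total


def log_target(state, observations):
    """Unnormalized log posterior: observation term plus interaction term."""
    return log_likelihood(state, observations) + interaction_log_prior(state)


def mcmc_samples(init_state, observations, n_samples=500, burn_in=100,
                 step=0.1, seed=0):
    """Metropolis-Hastings over the joint state, moving one person per iteration."""
    rng = random.Random(seed)
    state = [tuple(p) for p in init_state]
    current = log_target(state, observations)
    samples = []
    for it in range(burn_in + n_samples):
        # Pick one person and perturb only that person's position.
        k = rng.randrange(len(state))
        x, y = state[k]
        proposal = list(state)
        proposal[k] = (x + rng.gauss(0.0, step), y + rng.gauss(0.0, step))
        candidate = log_target(proposal, observations)
        # The proposal is symmetric, so the acceptance ratio is the target ratio.
        if candidate >= current or rng.random() < math.exp(candidate - current):
            state, current = proposal, candidate
        if it >= burn_in:
            samples.append(list(state))
    return samples


if __name__ == "__main__":
    obs = [(0.0, 0.0), (2.0, 0.0)]
    chain = mcmc_samples([(0.1, 0.1), (1.9, 0.1)], obs, n_samples=2000, burn_in=200)
    mean_x0 = sum(s[0][0] for s in chain) / len(chain)
    print(f"posterior mean x of person 0: {mean_x0:.2f}")
```

Because each iteration changes a single person's state, a proposal that places two people nearly on top of each other is penalized by the interaction term before it can be accepted, which is the intuition behind combining an interaction model with MCMC-style moves in a joint state-space.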