DOI: 10.1145/1322192.1322254
research-article

On-line multi-modal speaker diarization

Published: 12 November 2007

Abstract

This paper presents a novel framework that uses multi-modal information for speaker diarization. We use dynamic Bayesian networks to obtain on-line results, progressing from a simple observation model to a complex multi-modal one as more data becomes available. We present an efficient way to guide the learning procedure of the complex model using the early results obtained with the simple model. We report results in various real-world settings, including webcam videos, human-computer interaction, and video conferences.
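
The abstract does not spell out the paper's models, so the following is only a rough illustration of what on-line inference in a dynamic Bayesian network can look like, not the authors' method. It runs forward filtering in a two-state hidden Markov model, the simplest dynamic Bayesian network, and labels each frame using past observations only. The function names, the transition matrix, and the per-frame likelihood values are all hypothetical.

```python
# Illustrative sketch only: on-line (forward) filtering in a two-state HMM,
# the simplest dynamic Bayesian network. All numbers below are made up.

def forward_step(belief, transition, likelihood):
    """Update P(state_t | obs_1..t) from the previous belief and one new observation."""
    n = len(belief)
    # Predict: propagate the previous belief through the transition model.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Correct: weight by the likelihood of the new observation, then normalize.
    unnorm = [predicted[j] * likelihood[j] for j in range(n)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def diarize_online(likelihoods, transition, prior):
    """Yield the most probable speaker index after each frame, using past data only."""
    belief = list(prior)
    for lik in likelihoods:
        belief = forward_step(belief, transition, lik)
        yield max(range(len(belief)), key=belief.__getitem__)


# Hypothetical setup: two speakers who tend to keep the floor.
TRANSITION = [[0.9, 0.1], [0.1, 0.9]]
PRIOR = [0.5, 0.5]
# Hypothetical per-frame observation likelihoods p(obs_t | speaker).
FRAMES = [[0.8, 0.2], [0.7, 0.3], [0.4, 0.6], [0.1, 0.9], [0.2, 0.8]]

labels = list(diarize_online(FRAMES, TRANSITION, PRIOR))
```

In this toy run the third frame slightly favors the second speaker, yet the filtered label stays with the first speaker until stronger evidence arrives. That inertia comes from the transition model and is the kind of temporal smoothing a dynamic Bayesian network provides.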

Published In

ICMI '07: Proceedings of the 9th international conference on Multimodal interfaces
November 2007
402 pages
ISBN:9781595938176
DOI:10.1145/1322192
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. audio-visual
  2. multi-modal
  3. speaker detection
  4. speaker diarization

Qualifiers

  • Research-article

Conference

ICMI07: International Conference on Multimodal Interfaces
November 12 - 15, 2007
Nagoya, Aichi, Japan

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%

Cited By

  • (2024) A lightweight approach to real-time speaker diarization: from audio toward audio-visual data streams. EURASIP Journal on Audio, Speech, and Music Processing, 2024:1. DOI: 10.1186/s13636-024-00382-2. Online publication date: 28-Nov-2024
  • (2020) Speech Enhancement for Multimodal Speaker Diarization System. IEEE Access, 8 (126671-126680). DOI: 10.1109/ACCESS.2020.3007312. Online publication date: 2020
  • (2019) Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model. Sensors, 19:23 (5163). DOI: 10.3390/s19235163. Online publication date: 25-Nov-2019
  • (2018) Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40:5 (1086-1099). DOI: 10.1109/TPAMI.2017.2648793. Online publication date: 1-May-2018
  • (2015) Multimodal Multi-Channel On-Line Speaker Diarization Using Sensor Fusion Through SVM. IEEE Transactions on Multimedia, 17:10 (1694-1705). DOI: 10.1109/TMM.2015.2463722. Online publication date: Oct-2015
  • (2014) Motion history images for online speaker/signer diarization. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (1537-1541). DOI: 10.1109/ICASSP.2014.6853855. Online publication date: May-2014
  • (2013) Computational Audiovisual Scene Analysis in Online Adaptation of Audio-Motor Maps. IEEE Transactions on Autonomous Mental Development, 5:4 (273-287). DOI: 10.1109/TAMD.2013.2257766. Online publication date: 1-Dec-2013
  • (2012) Probabilistic Speaker Diarization With Bag-of-Words Representations of Speaker Angle Information. IEEE Transactions on Audio, Speech, and Language Processing, 20:2 (447-460). DOI: 10.1109/TASL.2011.2151858. Online publication date: 1-Feb-2012
  • (2012) Speaker Diarization. IEEE Transactions on Audio, Speech, and Language Processing, 20:2 (356-370). DOI: 10.1109/TASL.2011.2125954. Online publication date: 1-Feb-2012
  • (2012) Simple auditory and visual features for human-robot dialog scene analysis. 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (700-706). DOI: 10.1109/IROS.2012.6385534. Online publication date: Oct-2012
