DOI: 10.1145/1647314.1647327

A speaker diarization method based on the probabilistic fusion of audio-visual location information

Published: 02 November 2009

Abstract

This paper proposes a speaker diarization method for determining "who spoke when" in multi-party conversations, based on the probabilistic fusion of audio and visual location information. The audio and visual information is obtained from a compact system designed to analyze round-table multi-party conversations. The system consists of two cameras and a triangular array of three microphones, and covers a spherical region around itself. Speaker locations are estimated from the audio and visual observations as azimuths relative to this recording system. Unlike conventional speaker diarization methods, which rely on a cascade of speech activity detection, direction-of-arrival estimation, acoustic feature extraction, and information-criterion-based speaker segmentation, the proposed method estimates the probability that multiple simultaneous speakers are present at each location in physical space using a small microphone setup. To estimate speaker presence more accurately, these speech presence probabilities are integrated with probabilities derived from the participants' face locations, which are obtained with a robust particle-filter-based face tracker that uses the two cameras equipped with fisheye lenses. Locations in physical space with high integrated probabilities are then classified on-line into a number of speaker classes to realize speaker diarization. Because the probability calculations and speaker classifications are performed on-line, it is unnecessary to observe the entire conversation before producing results. An experiment using real casual conversations, which contain more overlaps and short speech segments than formal meetings, demonstrated the advantages of the proposed method.
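
The following Python sketch illustrates, under simplifying assumptions, the kind of fusion-then-classification pipeline the abstract describes: per-azimuth speech presence probabilities (audio) are combined log-linearly with per-azimuth face presence probabilities (visual), and azimuths whose fused probability crosses a threshold are assigned on-line to speaker classes. The function and class names, the equal fusion weight, the 1-degree azimuth grid, and the greedy nearest-centroid classifier are all illustrative assumptions, not the paper's actual formulation.

    import numpy as np

    def fuse_presence(p_audio, p_visual, w=0.5, eps=1e-12):
        """Log-linear (weighted geometric mean) fusion of speech presence
        (audio) and face presence (visual) probabilities, both given as
        arrays over a 360-bin azimuth grid."""
        return np.exp(w * np.log(p_audio + eps) + (1.0 - w) * np.log(p_visual + eps))

    class OnlineSpeakerClassifier:
        """Greedy on-line assignment of active azimuths to speaker classes;
        a hypothetical stand-in for the paper's on-line classification step."""

        def __init__(self, threshold=0.5, max_dist_deg=15.0, alpha=0.1):
            self.threshold = threshold    # fused probability needed to count as speech
            self.max_dist = max_dist_deg  # azimuth gate for joining an existing class
            self.alpha = alpha            # learning rate for the centroid update
            self.centroids = []           # one azimuth centroid per speaker class

        @staticmethod
        def _circ_diff(a, b):
            # signed circular difference a - b in degrees, in (-180, 180]
            return (a - b + 180.0) % 360.0 - 180.0

        def update(self, p_fused):
            """Process one frame; return {azimuth_bin: speaker_class}."""
            labels = {}
            for az in np.flatnonzero(p_fused > self.threshold):
                az = float(az)
                dists = [abs(self._circ_diff(az, c)) for c in self.centroids]
                if dists and min(dists) < self.max_dist:
                    k = int(np.argmin(dists))
                    # exponential moving average keeps the centroid adaptive
                    self.centroids[k] += self.alpha * self._circ_diff(az, self.centroids[k])
                else:
                    self.centroids.append(az)  # a new speaker class appears
                    k = len(self.centroids) - 1
                labels[int(az)] = k
            return labels

    # Example frame: one speaker visible and speaking near 40 deg, another near 200 deg.
    p_audio = np.full(360, 0.05); p_audio[38:43] = 0.9; p_audio[198:203] = 0.8
    p_visual = np.full(360, 0.05); p_visual[39:44] = 0.95; p_visual[199:204] = 0.9
    clf = OnlineSpeakerClassifier()
    print(clf.update(fuse_presence(p_audio, p_visual)))

In the actual system the audio probabilities would come from the statistical speech presence estimation over physical space and the visual probabilities from the fisheye-camera face tracker; the sketch only shows how two such streams could be fused and classified frame by frame without observing the whole conversation.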




Information & Contributors

Information

Published In

ICMI-MLMI '09: Proceedings of the 2009 international conference on Multimodal interfaces
November 2009
374 pages
ISBN:9781605587721
DOI:10.1145/1647314
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. multi-modal systems
  2. multi-party conversation analysis
  3. speaker diarization

Qualifiers

  • Poster

Conference

ICMI-MLMI '09

Acceptance Rates

Overall acceptance rate: 453 of 1,080 submissions (42%)


Cited By

  • (2019) "Development of Acoustic Nonverbal Information Estimation System for Unconstrained Long-Term Monitoring of Daily Office Activity", IEICE Transactions on Information and Systems, E102.D(2), 331-345. DOI: 10.1587/transinf.2018EDK0005
  • (2017) "A Multifaceted Study on Eye Contact based Speaker Identification in Three-party Conversations", Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 3011-3021. DOI: 10.1145/3025453.3025644
  • (2017) "Analysis of Small Groups", Social Signal Processing, 349-367. DOI: 10.1017/9781316676202.025
  • (2011) "Multimodal conversation scene analysis for understanding people's communicative behaviors in face-to-face meetings", Proceedings of the 1st International Conference on Human Interface and the Management of Information: Interacting with Information, Volume Part II, 171-179. DOI: 10.5555/2021604.2021627
  • (2011) "Multimodal Conversation Scene Analysis for Understanding People's Communicative Behaviors in Face-to-Face Meetings", Human Interface and the Management of Information: Interacting with Information, 171-179. DOI: 10.1007/978-3-642-21669-5_21
  • (2010) "Speech Activity Detection for Multi-Party Conversation Analyses Based on Likelihood Ratio Test on Spatial Magnitude", IEEE Transactions on Audio, Speech, and Language Processing, 18(6), 1354-1365. DOI: 10.1109/TASL.2009.2033955
  • (2010) "Audiovisual Information Fusion in Human-Computer Interfaces and Intelligent Environments: A Survey", Proceedings of the IEEE, 98(10), 1692-1715. DOI: 10.1109/JPROC.2010.2057231
