DOI: 10.1145/2647868.2654929
research-article

"Sheldon speaking, Bonjour!": Leveraging Multilingual Tracks for (Weakly) Supervised Speaker Identification

Published: 03 November 2014

Abstract

We address the problem of speaker identification in multimedia data, and in TV series in particular. While speaker identification is traditionally a supervised machine-learning task, our first contribution is to significantly reduce the need for costly preliminary manual annotations by using automatically aligned (and potentially noisy) fan-generated transcripts and subtitles.
We show that both the speech activity detection and the speech turn identification modules trained in this weakly supervised manner achieve performance similar to that of their fully supervised counterparts (i.e., those relying on fine manual speech/non-speech/speaker annotation).
Our second contribution is the use of the multilingual audio tracks usually available with this kind of content to significantly improve overall speaker identification performance. Reproducible experiments (including dataset, manual annotations, and source code) performed on the first six episodes of The Big Bang Theory TV series show that combining the French audio track (containing dubbed actor voices) with the English one (containing the original actor voices) improves the overall English speaker identification performance by 5% absolute and up to 70% relative on the five main characters.
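
The abstract does not say how the two audio tracks are combined, so the following is only a minimal sketch of one plausible approach: score-level (late) fusion, where each speech turn gets a per-speaker score from an English model and from a French model, and the two are averaged with a weight. The function names, the `weight_en` parameter, and the toy scores are all assumptions made for illustration; they are not taken from the paper or its source code. The `improvement` helper shows one common reading of "absolute" versus "relative" gains (difference in error rate, and that difference divided by the baseline error rate); the abstract does not state which metric its figures refer to.

```python
import numpy as np


def fuse_scores(scores_en, scores_fr, weight_en=0.5):
    """Weighted sum of per-turn, per-speaker scores from two audio tracks.

    Both inputs have shape (n_turns, n_speakers). This late-fusion rule is an
    assumption for illustration, not the method described in the paper.
    """
    scores_en = np.asarray(scores_en, dtype=float)
    scores_fr = np.asarray(scores_fr, dtype=float)
    if scores_en.shape != scores_fr.shape:
        raise ValueError("both tracks must score the same turns and speakers")
    return weight_en * scores_en + (1.0 - weight_en) * scores_fr


def identify(scores, speakers):
    """Pick the highest-scoring speaker for every speech turn."""
    return [speakers[i] for i in np.argmax(scores, axis=1)]


def error_rate(predicted, reference):
    """Fraction of speech turns assigned to the wrong speaker."""
    wrong = sum(p != r for p, r in zip(predicted, reference))
    return wrong / len(reference)


def improvement(baseline_error, fused_error):
    """Return (absolute, relative) error reduction over the baseline."""
    absolute = baseline_error - fused_error
    relative = absolute / baseline_error if baseline_error > 0 else 0.0
    return absolute, relative


if __name__ == "__main__":
    # Toy scores, invented for this example only.
    speakers = ["Sheldon", "Leonard", "Penny"]
    reference = ["Sheldon", "Sheldon", "Penny"]
    scores_en = [[0.60, 0.30, 0.10], [0.40, 0.45, 0.15], [0.20, 0.30, 0.50]]
    scores_fr = [[0.50, 0.40, 0.10], [0.70, 0.20, 0.10], [0.10, 0.20, 0.70]]

    english_only = identify(np.asarray(scores_en), speakers)
    fused = identify(fuse_scores(scores_en, scores_fr), speakers)

    print("English only:", english_only)
    print("Fused:       ", fused)
    print(improvement(error_rate(english_only, reference),
                      error_rate(fused, reference)))
```

In this toy case the second turn is ambiguous on the English track alone, and the French track tips the decision. Whether dubbed voices help in practice depends on the data, which is what the reported experiments measure.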


Cited By

  • (2016) Improving Speaker Diarization of TV Series using Talking-Face Detection and Clustering. Proceedings of the 24th ACM international conference on Multimedia, pages 157-161. DOI: 10.1145/2964284.2967202. Online publication date: 1-Oct-2016
  • (2016) Unsupervised person clustering in videos with cross-modal communication. 2016 Visual Communications and Image Processing (VCIP), pages 1-4. DOI: 10.1109/VCIP.2016.7805581. Online publication date: Nov-2016

      Published In

      MM '14: Proceedings of the 22nd ACM international conference on Multimedia
      November 2014
      1310 pages
      ISBN:9781450330633
      DOI:10.1145/2647868
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. multilingual fusion
      2. multimedia data
      3. speaker identification
      4. speech activity detection
      5. weak supervision

      Qualifiers

      • Research-article

      Conference

      MM '14
      MM '14: 2014 ACM Multimedia Conference
      November 3 - 7, 2014
      Orlando, Florida, USA

      Acceptance Rates

      MM '14 Paper Acceptance Rate 55 of 286 submissions, 19%;
      Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
