Fast unsupervised alignment of video and text for indexing/names and faces

Authors: Subhransu Maji, Ruzena Bajcsy

MS '07: Workshop on multimedia information retrieval on The many faces of multimedia semantics

Pages 57 - 64

https://doi.org/10.1145/1290067.1290077

Published: 28 September 2007

Abstract

We propose a novel way of combining weakly associated video/audio and text streams in an unsupervised manner that is faster than conventional speech recognition. The technique aligns the audio/video and text streams, which enables video search using the associated text. Multimedia of this form includes news broadcasts with summaries, parliamentary proceedings and court trials with transcripts, sports telecasts with text commentary, etc. We also show how to annotate the video with the names of the people appearing in it, which allows name-based indexing and search. We test the technique on an 80-minute video segment downloaded from the website of the International Court of the Former Yugoslavia, with the corresponding transcripts. The proposed technique achieves 88.49% accuracy on sentence-level alignments and 95.5% accuracy on the task of assigning names to faces.
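
To illustrate why a sentence-level alignment enables video search, the following is a minimal sketch, not the authors' implementation. It assumes the alignment step has already produced, for each transcript sentence, a start and end time in the video. The `AlignedSentence` type, its field names, the `search` helper, and the sample data are all hypothetical and made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class AlignedSentence:
    """One transcript sentence with the video interval it was aligned to."""
    start_s: float  # start time in seconds (hypothetical field)
    end_s: float    # end time in seconds (hypothetical field)
    text: str


def search(alignment, query):
    """Return (start_s, end_s) intervals whose sentence contains the query word."""
    q = query.lower()
    hits = []
    for s in alignment:
        # Lowercase, split on whitespace, and strip surrounding punctuation.
        words = [w.strip(".,;:!?\"'()") for w in s.text.lower().split()]
        if q in words:
            hits.append((s.start_s, s.end_s))
    return hits


# Made-up alignment output, for illustration only.
alignment = [
    AlignedSentence(0.0, 4.2, "The court is now in session."),
    AlignedSentence(4.2, 9.8, "The witness may be seated."),
    AlignedSentence(9.8, 15.1, "Please state your name for the court."),
]

print(search(alignment, "court"))
```

Once each sentence carries a time interval, a text query reduces to an ordinary keyword lookup whose results are positions in the video. A name-based index could be built the same way if the aligned records also carried the names assigned to faces.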

Cited By

Bouritsas G, Koutras P, Zlatintsi A, Maragos P (2018). Multimodal Visual Concept Learning with Weakly Supervised Techniques. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4914-4923. Online publication date: Jun-2018.
https://doi.org/10.1109/CVPR.2018.00516
Dusart T, Nurani Venkitasubramanian A, Moens M, Spampinato C, Mezaris V, van Ossenbruggen J (2013). Cross-modal alignment for wildlife recognition. Proceedings of the 2nd ACM international workshop on Multimedia analysis for ecological data, pages 9-14. Online publication date: 22-Oct-2013.
https://dl.acm.org/doi/10.1145/2509896.2509905
El Khoury E, Senac C, Joly P, Wang J, Boujemaa N, Ramirez N, Natsev A (2010). Face-and-clothing based people clustering in video content. Proceedings of the international conference on Multimedia information retrieval, pages 295-304. Online publication date: 29-Mar-2010.
https://dl.acm.org/doi/10.1145/1743384.1743435

Index Terms

Fast unsupervised alignment of video and text for indexing/names and faces
1. Applied computing
  1. Document management and text processing

Published In

MS '07: Workshop on multimedia information retrieval on The many faces of multimedia semantics

September 2007

100 pages

ISBN: 9781595937827

DOI: 10.1145/1290067

General Chairs:
Farshad Fotouhi, Wayne State University, USA
William Grosky, University of Michigan-Dearborn, USA
Peter Stanchev, Kettering University, USA

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

MM07: The 15th ACM International Conference on Multimedia 2007

September 28, 2007

Augsburg, Bavaria, Germany
