DOI: 10.1145/1290067.1290077
Article

Fast unsupervised alignment of video and text for indexing/names and faces

Published: 28 September 2007

Abstract

We propose a novel, unsupervised way of combining weakly associated video/audio and text streams that is faster than conventional speech recognition. The technique aligns the audio/video and text streams, which enables video search using the associated text. Multimedia of this form includes news broadcasts with summaries, parliamentary proceedings and court trials with transcripts, sports telecasts with text commentary, etc. We also show how to annotate the video with the names of the people appearing in it, which allows name-based indexing and search. We test the technique on an 80-minute video segment downloaded from the website of the International Court of the Former Yugoslavia, with the corresponding transcripts. The proposed technique achieves 88.49% accuracy on sentence-level alignments and 95.5% accuracy on the task of assigning names to faces.

Cited By

  • (2018) Multimodal Visual Concept Learning with Weakly Supervised Techniques. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4914-4923. DOI: 10.1109/CVPR.2018.00516. Online publication date: Jun-2018.
  • (2013) Cross-modal alignment for wildlife recognition. Proceedings of the 2nd ACM international workshop on Multimedia analysis for ecological data, pages 9-14. DOI: 10.1145/2509896.2509905. Online publication date: 22-Oct-2013.
  • (2010) Face-and-clothing based people clustering in video content. Proceedings of the international conference on Multimedia information retrieval, pages 295-304. DOI: 10.1145/1743384.1743435. Online publication date: 29-Mar-2010.

    Published In

    MS '07: Workshop on multimedia information retrieval on The many faces of multimedia semantics
    September 2007
    100 pages
    ISBN:9781595937827
    DOI:10.1145/1290067

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. indexing
    2. multimedia alignment
    3. names and faces

    Qualifiers

    • Article

    Conference

    MM07: The 15th ACM International Conference on Multimedia 2007
    September 28, 2007
    Augsburg, Bavaria, Germany
