Joint visual-text modeling for automatic retrieval of multimedia documents
Source: International Multimedia Conference
Proceedings of the 13th annual ACM international conference on Multimedia
Hilton, Singapore
SESSION: Content 1: news video processing
Pages: 21 - 30
Year of Publication: 2005
ISBN: 1-59593-044-2
Authors
G. Iyengar  IBM TJ Watson Research Center
P. Duygulu  Bilkent University
S. Feng  University of Massachusetts, Amherst
P. Ircing  Univ. West Bohemia
S. P. Khudanpur  Johns Hopkins University
D. Klakow  Saarland University
M. R. Krause  Georgetown University
R. Manmatha  University of Massachusetts, Amherst
H. J. Nock  IBM TJ Watson Research Center
D. Petkova  Mt. Holyoke College
B. Pytlik  Johns Hopkins University
P. Virga  Johns Hopkins University
Sponsors
ACM: Association for Computing Machinery
SIGGRAPH: ACM Special Interest Group on Computer Graphics and Interactive Techniques
SIGMULTIMEDIA: ACM Special Interest Group on Multimedia
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1101149.1101154

ABSTRACT

In this paper we describe a novel approach for jointly modeling the text and visual components of multimedia documents for information retrieval (IR). We propose a framework in which individual components are developed to model different relationships between documents and queries and are then combined into a joint retrieval framework. In state-of-the-art systems, the norm is a late combination of two independent systems: one analyzes only the text part of such documents, and the other analyzes the visual part without leveraging any knowledge acquired in the text processing. Such systems rarely exceed the performance of any single modality (i.e., text or video) in information retrieval tasks. Our experiments indicate that allowing a rich interaction between the modalities results in a significant improvement in performance over any single modality. We demonstrate these results using the TRECVID03 corpus, which comprises 120 hours of broadcast news videos. Our results demonstrate an improvement of over 14% in IR performance over the best reported text-only baseline and rank amongst the best results reported on this corpus.
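
The late-combination baseline that the abstract contrasts against can be illustrated with a minimal sketch. This is not the joint model proposed in the paper; it only shows one common way to merge two independently produced score lists, here a log-linear mix. The function name `late_fusion`, the `text_weight` and `floor` parameters, and the shot identifiers are all assumptions made for illustration.

```python
import math


def late_fusion(text_scores, visual_scores, text_weight=0.7, floor=1e-9):
    """Rank documents by a log-linear mix of two independent scores.

    text_scores and visual_scores map a document id to a positive
    relevance score produced by separate, non-interacting systems.
    All names and default values here are hypothetical.
    """
    doc_ids = set(text_scores) | set(visual_scores)
    fused = {}
    for doc_id in doc_ids:
        # A document missing from one system gets a small floor score
        # so that the logarithm stays defined.
        t = max(text_scores.get(doc_id, 0.0), floor)
        v = max(visual_scores.get(doc_id, 0.0), floor)
        fused[doc_id] = (text_weight * math.log(t)
                         + (1.0 - text_weight) * math.log(v))
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)


# Toy scores for three video shots (made-up values).
text = {"shot1": 0.9, "shot2": 0.2, "shot3": 0.5}
visual = {"shot1": 0.1, "shot2": 0.8, "shot3": 0.5}
ranking = late_fusion(text, visual)
print(ranking)
```

In this sketch neither scorer sees the other's evidence; the modalities meet only at the final weighted sum, which is the limitation the abstract attributes to late combination.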

