Optimal multimodal fusion for multimedia data analysis
Source: International Multimedia Conference
Proceedings of the 12th annual ACM international conference on Multimedia
New York, NY, USA
SESSION: Technical session 6: learning in multi-modal data
Pages: 572 - 579  
Year of Publication: 2004
ISBN:1-58113-893-8
Authors
Yi Wu  University of California Santa Barbara, Santa Barbara, CA
Edward Y. Chang  University of California Santa Barbara, Santa Barbara, CA
Kevin Chen-Chuan Chang  University of Illinois at Urbana-Champaign, Urbana, IL
John R. Smith  IBM T.J. Watson Research Center, Hawthorne, NY
Sponsors
SIGMULTIMEDIA: ACM Special Interest Group on Multimedia
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1027527.1027665

ABSTRACT

Considerable research has been devoted to utilizing multimodal features for better understanding multimedia data. However, two core research issues have not yet been adequately addressed. First, given a set of features extracted from multiple media sources (e.g., extracted from the visual, audio, and caption track of videos), how do we determine the best modalities? Second, once a set of modalities has been identified, how do we best fuse them to map to semantics? In this paper, we propose a two-step approach. The first step finds statistically independent modalities from raw features. In the second step, we use super-kernel fusion to determine the optimal combination of individual modalities. We carefully analyze the tradeoffs between three design factors that affect fusion performance: modality independence, curse of dimensionality, and fusion-model complexity. Through analytical and empirical studies, we demonstrate that our two-step approach, which achieves a careful balance of the three design factors, can improve class-prediction accuracy over traditional techniques.
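
As a rough illustration of the two-step idea described in the abstract (not the authors' algorithm), the sketch below groups raw features into modalities by pairwise correlation, which is only a crude stand-in for finding statistically independent modalities. It then trains one simple classifier per group and learns how to combine their outputs. The abstract does not define super-kernel fusion, so the logistic-regression combiner, the nearest-centroid scorers, the 0.5 correlation threshold, the synthetic data, and all function names are assumptions chosen for brevity.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either sequence is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def group_features(X, threshold=0.5):
    """Step 1 stand-in: greedily group features with |correlation| > threshold."""
    cols = list(zip(*X))
    groups = []
    for j in range(len(cols)):
        for g in groups:
            if any(abs(pearson(cols[j], cols[k])) > threshold for k in g):
                g.append(j)
                break
        else:
            groups.append([j])
    return groups

def centroid_scorer(X, y, group):
    """Per-modality classifier: positive score means closer to the class-1 centroid."""
    def centroid(label):
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(r[j] for r in rows) / len(rows) for j in group]
    pos, neg = centroid(1), centroid(0)
    def score(x):
        d_pos = sum((x[j] - p) ** 2 for j, p in zip(group, pos))
        d_neg = sum((x[j] - n) ** 2 for j, n in zip(group, neg))
        return d_neg - d_pos
    return score

def train_fusion(S, y, epochs=200, lr=0.05):
    """Step 2 stand-in: logistic regression over the per-modality scores."""
    w, b = [0.0] * len(S[0]), 0.0
    for _ in range(epochs):
        for s, t in zip(S, y):
            z = sum(wi * si for wi, si in zip(w, s)) + b
            # Clamp z so math.exp cannot overflow on large scores.
            p = 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))
            w = [wi + lr * (t - p) * si for wi, si in zip(w, s)]
            b += lr * (t - p)
    return w, b

def make_data(rng, n):
    """Synthetic data: two independent sources, each seen through two noisy features."""
    X, y = [], []
    for _ in range(n):
        a, c = rng.gauss(0, 1), rng.gauss(0, 1)
        X.append([a + rng.gauss(0, 0.1), a + rng.gauss(0, 0.1),
                  c + rng.gauss(0, 0.1), c + rng.gauss(0, 0.1)])
        y.append(1 if a + c > 0 else 0)
    return X, y

rng = random.Random(0)
X_train, y_train = make_data(rng, 200)
X_test, y_test = make_data(rng, 100)

# Step 1: discover feature groups ("modalities") from the raw features.
groups = group_features(X_train)
# Step 2: one scorer per group, then a learned combination of their outputs.
scorers = [centroid_scorer(X_train, y_train, g) for g in groups]
w, b = train_fusion([[f(x) for f in scorers] for x in X_train], y_train)

def predict(x):
    z = sum(wi * f(x) for wi, f in zip(w, scorers)) + b
    return 1 if z > 0 else 0

accuracy = sum(predict(x) == t for x, t in zip(X_test, y_test)) / len(X_test)
print(groups, accuracy)
```

Keeping each group small limits the dimensionality any single scorer has to handle, while the learned combiner adds only one weight per group. This is one simple way to picture the balance between modality independence, dimensionality, and fusion-model complexity that the abstract mentions.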


