ABSTRACT
The problem of jointly modeling the text and image components of multimedia documents is studied. The text component is represented as a sample from a hidden topic model, learned with latent Dirichlet allocation, and images are represented as bags of visual (SIFT) features. Two hypotheses are investigated: that 1) there is a benefit to explicitly modeling correlations between the two components, and 2) this modeling is more effective in feature spaces with higher levels of abstraction. Correlations between the two components are learned with canonical correlation analysis. Abstraction is achieved by representing text and images at a more general, semantic level. The two hypotheses are studied in the context of the task of cross-modal document retrieval. This includes retrieving the text that most closely matches a query image, or retrieving the images that most closely match a query text. It is shown that accounting for cross-modal correlations and using semantic abstraction each improve retrieval accuracy. The cross-modal model is also shown to outperform state-of-the-art image retrieval systems on a unimodal retrieval task.
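The correlation-matching idea can be illustrated with a minimal sketch. It is not the authors' implementation: the synthetic features stand in for the LDA topic vectors and SIFT histograms, and the regularization term, the cosine ranking, and every function name (`make_toy_pairs`, `fit_cca`, `rank_by_cosine`) are assumptions made for illustration. Canonical correlation analysis is computed here with NumPy by whitening each modality and taking an SVD of the cross-covariance.

```python
import numpy as np

def make_toy_pairs(n=200, latent=4, d_text=10, d_img=12, noise=0.01, seed=0):
    """Synthetic stand-in for paired (text, image) features sharing a latent cause."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, latent))
    text = z @ rng.standard_normal((latent, d_text))
    text += noise * rng.standard_normal((n, d_text))
    img = z @ rng.standard_normal((latent, d_img))
    img += noise * rng.standard_normal((n, d_img))
    return text, img

def fit_cca(X, Y, n_components, reg=1e-3):
    """Regularized CCA: returns means, projections, and canonical correlations."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    # Covariances; `reg` (an assumption) keeps the Cholesky factors well defined.
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    # Whitened cross-covariance M = Lx^{-1} Cxy Ly^{-T}.
    M = np.linalg.solve(Lx, Cxy)
    M = np.linalg.solve(Ly, M.T).T
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Map the singular vectors back to the original feature spaces.
    Wx = np.linalg.solve(Lx.T, U[:, :n_components])
    Wy = np.linalg.solve(Ly.T, Vt.T[:, :n_components])
    return mx, my, Wx, Wy, s[:n_components]

def rank_by_cosine(query, gallery):
    """Indices of gallery rows, most similar to the query first."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(g @ q))

text, img = make_toy_pairs()
mx, my, Wx, Wy, corr = fit_cca(text, img, n_components=4)
text_proj = (text - mx) @ Wx   # texts in the shared subspace
img_proj = (img - my) @ Wy     # images in the shared subspace
# Image-to-text retrieval: rank all texts for the first image.
ranking = rank_by_cosine(img_proj[0], text_proj)
```

Once both modalities are projected into the same subspace, text-to-image retrieval is the symmetric call, for example `rank_by_cosine(text_proj[0], img_proj)`.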
A new approach to cross-modal multimedia retrieval