DOI: 10.1145/1873951.1873987
research-article

A new approach to cross-modal multimedia retrieval

Published: 25 October 2010

ABSTRACT

The problem of jointly modeling the text and image components of multimedia documents is studied. The text component is represented as a sample from a hidden topic model, learned with latent Dirichlet allocation, and images are represented as bags of visual (SIFT) features. Two hypotheses are investigated: that 1) there is a benefit to explicitly modeling correlations between the two components, and 2) this modeling is more effective in feature spaces with higher levels of abstraction. Correlations between the two components are learned with canonical correlation analysis. Abstraction is achieved by representing text and images at a more general, semantic level. The two hypotheses are studied in the context of the task of cross-modal document retrieval. This includes retrieving the text that most closely matches a query image, or retrieving the images that most closely match a query text. It is shown that accounting for cross-modal correlations and semantic abstraction both improve retrieval accuracy. The cross-modal model is also shown to outperform state-of-the-art image retrieval systems on a unimodal retrieval task.
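
The correlation-matching idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes synthetic feature vectors in place of LDA topic proportions and SIFT bag-of-words histograms, it omits the semantic abstraction step, and the names fit_cca, text, image, and the dimensions are hypothetical. It fits canonical correlation analysis with plain NumPy, projects both modalities into the shared subspace, and retrieves the nearest text for each query image.

```python
import numpy as np


def fit_cca(X, Y, n_components, reg=1e-6):
    """Fit CCA by whitening each view and taking an SVD of the cross-covariance."""
    mean_x, mean_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mean_x, Y - mean_y
    n = X.shape[0]
    # A small ridge term (an assumption of this sketch) keeps Cholesky well defined.
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)
    Lx = np.linalg.cholesky(Cxx)  # Cxx = Lx @ Lx.T
    Ly = np.linalg.cholesky(Cyy)  # Cyy = Ly @ Ly.T
    # Whitened cross-covariance: Lx^{-1} Cxy Ly^{-T}.
    T = np.linalg.solve(Lx, Cxy)
    T = np.linalg.solve(Ly, T.T).T
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    # Map the singular vectors back to the original feature spaces.
    Wx = np.linalg.solve(Lx.T, U[:, :n_components])
    Wy = np.linalg.solve(Ly.T, Vt[:n_components].T)
    return mean_x, mean_y, Wx, Wy, s[:n_components]


# Synthetic paired documents: both modalities are noisy linear views of a
# shared latent vector (hypothetical stand-ins for topic and visual features).
rng = np.random.default_rng(0)
n_docs, n_latent, dim_text, dim_image = 200, 3, 10, 8
Z = rng.standard_normal((n_docs, n_latent))
text = Z @ rng.standard_normal((n_latent, dim_text))
text += 0.01 * rng.standard_normal((n_docs, dim_text))
image = Z @ rng.standard_normal((n_latent, dim_image))
image += 0.01 * rng.standard_normal((n_docs, dim_image))

train, test = slice(0, 150), slice(150, 200)
mean_t, mean_i, W_text, W_image, corrs = fit_cca(text[train], image[train], n_latent)

# Project held-out texts and images into the shared canonical subspace.
proj_text = (text[test] - mean_t) @ W_text
proj_image = (image[test] - mean_i) @ W_image

# Image-to-text retrieval: nearest projected text for each query image.
dists = np.linalg.norm(proj_image[:, None, :] - proj_text[None, :, :], axis=2)
retrieved = dists.argmin(axis=1)
accuracy = float((retrieved == np.arange(len(retrieved))).mean())
print("canonical correlations:", corrs)
print("image-to-text top-1 accuracy:", accuracy)
```

Text-to-image retrieval follows by swapping the roles of proj_text and proj_image. Euclidean distance is used here only for simplicity; other distance measures in the projected space are equally plausible choices.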

Published in

MM '10: Proceedings of the 18th ACM international conference on Multimedia
October 2010
1836 pages
ISBN: 9781605589336
DOI: 10.1145/1873951

Copyright © 2010 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%
