
Cross-Modality Feature Learning via Convolutional Autoencoder

Published: 24 January 2019

Abstract

Learning robust and representative features across multiple modalities has been a fundamental problem in the machine learning and multimedia fields. In this article, we propose a novel MUltimodal Convolutional AutoEncoder (MUCAE) approach to learn representative features from the visual and textual modalities. For each modality, we integrate the convolutional operation into an autoencoder framework to learn a joint representation from the original image and text content. We optimize the convolutional autoencoders of the different modalities jointly by exploiting the correlation between their hidden representations, in particular by minimizing both the reconstruction error of each modality and the correlation divergence between the hidden features of the different modalities. Compared to conventional solutions relying on hand-crafted features, the proposed MUCAE approach encodes features directly from image pixels and text characters and produces more representative and robust features. We evaluate MUCAE on cross-media retrieval as well as unimodal classification tasks over real-world, large-scale multimedia databases. Experimental results show that MUCAE performs better than state-of-the-art methods.
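
To make the stated objective concrete, here is a minimal sketch of the joint training loss, assuming PyTorch. The layer shapes (32×32 inputs), the use of mean-squared error for both terms, and the weight `lam` are illustrative assumptions, not the paper's exact design; in practice the text branch would convolve over character embeddings rather than a 2D grid.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvAutoencoder(nn.Module):
    """Toy convolutional autoencoder for one modality (assumes 32x32 inputs)."""
    def __init__(self, channels: int, hidden: int = 128):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(channels, 32, 3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),        # 16 -> 8
        )
        self.to_hidden = nn.Linear(64 * 8 * 8, hidden)
        self.from_hidden = nn.Linear(hidden, 64 * 8 * 8)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, channels, 3, stride=2, padding=1, output_padding=1),
        )

    def forward(self, x):
        z = self.to_hidden(self.enc(x).flatten(1))  # hidden representation
        x_hat = self.dec(self.from_hidden(z).view(-1, 64, 8, 8))
        return x_hat, z

def joint_loss(img_ae, txt_ae, img, txt, lam=0.1):
    """Per-modality reconstruction error plus a cross-modal divergence term."""
    img_hat, z_img = img_ae(img)
    txt_hat, z_txt = txt_ae(txt)
    rec = F.mse_loss(img_hat, img) + F.mse_loss(txt_hat, txt)
    # Correlation divergence, approximated here as the distance between the
    # hidden codes of a paired image-text example.
    div = F.mse_loss(z_img, z_txt)
    return rec + lam * div
```

Minimizing `joint_loss` over paired image-text examples trains both autoencoders at once: the reconstruction terms keep each hidden code faithful to its own modality, while the divergence term pulls the two codes of the same example together.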



      Reviews

      Kalman Balogh

This paper contributes to a hot research area that is the focus of many scientists, developers, and large corporations. The interest stems from the fact that many important systems, for instance, social media and data collection platforms, produce large-scale multimedia datasets. Analysis based on so-called "hand-crafted features" is poorly suited to many non-numeric data types, such as text or pictures; for such data, interesting features can instead be learned from the data itself. Different kinds of cross-modal feature learning are used in the analysis of heterogeneous datasets and data streams. Deep learning methods, among others, have been developed both for autoencoding a single data type (aiming at feature learning) and for the coordinated analysis of the learned component features of heterogeneous data.

For this purpose, the authors develop a sophisticated convolutional neural network (CNN), called the multimodal convolutional autoencoder (MUCAE), further developing some existing architectures. They evaluate the method by learning representative features from two modalities: pictures represented by image pixels, and text represented by characters. To exploit the correlation between the hidden representations of the two modalities, the unified framework integrates an autoencoder with an objective function; the system jointly minimizes the representation learning error of each modality and the correlation divergence between the modalities. The authors define the problem and describe the solution at an abstract level, presenting the mathematical reasoning and the 11 layers of their CNN. There is no reference to the implementation environment; one can only presume that some of the powerful and popular tools and packages are used. Some related work on multimodal, supervised, and unsupervised deep feature learning is enumerated.

The paper contains precise figures on the efficiency of the implementation on two datasets: MIRFlickr and a subset of NUS-WIDE. These results are compared to those of five earlier systems developed over the past decade. According to these results, MUCAE outperformed the others by two to ten percent on joint character-picture data analysis. The main parameters used in the algorithm are discussed, but the behavior of the method as a function of the size of the input dataset is not investigated. The conclusion summarizes the approach, the method, the experiments, and the results, making the abstract's claims somewhat more concrete. Neither further development of the method nor future directions are discussed. I recommend the paper only for active specialists in the area.


      • Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 1s
Special Section on Deep Learning for Intelligent Multimedia Analytics and Special Section on Multi-Modal Understanding of Social, Affective and Subjective Attributes of Data
January 2019, 265 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3309769

        Copyright © 2019 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 24 January 2019
        • Accepted: 1 June 2018
        • Revised: 1 April 2018
        • Received: 1 October 2017
Published in TOMM Volume 15, Issue 1s


        Qualifiers

        • research-article
        • Research
        • Refereed
