research-article

Cross-Modality Feature Learning via Convolutional Autoencoder

Authors:
Xueliang Liu, Hefei University of Technology, Hefei, Anhui, China
Meng Wang, Hefei University of Technology, Hefei, Anhui, China (ORCID: 0000-0002-3094-7735)
Zheng-Jun Zha, University of Science and Technology of China, Hefei, Anhui, China
Richang Hong, Hefei University of Technology, Hefei, Anhui, China

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 1s, Article No.: 7, pp 1–20. https://doi.org/10.1145/3231740

Published: 24 January 2019

Abstract

Learning robust and representative features across multiple modalities is a fundamental problem in machine learning and multimedia. In this article, we propose a novel MUltimodal Convolutional AutoEncoder (MUCAE) approach to learn representative features from visual and textual modalities. For each modality, we integrate the convolutional operation into an autoencoder framework to learn a joint representation from the original image and text content. We optimize the convolutional autoencoders of the different modalities jointly by exploiting the correlation between their hidden representations, in particular by minimizing both the reconstruction error of each modality and the correlation divergence between the hidden features of different modalities. Compared to conventional solutions that rely on hand-crafted features, the proposed MUCAE approach encodes features directly from image pixels and text characters and produces more representative and robust features. We evaluate MUCAE on cross-media retrieval as well as unimodal classification tasks over real-world large-scale multimedia databases. Experimental results show that MUCAE performs better than state-of-the-art methods.
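
The joint objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' formulation: it assumes mean squared error as the reconstruction error, squared Euclidean distance between paired hidden codes as a stand-in for the correlation divergence, and a hypothetical trade-off weight `alpha`. The function and parameter names are illustrative only.

```python
from typing import Sequence

Vector = Sequence[float]
Batch = Sequence[Vector]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def joint_loss(
    img: Batch, img_rec: Batch,   # image inputs and their reconstructions
    txt: Batch, txt_rec: Batch,   # text inputs and their reconstructions
    h_img: Batch, h_txt: Batch,   # hidden codes of paired image/text samples
    alpha: float = 0.5,           # hypothetical trade-off weight (assumption)
) -> float:
    """Weighted sum of per-modality reconstruction error and a divergence
    between the hidden codes of paired samples, averaged over the batch."""
    n = len(img)
    # Reconstruction term: one error per modality, summed per pair.
    rec = sum(mse(img[i], img_rec[i]) + mse(txt[i], txt_rec[i]) for i in range(n)) / n
    # Assumed divergence term: distance between paired hidden codes.
    div = sum(sq_dist(h_img[i], h_txt[i]) for i in range(n)) / n
    return (1.0 - alpha) * rec + alpha * div
```

Under these assumptions the loss is zero only when both modalities are reconstructed exactly and the paired hidden codes coincide; in a real model the reconstructions and hidden codes would come from the two convolutional autoencoders, and the loss would be minimized by gradient descent over their parameters.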

Index Terms

Computing methodologies → Machine learning → Machine learning approaches → Neural networks
Information systems → Information retrieval → Document representation → Content analysis and feature selection

Reviews

Reviewer: Kalman Balogh

This paper contributes to an active research area that draws the attention of many scientists, developers, and large corporations. The reason for the interest is that many important systems, for instance, those for social media or data collection, produce large-scale multimedia datasets. Analysis based on so-called "handcrafted features" is poorly suited to many non-numeric data types, such as text or pictures; for these types, useful features can instead be learned from the data itself. Various forms of cross-modal feature learning are used to analyze heterogeneous datasets and data streams. Deep learning methods, among others, have been developed both for autoencoding a single data type (with the aim of feature learning) and for the coordinated analysis of the features learned from each component of heterogeneous data.

For this purpose, the authors develop a sophisticated convolutional neural network (CNN), called the multimodal convolutional autoencoder (MUCAE), and extend some existing architectures. They evaluate the method by learning representative features from two modalities: pictures, represented by image pixels, and text, represented by characters. To exploit the correlation between the hidden representations from the two modalities, the unified framework integrates an autoencoder and an objective function. The system jointly minimizes the representation learning error of each modality and the correlation divergence between different modalities. The authors define the problem and describe the solution at an abstract level, presenting the mathematical reasoning and the 11 levels of their CNN. The implementation environment is not described; one can only presume that some of the popular, powerful tools and packages were used. Some related work on multimodal, supervised, and unsupervised deep feature learning is enumerated.

The paper reports precise figures on the efficiency of the implementation on two datasets: MIRFlickr and a subset of NUS-WIDE. These results are compared to those of five earlier systems developed over the past decade. According to these results, MUCAE outperformed the others by two to ten percent for joint character-picture data analysis. The main parameters used in the algorithm are discussed. How the method behaves as the size of the input dataset varies is not investigated. The paper's conclusion, which makes the essence of the abstract slightly more concrete, summarizes the approach, the method, the experiments, and the results. Neither planned further developments of the method nor future directions are discussed. I recommend the paper only for active specialists in the area.

Published in
ACM Transactions on Multimedia Computing, Communications, and Applications Volume 15, Issue 1s
Special Section on Deep Learning for Intelligent Multimedia Analytics and Special Section on Multi-Modal Understanding of Social, Affective and Subjective Attributes of Data
January 2019
265 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3309769
Editor:
Alberto Del Bimbo
University of Firenze, Italy
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 24 January 2019
- Accepted: 1 June 2018
- Revised: 1 April 2018
- Received: 1 October 2017

Author Tags
Cross modality
convolutional autoencoder
feature learning
Qualifiers
- research-article
- Research
- Refereed

Article Metrics
- Total Citations: 24
- Total Downloads: 642
- Downloads (Last 12 months): 29
- Downloads (Last 6 weeks): 2