Abstract
Learning robust and representative features across multiple modalities is a fundamental problem in machine learning and multimedia. In this article, we propose a novel MUltimodal Convolutional AutoEncoder (MUCAE) approach to learn representative features from visual and textual modalities. For each modality, we integrate the convolutional operation into an autoencoder framework to learn a joint representation from the original image and text content. We optimize the convolutional autoencoders of the different modalities jointly by exploiting the correlation between their hidden representations, in particular by minimizing both the reconstruction error of each modality and the correlation divergence between the hidden features of different modalities. Compared to conventional solutions that rely on hand-crafted features, the proposed MUCAE approach encodes features directly from image pixels and text characters and produces more representative and robust features. We evaluate MUCAE on cross-media retrieval as well as unimodal classification tasks over real-world large-scale multimedia databases. Experimental results show that MUCAE performs better than state-of-the-art methods.
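The joint objective described above (per-modality reconstruction error plus a correlation divergence between hidden features) can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: mean squared error stands in for the reconstruction error, one minus the Pearson correlation stands in for the correlation divergence, and the names `reconstruction_error`, `pearson`, `joint_loss`, and the weight `lam` are hypothetical. The convolutional encoders and decoders themselves are omitted; the function only combines their outputs.

```python
import math


def reconstruction_error(x, x_hat):
    """Mean squared error between an input vector and its reconstruction
    (assumed form; the abstract does not specify the error measure)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def pearson(u, v):
    """Pearson correlation between two equal-length vectors.
    Returns 0.0 when either vector has zero variance."""
    mu_u = sum(u) / len(u)
    mu_v = sum(v) / len(v)
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    denom = math.sqrt(var_u * var_v)
    return cov / denom if denom > 0 else 0.0


def joint_loss(img, img_hat, txt, txt_hat, h_img, h_txt, lam=1.0):
    """Sum of both reconstruction errors plus a weighted divergence term.

    The divergence 1 - pearson(h_img, h_txt) is an illustrative stand-in:
    it is 0 when the hidden features are perfectly correlated and 2 when
    they are perfectly anti-correlated.
    """
    rec = reconstruction_error(img, img_hat) + reconstruction_error(txt, txt_hat)
    div = 1.0 - pearson(h_img, h_txt)
    return rec + lam * div


if __name__ == "__main__":
    # Toy vectors standing in for image pixels, text characters,
    # their reconstructions, and the two hidden representations.
    loss = joint_loss(
        img=[0.2, 0.4, 0.6], img_hat=[0.1, 0.4, 0.7],
        txt=[1.0, 0.0, 1.0], txt_hat=[0.9, 0.1, 0.8],
        h_img=[0.5, 1.0, 1.5], h_txt=[0.4, 1.1, 1.4],
        lam=0.5,
    )
    print(f"joint loss: {loss:.4f}")
```

In this sketch, a larger `lam` pushes the two hidden representations toward agreement at the expense of reconstruction quality, which is the trade-off that a jointly optimized objective of this kind has to balance.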