ABSTRACT
This paper addresses large-scale image annotation; most existing methods are devised for small datasets. We propose a novel model based on deep representation learning and tag embedding learning. Specifically, the model simultaneously learns a unified latent space for image visual features and tag embeddings. A metric matrix is introduced to estimate relevance scores between images and tags. Finally, a maximum-margin objective function is proposed that models triplet relationships (irrelevant tag, image, relevant tag). The model can readily handle new images and tags via online learning and has relatively low computational complexity at test time. Experimental results on the NUS-WIDE dataset demonstrate the effectiveness of the proposed model.
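The scoring and training signal described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: a bilinear relevance score x^T M t between an image vector x and a tag embedding t, and a hinge loss over one (irrelevant tag, image, relevant tag) triplet. The function names, the margin value, and the toy vectors are hypothetical.

```python
# Illustrative sketch only. The bilinear score and the hinge form are
# assumptions; the paper's actual objective may differ.

def relevance(image_vec, metric, tag_vec):
    """Assumed bilinear relevance score x^T M t."""
    # projected[i] = sum_j M[i][j] * t[j]
    projected = [sum(m * t for m, t in zip(row, tag_vec)) for row in metric]
    return sum(x * p for x, p in zip(image_vec, projected))


def triplet_hinge_loss(image_vec, metric, relevant_tag, irrelevant_tag, margin=1.0):
    """Zero once the relevant tag outscores the irrelevant one by `margin`."""
    pos = relevance(image_vec, metric, relevant_tag)
    neg = relevance(image_vec, metric, irrelevant_tag)
    return max(0.0, margin - pos + neg)


def rank_tags(image_vec, metric, tag_embeddings):
    """Order tag names by decreasing relevance to the image."""
    scores = {name: relevance(image_vec, metric, vec)
              for name, vec in tag_embeddings.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (hypothetical): a 2-D latent space and an identity metric matrix.
image = [1.0, 0.0]
metric = [[1.0, 0.0], [0.0, 1.0]]
tags = {"sky": [0.9, 0.1], "car": [0.2, 0.8]}

loss = triplet_hinge_loss(image, metric, tags["sky"], tags["car"])
ranking = rank_tags(image, metric, tags)
```

Under these assumptions, annotating a new image at test time reduces to one matrix-vector product per tag followed by a sort, which is consistent with the low test-time cost claimed in the abstract. Because the loss is defined per triplet, it could also be minimized one triplet at a time, which fits the online-learning setting the abstract mentions.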
Index Terms
- Large Scale Image Annotation via Deep Representation Learning and Tag Embedding Learning