ABSTRACT
Learning a discriminative fine-grained representation is key to many computer vision applications, such as person re-identification, fine-grained classification, and fine-grained image retrieval. Most previous methods focus on metric learning or ensembles to derive a better global representation, which usually lacks local information. Motivated by this limitation, we propose a novel Attribute-Aware Attention Model ($A^3M$), which learns local attribute representations and a global category representation simultaneously in an end-to-end manner. The proposed model contains two attention modules: the attribute-guided attention module uses attribute information to help select category features in different regions, while the category-guided attention module selects local features of different attributes with the help of category cues. Through this attribute-category reciprocal process, local and global features benefit from each other. As a result, the final feature contains more intrinsic information for image recognition rather than noisy and irrelevant features. Extensive experiments conducted on Market-1501, CompCars, CUB-200-2011 and CARS196 demonstrate the effectiveness of $A^3M$.
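The reciprocal attention idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes plain dot-product soft attention, a mean over attribute vectors as the attribute cue, and concatenation as the final fusion step, none of which is stated in the abstract. The names `attend` and `reciprocal_attention` and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(candidates, guide):
    """Soft attention: weight each candidate vector by its similarity to `guide`.

    Returns the weighted sum of the candidates and the attention weights.
    """
    weights = softmax([dot(c, guide) for c in candidates])
    dim = len(candidates[0])
    pooled = [sum(w * c[i] for w, c in zip(weights, candidates)) for i in range(dim)]
    return pooled, weights


def reciprocal_attention(region_feats, attr_feats, category_feat):
    """Hypothetical sketch of attribute-guided and category-guided attention."""
    # Attribute-guided attention: an attribute summary (assumed here to be the
    # mean attribute vector) selects which image regions feed the category feature.
    n = len(attr_feats)
    dim = len(attr_feats[0])
    attr_summary = [sum(a[i] for a in attr_feats) / n for i in range(dim)]
    category_attended, region_w = attend(region_feats, attr_summary)

    # Category-guided attention: the category cue selects which attribute
    # features are most relevant.
    attr_attended, attr_w = attend(attr_feats, category_feat)

    # Assumed fusion: concatenate the two attended vectors.
    return category_attended + attr_attended, region_w, attr_w


# Toy inputs: three region features, two attribute features, one category cue.
region_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attr_feats = [[2.0, 0.0], [0.0, 0.5]]
category_feat = [1.0, 0.0]

feature, region_w, attr_w = reciprocal_attention(region_feats, attr_feats, category_feat)
```

In this toy setup the attribute summary points mostly along the first dimension, so the third region receives the largest weight, and the category cue favors the first attribute. In a trained model the guide vectors and features would be learned jointly rather than fixed by hand.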
Supplemental Material
The supplemental material is provided in PDF form in this folder. The source file is appendix.tex.
Index Terms
- Attribute-Aware Attention Model for Fine-grained Representation Learning