DOI: 10.1145/3123266.3123435
research-article

Wheel: Accelerating CNNs with Distributed GPUs via Hybrid Parallelism and Alternate Strategy

Published: 19 October 2017

Abstract

Convolutional Neural Networks (CNNs) have been widely used and achieve impressive performance, typically at the cost of expensive computation. Some methods accelerate CNN training with distributed GPUs, i.e., GPUs deployed across multiple servers. Unfortunately, these methods must transmit large amounts of data among servers, which leads to long data transmission times and long GPU idle times. To this end, we propose a novel hybrid parallelism architecture named "Wheel" that accelerates CNN training by reducing the transmitted data while keeping GPUs fully utilized. Specifically, Wheel first partitions the layers of a CNN into two kinds of modules, a convolutional module and a fully-connected module, and deploys them following the proposed hybrid parallelism. In this way, Wheel transmits only a small fraction of the CNN's parameters across servers and keeps most parameter traffic within a single server, significantly reducing data transmission time. Second, to keep each GPU busy and reduce idle time, Wheel devises an alternate strategy that deploys multiple workers on each GPU: when one worker is suspended to receive data, another worker on the same GPU executes its computing task. The workers on each GPU thus run alternately and repeatedly, like a wheel. Experiments show that the proposed scheme outperforms state-of-the-art parallel approaches.
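
The alternate strategy described above can be made concrete with a small scheduling sketch. The following Python snippet is a minimal illustration under stated assumptions, not the paper's implementation: two workers share one GPU, and whenever one worker is blocked receiving parameters from another server, the other worker acquires the GPU and runs its computing task. All names here (worker, recv_params, compute_step, gpu_lock) are hypothetical.

import threading
import time

# Hypothetical sketch (not the paper's code): two workers share one GPU.
# While one worker waits to receive parameters from another server, the
# other worker holds the GPU and runs its computing task, so the GPU
# stays busy instead of idling during communication.

gpu_lock = threading.Lock()  # only one worker may occupy the GPU at a time

def recv_params(worker_id):
    """Stand-in for receiving parameters over the network (communication-bound)."""
    time.sleep(0.05)  # simulated transmission latency
    return {"weights": f"params-for-worker-{worker_id}"}

def compute_step(worker_id, params, step):
    """Stand-in for a forward/backward pass on the GPU (computation-bound)."""
    with gpu_lock:  # occupy the GPU only while computing
        time.sleep(0.02)  # simulated kernel execution time
        print(f"worker {worker_id}: step {step} using {params['weights']}")

def worker(worker_id, num_steps=3):
    for step in range(num_steps):
        params = recv_params(worker_id)   # GPU is free while this worker waits
        compute_step(worker_id, params, step)

# Two workers per GPU: their communication and computation phases interleave,
# which is the alternating, "wheel-like" behavior described in the abstract.
threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

In a real system the simulated sleeps would be replaced by actual network transfers and GPU kernels, and the number of workers per GPU would be chosen so that communication overlaps with computation.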

Cited By

  • (2022) Optimization and acceleration of convolutional neural networks: A survey. Journal of King Saud University - Computer and Information Sciences 34(7), 4244-4268. https://doi.org/10.1016/j.jksuci.2020.10.004
  • (2020) GPUs Utilization of Residual Network Training for Colon Histopathological Images Classification. 2020 International Conference on Computer Science and Its Application in Agriculture (ICOSICA), 1-8. https://doi.org/10.1109/ICOSICA49951.2020.9243276
  • (2019) Cauchy Matrix Factorization for Tag-Based Social Image Retrieval. IEEE Access 7, 132302-132310. https://doi.org/10.1109/ACCESS.2019.2940598

        Published In

        MM '17: Proceedings of the 25th ACM international conference on Multimedia
        October 2017
        2028 pages
        ISBN:9781450349062
        DOI:10.1145/3123266
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 19 October 2017

        Author Tags

        1. acceleration
        2. alternate strategy
        3. CNN
        4. distributed GPUs
        5. hybrid parallelism

        Qualifiers

        • Research-article

        Funding Sources

        • 973 Program of China

        Conference

        MM '17
        Sponsor:
        MM '17: ACM Multimedia Conference
        October 23 - 27, 2017
        Mountain View, California, USA

        Acceptance Rates

        MM '17 Paper Acceptance Rate 189 of 684 submissions, 28%;
        Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
