
Involving CPUs into Multi-GPU Deep Learning

Published: 30 March 2018

ABSTRACT

The most important part of deep learning, training the neural network, often requires the processing of a large amount of data and can take days to complete. Data parallelism is widely used for training deep neural networks on multiple GPUs in a single machine thanks to its simplicity. However, its scalability is bound by the number of data transfers, mainly for exchanging and accumulating gradients among the GPUs. In this paper, we present a novel approach to data-parallel training, called CPU-GPU data parallel (CGDP) training, that utilizes free CPU time on the host to speed up the training in the GPUs. We also present a cost model for analyzing and comparing the performance of both typical data-parallel training and CPU-GPU data-parallel training. Using the cost model, we formally show why our approach is better than the typical one and clarify the remaining issues. Finally, we explain how we optimized CPU-GPU data-parallel training by introducing chunks of layers and present a runtime algorithm that automatically finds a good configuration for the training. The algorithm is effective for very deep neural networks, which are the current trend in deep learning. Experimental results showed that we achieved speedups of 1.21, 1.04, 1.21, and 1.07 for four state-of-the-art neural networks: AlexNet, GoogLeNet-v1, VGGNet-16, and ResNet-152, respectively. Weak scaling efficiency greater than 90% was achieved for all networks across four GPUs.
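The core idea of CGDP training is to offload gradient accumulation from the GPUs to otherwise-idle CPU cores on the host. The following is a minimal, illustrative sketch in PyTorch, not the authors' implementation: the function name cpu_accumulate_step, the replicas list, and the plain SGD update are assumptions for illustration, and the overlap of CPU-side accumulation with the GPUs' next computation (which the paper's layer-chunking scheme exploits) is omitted for brevity.

```python
# Illustrative sketch only (not the paper's code): data-parallel training where
# gradient accumulation is done on the host CPU instead of on a GPU.
import torch

def cpu_accumulate_step(replicas, inputs, targets, loss_fn, lr=0.01):
    # 1. Each GPU replica computes gradients on its own mini-batch shard.
    for model, (x, y) in zip(replicas, zip(inputs, targets)):
        model.zero_grad()
        loss_fn(model(x), y).backward()

    # 2. Copy each parameter's gradients to the host and average them on the
    #    CPU; in the paper's scheme this frees the GPUs to proceed with the
    #    next chunk of layers while the host accumulates.
    n = len(replicas)
    for params in zip(*(m.parameters() for m in replicas)):
        cpu_avg = sum(p.grad.to('cpu') for p in params) / n

        # 3. Copy the averaged gradient back to each GPU and apply a plain
        #    SGD update (a stand-in for whatever optimizer is actually used).
        for p in params:
            p.grad.copy_(cpu_avg.to(p.device))
            p.data.add_(p.grad, alpha=-lr)
```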


Published in

ICPE '18: Proceedings of the 2018 ACM/SPEC International Conference on Performance Engineering
March 2018
328 pages
ISBN: 9781450350952
DOI: 10.1145/3184407

      Copyright © 2018 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Acceptance Rates

Overall Acceptance Rate: 252 of 851 submissions, 30%

