Research Article | Open Access
DOI: 10.1145/3368474.3368498

Rethinking the Value of Asynchronous Solvers for Distributed Deep Learning

Published: 15 January 2020

ABSTRACT

In recent years, the field of machine learning has seen significant advances as data becomes more abundant and deep learning models become larger and more complex. However, these improvements in accuracy [2] have come at the cost of longer training time. As a result, state-of-the-art models like OpenAI's GPT-2 [18] or AlphaZero [20] require the use of distributed systems or clusters in order to speed up training. Currently, there exist both asynchronous and synchronous solvers for distributed training. In this paper, we implement state-of-the-art asynchronous and synchronous solvers, then conduct a comparison between them to help readers pick the most appropriate solver for their own applications. We address three main challenges: (1) implementing asynchronous solvers that can outperform six common algorithm variants, (2) achieving state-of-the-art distributed performance for various applications with different computational patterns, and (3) maintaining accuracy for large-batch asynchronous training. For asynchronous algorithms, we implement an algorithm called EA-wild, which combines the idea of non-locking wild updates from Hogwild! [19] with EASGD. Our implementation is able to scale to 217,600 cores and finish 90 epochs of training the ResNet-50 model on ImageNet in 15 minutes (the baseline takes 29 hours on eight NVIDIA P100 GPUs). We conclude that more complex models (e.g., ResNet-50) favor synchronous methods, while our asynchronous solver outperforms the synchronous solver for models with a low computation-communication ratio. The results are documented in this paper; for more results, readers can refer to our supplemental website 1.
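To make the EA-wild idea above concrete, the sketch below illustrates an EASGD-style update in which each worker also writes to a shared center variable without any locking, in the spirit of Hogwild!. This is a minimal single-process illustration, not the paper's implementation: the toy quadratic loss, the hyperparameters (learning rate, elastic penalty rho), and the thread-based sharing are assumptions for exposition only, whereas the paper's solver runs across distributed nodes at much larger scale.

```python
# Illustrative sketch of an EASGD-style update with lock-free ("wild") writes to
# the shared center variable, in the spirit of EA-wild (Hogwild! + EASGD).
# The toy loss, hyperparameters, and thread-based sharing are assumptions for
# illustration; the paper's solver targets distributed clusters, not one process.
import threading
import numpy as np

DIM, STEPS, WORKERS = 10, 500, 4
LR, RHO = 0.05, 0.1                    # learning rate and elastic penalty

def grad(x):
    """Gradient of a toy quadratic loss ||x - 1||^2 / 2; stands in for a minibatch gradient."""
    return x - 1.0

def worker(center, seed):
    rng = np.random.default_rng(seed)
    local = rng.normal(size=DIM)       # worker-local model parameters
    for _ in range(STEPS):
        diff = local - center          # elastic difference to the center variable
        local -= LR * (grad(local) + RHO * diff)   # local EASGD step
        # "Wild" update: the shared center is modified in place with no lock,
        # so writes from different workers may interleave (Hogwild!-style).
        center += LR * RHO * diff

center = np.zeros(DIM)                 # shared center variable
threads = [threading.Thread(target=worker, args=(center, i)) for i in range(WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("center after lock-free elastic averaging:", center[:3])  # approaches 1.0
```

In a real distributed deployment the center variable would live on parameter servers and the local steps would run on separate nodes, but the property illustrated here is the same: the elastic term pulls workers and the center toward each other while updates to the center are applied without synchronization.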

References

  1. T. Akiba, S. Suzuki, and K. Fukuda. Extremely large minibatch SGD: Training ResNet-50 on ImageNet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.
  2. D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. arXiv preprint arXiv:1512.02595, 2015.
  3. W. M. Czarnecki, R. Pascanu, S. Osindero, S. M. Jayakumar, G. Swirszcz, and M. Jaderberg. Distilling policy distillation. arXiv preprint arXiv:1902.02186, 2019.
  4. J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223--1231, 2012.
  5. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), pages 248--255. IEEE, 2009.
  6. P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  7. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770--778, 2016.
  8. G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
  9. F. N. Iandola, M. W. Moskewicz, K. Ashraf, and K. Keutzer. FireCaffe: Near-linear acceleration of deep neural network training on compute clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2592--2600, 2016.
  10. N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
  11. A. Krizhevsky. Learning multiple layers of features from tiny images. 2009.
  12. J. Langford, A. J. Smola, and M. Zinkevich. Slow learners are fast. In Proceedings of the 22nd International Conference on Neural Information Processing Systems, pages 2331--2339. Curran Associates Inc., 2009.
  13. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278--2324, 1998.
  14. H. Mikami, H. Suganuma, Y. Tanaka, Y. Kageyama, et al. ImageNet/ResNet-50 training in 224 seconds. arXiv preprint arXiv:1811.05233, 2018.
  15. V. Mnih, A. Puigdomènech Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783, 2016.
  16. A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. De Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, S. Legg, V. Mnih, K. Kavukcuoglu, and D. Silver. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296, 2015.
  17. F. Niu, B. Recht, C. Ré, and S. J. Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. 2011.
  18. A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. 2019.
  19. B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693--701, 2011.
  20. D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
  21. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1--9, 2015.
  22. M. Yamazaki, A. Kasagi, A. Tabuchi, T. Honda, M. Miwa, N. Fukumoto, T. Tabaru, A. Ike, and K. Nakashima. Yet another accelerated SGD: ResNet-50 training on ImageNet in 74.7 seconds. arXiv preprint arXiv:1903.12650, 2019.
  23. Y. You, A. Buluç, and J. Demmel. Scaling deep learning on GPU and Knights Landing clusters. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, page 9. ACM, 2017.
  24. Y. You, I. Gitman, and B. Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017.
  25. Y. You, I. Gitman, and B. Ginsburg. Scaling SGD batch size to 32K for ImageNet training. arXiv preprint arXiv:1708.03888, 2017.
  26. Y. You, Z. Zhang, C. Hsieh, J. Demmel, and K. Keutzer. ImageNet training in minutes. arXiv preprint arXiv:1709.05011, 2017.
  27. S. Zhang, A. E. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems, pages 685--693, 2015.

Published in

HPCAsia '20: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region
January 2020, 247 pages
ISBN: 9781450372367
DOI: 10.1145/3368474

            Copyright © 2020 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

            Publisher

            Association for Computing Machinery

            New York, NY, United States


            Acceptance Rates

Overall acceptance rate: 69 of 143 submissions, 48%
