ABSTRACT
In recent years, the field of machine learning has seen significant advances as data has become more abundant and deep learning models have become larger and more complex. However, these improvements in accuracy [2] have come at the cost of longer training times. As a result, state-of-the-art models like OpenAI's GPT-2 [18] or AlphaZero [20] require distributed systems or clusters to speed up training. Both asynchronous and synchronous solvers currently exist for distributed training. In this paper, we implement state-of-the-art asynchronous and synchronous solvers and compare them to help readers pick the most appropriate solver for their own applications. We address three main challenges: (1) implementing asynchronous solvers that can outperform six common algorithm variants, (2) achieving state-of-the-art distributed performance for applications with different computational patterns, and (3) maintaining accuracy for large-batch asynchronous training. For asynchronous algorithms, we implement EA-wild, which combines the idea of non-locking "wild" updates from Hogwild! [19] with elastic averaging SGD (EASGD) [27]. Our implementation scales to 217,600 cores and finishes 90 epochs of ResNet-50 training on ImageNet in 15 minutes (the baseline takes 29 hours on eight NVIDIA P100 GPUs). We conclude that more complex models (e.g., ResNet-50) favor synchronous methods, while our asynchronous solver outperforms the synchronous solver for models with a low computation-communication ratio. The results are documented in this paper; for more results, readers can refer to our supplemental website.¹
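To make the EA-wild update rule concrete, the sketch below is a minimal, single-machine illustration of the idea: each worker performs the EASGD local step x_i ← x_i − η(∇f(x_i) + ρ(x_i − x̃)) while the shared center variable x̃ is updated Hogwild!-style, without locks [19, 27]. This is not the authors' cluster implementation; the toy least-squares objective, the hyperparameters `eta` and `rho`, and the thread count are all illustrative assumptions.

```python
import numpy as np
from threading import Thread

# Toy least-squares problem standing in for a neural-network loss.
rng = np.random.default_rng(0)
A = rng.normal(size=(256, 10))
b = rng.normal(size=256)

def grad(x, idx):
    """Stochastic gradient of 0.5 * ||A @ x - b||^2 on minibatch `idx`."""
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ x - bi) / len(idx)

center = np.zeros(10)              # shared center variable x~, no lock
eta, rho, steps = 0.05, 0.1, 500   # illustrative hyperparameters

def worker(seed):
    local_rng = np.random.default_rng(seed)
    x = center.copy()              # each worker keeps a local replica x_i
    for _ in range(steps):
        idx = local_rng.integers(0, len(A), size=32)
        diff = x - center          # elastic difference, read without locking
        x -= eta * (grad(x, idx) + rho * diff)
        # Hogwild!-style "wild" update of the shared center variable:
        # concurrent, unsynchronized writes are tolerated by design.
        np.add(center, eta * rho * diff, out=center)

threads = [Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("center variable after EA-wild-style training:", center)
```

Note that in CPython the global interpreter lock serializes much of this loop, so the sketch illustrates the update rule rather than real parallel throughput; the solver described in the paper runs across distributed nodes rather than threads.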
REFERENCES
[1] T. Akiba, S. Suzuki, and K. Fukuda. Extremely large minibatch SGD: Training ResNet-50 on ImageNet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.
[2] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. arXiv preprint arXiv:1512.02595, 2015.
[3] W. M. Czarnecki, R. Pascanu, S. Osindero, S. M. Jayakumar, G. Swirszcz, and M. Jaderberg. Distilling policy distillation. arXiv preprint arXiv:1902.02186, 2019.
[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223-1231, 2012.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248-255. IEEE, 2009.
[6] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[8] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
[9] F. N. Iandola, M. W. Moskewicz, K. Ashraf, and K. Keutzer. FireCaffe: Near-linear acceleration of deep neural network training on compute clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2592-2600, 2016.
[10] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
[11] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[12] J. Langford, A. J. Smola, and M. Zinkevich. Slow learners are fast. In Proceedings of the 22nd International Conference on Neural Information Processing Systems, pages 2331-2339. Curran Associates Inc., 2009.
[13] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[14] H. Mikami, H. Suganuma, Y. Tanaka, Y. Kageyama, et al. ImageNet/ResNet-50 training in 224 seconds. arXiv preprint arXiv:1811.05233, 2018.
[15] V. Mnih, A. Puigdomènech Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783, 2016.
[16] A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. De Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, S. Legg, V. Mnih, K. Kavukcuoglu, and D. Silver. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296, 2015.
[17] F. Niu, B. Recht, C. Ré, and S. J. Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent, 2011.
[18] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. 2019.
[19] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693-701, 2011.
[20] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
[21] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9, 2015.
[22] M. Yamazaki, A. Kasagi, A. Tabuchi, T. Honda, M. Miwa, N. Fukumoto, T. Tabaru, A. Ike, and K. Nakashima. Yet another accelerated SGD: ResNet-50 training on ImageNet in 74.7 seconds. arXiv preprint arXiv:1903.12650, 2019.
[23] Y. You, A. Buluç, and J. Demmel. Scaling deep learning on GPU and Knights Landing clusters. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, page 9. ACM, 2017.
[24] Y. You, I. Gitman, and B. Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017.
[25] Y. You, I. Gitman, and B. Ginsburg. Scaling SGD batch size to 32K for ImageNet training. arXiv preprint arXiv:1708.03888, 2017.
[26] Y. You, Z. Zhang, C. Hsieh, J. Demmel, and K. Keutzer. ImageNet training in minutes. CoRR, abs/1709.05011, 2017.
[27] S. Zhang, A. E. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems, pages 685-693, 2015.