ABSTRACT
We train deep neural networks to solve the acoustic modeling problem for large-vocabulary continuous speech recognition. We employ distributed processing on a cluster of GPUs. Even on a modern GPU, sequential training takes over a day, and parallelizing it efficiently without losing accuracy is notoriously hard. We show that asynchronous SGD (ASGD) methods are not efficient for this application: even with 4 GPUs, the overhead is significant and the accuracies achieved are poor. We adapt a P-learner K-step model averaging algorithm that, with 4 GPUs, achieves accuracies comparable to those of the sequential implementation. We further introduce adaptive measures that make our parallel implementation scale to the full cluster of 20 GPUs. Ultimately, our parallel implementation achieves better accuracies than the sequential implementation with a 6.1x speedup.
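The core of K-step model averaging is simple: each of P learners runs K local SGD steps on its own shard of the data, after which the P models are averaged and redistributed, so learners synchronize once every K steps rather than every step. Below is a minimal single-process sketch of that scheme on a toy least-squares problem; the objective, mini-batch size, and all hyperparameters are illustrative assumptions, not the paper's acoustic-model setup (which uses Torch and MPI across GPUs).

```python
import numpy as np

rng = np.random.default_rng(0)
dim, P, K, rounds, lr = 10, 4, 8, 50, 0.05

# Toy least-squares objective ||X w - y||^2, sharded across the P learners.
X = rng.normal(size=(4000, dim))
w_true = rng.normal(size=dim)
y = X @ w_true + 0.01 * rng.normal(size=4000)
shards = np.array_split(np.arange(4000), P)

w = np.zeros(dim)                          # shared model, broadcast each round
for _ in range(rounds):
    local_models = []
    for p in range(P):                     # each learner would be one GPU
        wp = w.copy()
        for _ in range(K):                 # K local SGD steps, no communication
            idx = rng.choice(shards[p], size=32, replace=False)
            grad = 2.0 * X[idx].T @ (X[idx] @ wp - y[idx]) / len(idx)
            wp -= lr * grad
        local_models.append(wp)
    w = np.mean(local_models, axis=0)      # one all-reduce: average the P models

print("distance to optimum:", np.linalg.norm(w - w_true))
```

Relative to per-step synchronization, the learners exchange parameters a factor of K less often, which is what makes the scheme attractive on a cluster where communication, not computation, is the bottleneck.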
- L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. 2017.
- R. H. Byrd, G. M. Chin, J. Nocedal, and Y. Wu. Sample size selection in optimization methods for machine learning. Math. Programming, 134, 2012.
- NVIDIA cuDNN -- GPU accelerated deep learning. https://developer.nvidia.com/cudnn.
- J. Dean, G. Corrado, R. Monga, et al. Large scale distributed deep networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1223--1231. Curran Associates, Inc., 2012.
- O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165--202, 2012.
- J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121--2159, July 2011.
- J. J. Godfrey, E. C. Holliman, and J. McDaniel. SWITCHBOARD: Telephone speech corpus for research and development. In Proc. IEEE International Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pages 517--520, 1992. http://ieeexplore.ieee.org/document/225858/.
- F. Hashemi, S. Ghosh, and R. Pasupathy. On adaptive sampling rules for stochastic recursions. In Proc. 2014 Winter Simulation Conference, pages 3959--3970, 2014.
- S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735--1780, November 1997. http://www.mitpressjournals.org/doi/pdfplus/10.1162/neco.1997.9.8.1735.
- D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2015.
- 2000 HUB5 English evaluation speech. https://catalog.ldc.upenn.edu/LDC2002S09.
- 2000 HUB5 English evaluation transcripts. https://catalog.ldc.upenn.edu/LDC2002T43.
- Switchboard-1 release 2. https://catalog.ldc.upenn.edu/LDC97S62.
- M. Li, D. G. Andersen, J. W. Park, et al. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 583--598, Broomfield, CO, October 2014. USENIX Association.
- X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737--2745, 2015.
- A. Mohamed, F. Seide, D. Yu, J. Droppo, A. Stolcke, G. Zweig, and G. Penn. Deep bi-directional recurrent networks over spectral windows. In Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2015. http://ieeexplore.ieee.org/document/7404777/.
- N. Morgan and H. Bourlard. An introduction to hybrid HMM/connectionist continuous speech recognition. IEEE Signal Processing Magazine, 12(3):25--42, May 1995.
- mpiT -- MPI for Torch. https://github.com/sixin-zh/mpiT.
- R. Pasupathy, P. Glynn, S. Ghosh, and F. Hashemi. On sampling rates in simulation-based recursions. SIAM Journal on Optimization, 2017. In revision.
- J. Picone. Switchboard resegmentation project. https://www.isip.piconepress.com/projects/switchboard/.
- B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693--701, 2011.
- T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In Proc. IEEE International Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pages 6655--6659, 2013. http://ieeexplore.ieee.org/abstract/document/6638949/.
- Torch -- A scientific computing framework for LuaJIT. http://torch.ch.
- S. J. Young, J. J. Odell, and P. C. Woodland. Tree-based state tying for high accuracy modelling. In Proc. Workshop on Human Language Technology, pages 307--312, 1994. http://aclweb.org/anthology/H/H94/H94-1062.pdf.
- S. Zhang, A. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7--12, 2015, Montreal, Quebec, Canada, pages 685--693, 2015.
- F. Zhou and G. Cong. On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization. arXiv:1708.01012 [cs.LG], August 2017. https://arxiv.org/abs/1708.01012.