ABSTRACT
We train deep neural networks to solve the acoustic modeling problem for large-vocabulary continuous speech recognition. We employ distributed processing on a cluster of GPUs. Even on a modern GPU, sequential training takes over a day, and parallelizing it efficiently without losing accuracy is notoriously hard. We show that asynchronous SGD (ASGD) methods are not efficient for this application: even with 4 GPUs, the overhead is significant and the accuracies achieved are poor. We adapt a P-learner K-step model averaging algorithm that, with 4 GPUs, achieves accuracies comparable to those of the sequential implementation. We further introduce adaptive measures that make our parallel implementation scale to the full cluster of 20 GPUs. Ultimately, our parallel implementation achieves better accuracies than the sequential implementation with a 6.1x speedup.
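The core of K-step model averaging is simple: each of P learners runs K local SGD steps on its own shard of the data, after which the P models are averaged and redistributed, so learners synchronize once every K steps rather than every step. Below is a minimal single-process sketch of that scheme on a toy least-squares problem; the objective, mini-batch size, and all hyperparameters are illustrative assumptions, not the paper's acoustic-model setup (which uses Torch and MPI across GPUs).

```python
import numpy as np

rng = np.random.default_rng(0)
dim, P, K, rounds, lr = 10, 4, 8, 50, 0.05

# Toy least-squares objective ||X w - y||^2, sharded across the P learners.
X = rng.normal(size=(4000, dim))
w_true = rng.normal(size=dim)
y = X @ w_true + 0.01 * rng.normal(size=4000)
shards = np.array_split(np.arange(4000), P)

w = np.zeros(dim)                          # shared model, broadcast each round
for _ in range(rounds):
    local_models = []
    for p in range(P):                     # each learner would be one GPU
        wp = w.copy()
        for _ in range(K):                 # K local SGD steps, no communication
            idx = rng.choice(shards[p], size=32, replace=False)
            grad = 2.0 * X[idx].T @ (X[idx] @ wp - y[idx]) / len(idx)
            wp -= lr * grad
        local_models.append(wp)
    w = np.mean(local_models, axis=0)      # one all-reduce: average the P models

print("distance to optimum:", np.linalg.norm(w - w_true))
```

Relative to per-step synchronization, the learners exchange parameters a factor of K less often, which is what makes the scheme attractive on a cluster where communication, not computation, is the bottleneck.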
- L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. 2017.
- R. H. Byrd, G. M. Chin, J. Nocedal, and Y. Wu. Sample size selection in optimization methods for machine learning. Math. Programming, 134, 2012.
- NVIDIA cuDNN -- GPU accelerated deep learning. https://developer.nvidia.com/cudnn.
- J. Dean, G. Corrado, R. Monga, et al. Large scale distributed deep networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1223--1231. Curran Associates, Inc., 2012.
- O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165--202, 2012.
- J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121--2159, July 2011.
- J. J. Godfrey, E. C. Holliman, and J. McDaniel. SWITCHBOARD: Telephone speech corpus for research and development. In Proc. IEEE International Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pages 517--520, 1992. http://ieeexplore.ieee.org/document/225858/.
- F. Hashemi, S. Ghosh, and R. Pasupathy. On adaptive sampling rules for stochastic recursions. In Proc. 2014 Winter Simulation Conference, pages 3959--3970, 2014.
- S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735--1780, November 1997. http://www.mitpressjournals.org/doi/pdfplus/10.1162/neco.1997.9.8.1735.
- D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2015.
- 2000 HUB5 English evaluation speech. https://catalog.ldc.upenn.edu/LDC2002S09.
- 2000 HUB5 English evaluation transcripts. https://catalog.ldc.upenn.edu/LDC2002T43.
- Switchboard-1 release 2. https://catalog.ldc.upenn.edu/LDC97S62.
- M. Li, D. G. Andersen, J. W. Park, et al. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 583--598, Broomfield, CO, October 2014. USENIX Association.
- X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737--2745, 2015.
- A. Mohamed, F. Seide, D. Yu, J. Droppo, A. Stolcke, G. Zweig, and G. Penn. Deep bi-directional recurrent networks over spectral windows. In Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2015. http://ieeexplore.ieee.org/document/7404777/.
- N. Morgan and H. Bourlard. An introduction to hybrid HMM/connectionist continuous speech recognition. IEEE Signal Processing Magazine, 12(3):25--42, May 1995.
- mpiT -- MPI for Torch. https://github.com/sixin-zh/mpiT.
- R. Pasupathy, P. Glynn, S. Ghosh, and F. Hashemi. On sampling rates in simulation-based recursions. SIAM Journal on Optimization, 2017. In revision.
- J. Picone. Switchboard resegmentation project. https://www.isip.piconepress.com/projects/switchboard/.
- B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693--701, 2011.
- T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In Proc. IEEE International Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pages 6655--6659, 2013. http://ieeexplore.ieee.org/abstract/document/6638949/.
- Torch -- A scientific computing framework for LuaJIT. http://torch.ch.
- S. J. Young, J. J. Odell, and P. C. Woodland. Tree-based state tying for high accuracy modelling. In Proc. Workshop on Human Language Technology, pages 307--312, 1994. http://aclweb.org/anthology/H/H94/H94-1062.pdf.
- S. Zhang, A. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7--12, 2015, Montreal, Quebec, Canada, pages 685--693, 2015.
- F. Zhou and G. Cong. On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization. arXiv:1708.01012 [cs.LG], August 2017. https://arxiv.org/abs/1708.01012.