Research article
DOI: 10.1145/3146347.3146351

Accelerating deep neural network learning for speech recognition on a cluster of GPUs

Published: 12 November 2017

ABSTRACT

We train deep neural networks to solve the acoustic modeling problem for large-vocabulary continuous speech recognition, using distributed processing on a cluster of GPUs. On modern GPUs, the sequential implementation takes over a day to train, and efficient parallelization without losing accuracy is notoriously hard. We show that asynchronous SGD (ASGD) methods for parallelization are not efficient for this application: even with 4 GPUs, the overhead is significant and the accuracies achieved are poor. We adapt a P-learner K-step model averaging algorithm that, with 4 GPUs, achieves accuracies comparable to those of the sequential implementation. We further introduce adaptive measures that make our parallel implementation scale to the full cluster of 20 GPUs. Ultimately, our parallel implementation achieves better accuracies than the sequential implementation, with a 6.1x speedup.
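
For readers unfamiliar with the scheme, K-step model averaging across P learners alternates between K independent SGD steps on each learner and a global average of the P parameter sets. Below is a minimal sketch in Python/numpy on a toy least-squares problem; it is not the paper's implementation (which trains acoustic models on a GPU cluster), and the learner count, K, learning rate, batch size, and objective are illustrative assumptions.

# Minimal sketch of P-learner K-step model averaging (a.k.a. local SGD),
# shown on a toy least-squares problem. Assumptions not taken from the
# paper: the objective, P, K, learning rate, and batch size are illustrative,
# and the serial loop over learners stands in for parallel execution on GPUs.
import numpy as np

rng = np.random.default_rng(0)

d, n, P, K = 16, 8192, 4, 8               # feature dim, samples, learners, local steps
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.01 * rng.normal(size=n)
shards = np.array_split(np.arange(n), P)  # one data shard per learner

def local_sgd(w, idx, k, lr=0.05, batch=32):
    """Take k minibatch SGD steps on one learner's shard, starting from w."""
    w = w.copy()
    for _ in range(k):
        b = rng.choice(idx, size=batch, replace=False)
        grad = X[b].T @ (X[b] @ w - y[b]) / batch  # least-squares gradient
        w -= lr * grad
    return w

w = np.zeros(d)                           # shared model
for _ in range(50):                       # synchronization rounds
    # Each of the P learners starts from the shared model and runs K local
    # steps; in the distributed setting these run concurrently, one per GPU.
    local_models = [local_sgd(w, shards[p], K) for p in range(P)]
    # Synchronize by averaging the P models (an allreduce in practice).
    w = np.mean(local_models, axis=0)

print("distance to true weights:", np.linalg.norm(w - w_true))

Setting K = 1 recovers fully synchronous data-parallel SGD; larger K reduces communication at the cost of letting the local models drift apart between averaging points.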

  • Published in

    MLHPC'17: Proceedings of the Machine Learning on HPC Environments
    November 2017
    81 pages
    ISBN: 9781450351379
    DOI: 10.1145/3146347

    Copyright © 2017 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 12 November 2017

    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate: 5 of 7 submissions, 71%
