High-performance Cholesky factorization for GPU-only execution

DOI: 10.1145/3038228.3038237
Published: 4 February 2017
ABSTRACT

We present our performance analysis, algorithm designs, and the optimizations needed to develop high-performance GPU-only algorithms, in particular, the dense Cholesky factorization. In contrast to currently promoted designs that address parallelism challenges on multicore architectures by representing algorithms as Directed Acyclic Graphs (DAGs), where nodes are tasks of fine granularity and edges are the dependencies between the tasks, our designs explicitly target manycore architectures like GPUs and feature coarse-granularity tasks (that can be hierarchically split into fine-grain data-parallel subtasks). Furthermore, in contrast to hybrid algorithms that schedule difficult-to-parallelize tasks on CPUs, we develop highly efficient code that executes entirely on the GPU. GPU-only codes remove the expensive CPU-to-GPU communications and the tuning challenges related to slow CPUs and/or low CPU-to-GPU bandwidth. We show that on the latest GPUs, like the P100, this becomes so important that the GPU-only code even outperforms the hybrid MAGMA algorithms when the CPU tasks and communications cannot be entirely overlapped with GPU computations. We achieve up to 4,300 GFlop/s in double precision on a P100 GPU, which is about 7-8x faster than high-end multicore CPUs, e.g., two 10-core Intel Xeon E5-2650 v3 Haswell CPUs, where MKL runs at up to about 500-600 GFlop/s. The new algorithm also significantly outperforms the GPU-only implementation currently available in the NVIDIA cuSOLVER library.
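The GPU-only approach described in the abstract amounts to keeping every step of the blocked factorization on the device: the small diagonal-block factorization as well as the large triangular-solve (TRSM) and symmetric rank-k (SYRK) updates, so no panel ever travels back to a host CPU. The sketch below illustrates that flow for a right-looking, lower-triangular blocked Cholesky using standard cuSOLVER and cuBLAS calls. It is a simplified illustration under assumed parameters (a fixed block size nb, a preallocated workspace dWork/lwork, and omitted error checking), not the paper's MAGMA implementation or its optimized fused kernels.

  // Illustrative sketch (not the paper's MAGMA code): right-looking blocked
  // Cholesky, lower triangular, column-major, executed entirely on the GPU.
  #include <cublas_v2.h>
  #include <cusolverDn.h>
  #include <algorithm>

  // dA: n x n symmetric positive definite matrix in GPU memory (device pointer).
  // dWork/lwork: cuSOLVER workspace sized via cusolverDnDpotrf_bufferSize for nb.
  // dInfo: device pointer receiving the POTRF status flag.
  void gpu_only_dpotrf(cusolverDnHandle_t solver, cublasHandle_t blas,
                       int n, double *dA, int lda, int nb,
                       double *dWork, int lwork, int *dInfo)
  {
      const double one = 1.0, minus_one = -1.0;

      for (int j = 0; j < n; j += nb) {
          int jb = std::min(nb, n - j);

          // 1. Factor the jb x jb diagonal block: A(j,j) = L(j,j) * L(j,j)^T.
          cusolverDnDpotrf(solver, CUBLAS_FILL_MODE_LOWER, jb,
                           dA + j + (size_t)j * lda, lda, dWork, lwork, dInfo);

          if (j + jb < n) {
              int m = n - j - jb;  // rows remaining below the diagonal block

              // 2. Panel update: A(j+jb:n, j) <- A(j+jb:n, j) * L(j,j)^{-T}.
              cublasDtrsm(blas, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                          CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT, m, jb, &one,
                          dA + j + (size_t)j * lda, lda,
                          dA + (j + jb) + (size_t)j * lda, lda);

              // 3. Trailing-matrix update:
              //    A(j+jb:n, j+jb:n) <- A(j+jb:n, j+jb:n) - L(j+jb:n, j) * L(j+jb:n, j)^T.
              cublasDsyrk(blas, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, m, jb,
                          &minus_one, dA + (j + jb) + (size_t)j * lda, lda,
                          &one, dA + (j + jb) + (size_t)(j + jb) * lda, lda);
          }
      }
  }

Because all three steps operate on device-resident data, the only CPU involvement is launching the sequence of kernel calls; this is the property the GPU-only design exploits to avoid the CPU-to-GPU transfer and overlap-tuning costs discussed in the abstract.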


    • Published in

      GPGPU-10: Proceedings of the General Purpose GPUs
      February 2017
      84 pages
      ISBN: 9781450349154
      DOI: 10.1145/3038228

      Copyright © 2017 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 4 February 2017


      Qualifiers

      • research-article
      • Research
      • Refereed limited

      Acceptance Rates

      GPGPU-10 paper acceptance rate: 8 of 15 submissions (53%); overall acceptance rate: 57 of 129 submissions (44%)
