ABSTRACT
We present our performance analysis, algorithm designs, and the optimizations needed to develop high-performance GPU-only algorithms, in particular for the dense Cholesky factorization. In contrast to currently promoted designs that address parallelism on multicore architectures by representing algorithms as Directed Acyclic Graphs (DAGs), where nodes are fine-granularity tasks and edges are the dependencies between them, our designs explicitly target manycore architectures like GPUs and feature coarse-granularity tasks (which can be hierarchically split into fine-grained data-parallel subtasks). Furthermore, in contrast to hybrid algorithms that schedule difficult-to-parallelize tasks on CPUs, we develop highly efficient code for entirely GPU execution. GPU-only codes remove the expensive CPU-to-GPU communications and the tuning challenges related to slow CPUs and/or low CPU-to-GPU bandwidth. We show that on the latest GPUs, such as the P100, this becomes so important that the GPU-only code even outperforms the hybrid MAGMA algorithms when the CPU tasks and communications cannot be entirely overlapped with GPU computations. We achieve up to 4,300 GFlop/s in double precision on a P100 GPU, which is about 7-8x faster than high-end multicore CPUs, e.g., two 10-core Intel Xeon E5-2650 v3 (Haswell) CPUs, where MKL runs at up to about 500-600 GFlop/s. The new algorithm also significantly outperforms the GPU-only implementation currently available in the NVIDIA cuSOLVER library.
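To make the algorithmic structure concrete, the following is a minimal NumPy sketch of the standard right-looking blocked Cholesky factorization that GPU implementations map onto panel (POTF2), triangular-solve (TRSM), and trailing-matrix update (SYRK) kernels. This is an illustration of the textbook blocked algorithm, not the paper's optimized GPU implementation; the function name and block size `nb` are illustrative.

```python
import numpy as np

def blocked_cholesky(A, nb=64):
    """Right-looking blocked Cholesky (lower-triangular factor).

    Each iteration performs the three steps that GPU codes express as
    POTF2 (diagonal-block factorization), TRSM (panel solve), and
    SYRK (trailing-matrix update). The SYRK update carries the bulk
    of the flops, which is why it dominates GPU performance.
    """
    n = A.shape[0]
    L = np.tril(np.asarray(A, dtype=np.float64)).copy()
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # POTF2: unblocked factorization of the kb-by-kb diagonal block
        L11 = np.linalg.cholesky(L[k:k+kb, k:k+kb])
        L[k:k+kb, k:k+kb] = L11
        if k + kb < n:
            # TRSM: L21 = A21 * L11^{-T}
            L[k+kb:, k:k+kb] = np.linalg.solve(L11, L[k+kb:, k:k+kb].T).T
            # SYRK: trailing update A22 -= L21 * L21^T
            P = L[k+kb:, k:k+kb]
            L[k+kb:, k+kb:] -= P @ P.T
    return np.tril(L)
```

Because only the diagonal-block factorization is hard to parallelize, a GPU-only design can keep it on the device (possibly splitting it recursively into smaller data-parallel subtasks) rather than shipping it to the CPU as hybrid codes do.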
High-performance Cholesky factorization for GPU-only execution