ABSTRACT
We present our performance analysis, algorithm designs, and the optimizations needed to develop high-performance GPU-only algorithms, in particular for the dense Cholesky factorization. In contrast to currently promoted designs that address parallelism on multicore architectures by representing algorithms as Directed Acyclic Graphs (DAGs), where nodes are fine-granularity tasks and edges are the dependencies between them, our designs explicitly target manycore architectures like GPUs and feature coarse-granularity tasks (which can be hierarchically split into fine-grained data-parallel subtasks). Furthermore, in contrast to hybrid algorithms that schedule difficult-to-parallelize tasks on CPUs, we develop highly efficient code for entirely GPU execution. GPU-only codes remove the expensive CPU-to-GPU communications and the tuning challenges related to slow CPUs and/or low CPU-to-GPU bandwidth. We show that on the latest GPUs, such as the P100, this becomes so important that the GPU-only code even outperforms the hybrid MAGMA algorithms when the CPU tasks and communications cannot be entirely overlapped with GPU computations. We achieve up to 4,300 GFlop/s in double precision on a P100 GPU, which is about 7-8x faster than high-end multicore CPUs, e.g., two 10-core Intel Xeon E5-2650 v3 (Haswell) CPUs, where MKL runs at up to about 500-600 GFlop/s. The new algorithm also significantly outperforms the GPU-only implementation currently available in the NVIDIA cuSOLVER library.
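To make the algorithmic structure concrete, the following is a minimal NumPy sketch of the standard right-looking blocked Cholesky factorization that GPU implementations map onto panel (POTF2), triangular-solve (TRSM), and trailing-matrix update (SYRK) kernels. This is an illustration of the textbook blocked algorithm, not the paper's optimized GPU implementation; the function name and block size `nb` are illustrative.

```python
import numpy as np

def blocked_cholesky(A, nb=64):
    """Right-looking blocked Cholesky (lower-triangular factor).

    Each iteration performs the three steps that GPU codes express as
    POTF2 (diagonal-block factorization), TRSM (panel solve), and
    SYRK (trailing-matrix update). The SYRK update carries the bulk
    of the flops, which is why it dominates GPU performance.
    """
    n = A.shape[0]
    L = np.tril(np.asarray(A, dtype=np.float64)).copy()
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # POTF2: unblocked factorization of the kb-by-kb diagonal block
        L11 = np.linalg.cholesky(L[k:k+kb, k:k+kb])
        L[k:k+kb, k:k+kb] = L11
        if k + kb < n:
            # TRSM: L21 = A21 * L11^{-T}
            L[k+kb:, k:k+kb] = np.linalg.solve(L11, L[k+kb:, k:k+kb].T).T
            # SYRK: trailing update A22 -= L21 * L21^T
            P = L[k+kb:, k:k+kb]
            L[k+kb:, k+kb:] -= P @ P.T
    return np.tril(L)
```

Because only the diagonal-block factorization is hard to parallelize, a GPU-only design can keep it on the device (possibly splitting it recursively into smaller data-parallel subtasks) rather than shipping it to the CPU as hybrid codes do.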
High-performance Cholesky factorization for GPU-only execution