A scalable framework for heterogeneous GPU-based clusters

ABSTRACT
GPU-based heterogeneous clusters continue to draw attention from vendors and HPC users due to their high energy efficiency and greatly improved single-node computational performance. However, little parallel software exists that can efficiently utilize all CPU cores and all GPUs on such heterogeneous systems. On a heterogeneous cluster, the performance of a GPU (or a compute node) grows at a much faster rate than the performance of the PCI-Express connection (or the interconnection network), so communication eventually becomes the bottleneck of the entire system. To overcome this bottleneck, we developed a multi-level partitioning and distribution method that guarantees a near-optimal communication volume. We have also extended heterogeneous tile algorithms to work on distributed-memory GPU clusters. Our main idea is to execute a serial program that generates hybrid-size tasks, and to follow a dataflow programming model to fire the tasks on different compute nodes. We then devised a distributed dynamic-scheduling runtime system to schedule tasks and transfer data between hybrid CPU-GPU compute nodes transparently. The runtime system employs a novel distributed task-assignment protocol that resolves data dependencies between tasks without coordination between processing units. The runtime system on each node consists of a number of CPU compute threads, a number of GPU compute threads, a task-generation thread, an MPI communication thread, and a CUDA communication thread. By overlapping computation and communication through dynamic scheduling, we attain a high performance of 75 TFlops for Cholesky factorization on the heterogeneous Keeneland system using 100 nodes, each with twelve CPU cores and three GPUs. Moreover, our framework also attains high performance on distributed-memory clusters without GPUs and on shared-memory multi-GPU systems.
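The node-local half of the runtime described above can be sketched as a dataflow scheduler in which compute threads fire a task only once all of its input dependencies have completed. The sketch below is a hypothetical Python illustration (the actual system is an MPI/CUDA runtime; all names here are invented); it omits the MPI and CUDA communication threads and the distributed task-assignment protocol, and shows only the dependency-driven task firing.

```python
# Hypothetical sketch of a node-local dataflow scheduler: a task-generation
# thread submits tasks with named dependencies; worker ("compute") threads
# fire a task once every task it depends on has finished.
import threading
import queue

class Runtime:
    def __init__(self, n_workers=4):
        self.ready = queue.Queue()   # tasks whose dependencies are all satisfied
        self.waiting = []            # tasks still missing at least one input
        self.done = set()            # names of finished tasks
        self.results = {}            # name -> output of the task's action
        self.pending = 0             # submitted but not yet finished
        self.lock = threading.Lock()
        self.idle = threading.Event()
        self.workers = [threading.Thread(target=self._worker)
                        for _ in range(n_workers)]
        for w in self.workers:
            w.start()

    def submit(self, name, deps, action):
        # Called by the task-generation thread as it walks the serial program.
        with self.lock:
            self.pending += 1
            self.idle.clear()
            task = (name, set(deps), action)
            if task[1] <= self.done:
                self.ready.put(task)
            else:
                self.waiting.append(task)

    def _worker(self):
        while True:
            task = self.ready.get()
            if task is None:          # shutdown sentinel
                return
            name, _, action = task
            out = action()            # run the (CPU or GPU) kernel
            with self.lock:
                self.results[name] = out
                self.done.add(name)
                self.pending -= 1
                # Fire any waiting task whose inputs just became ready.
                still_waiting = []
                for t in self.waiting:
                    if t[1] <= self.done:
                        self.ready.put(t)
                    else:
                        still_waiting.append(t)
                self.waiting = still_waiting
                if self.pending == 0:
                    self.idle.set()

    def join(self):
        # Wait for the task graph to drain, then stop the workers.
        self.idle.wait()
        for _ in self.workers:
            self.ready.put(None)
        for w in self.workers:
            w.join()

# Demo: a tiny three-task DAG in which C consumes the outputs of A and B.
rt = Runtime(n_workers=3)
rt.submit("A", [], lambda: 1)
rt.submit("B", [], lambda: 2)
rt.submit("C", ["A", "B"], lambda: rt.results["A"] + rt.results["B"])
rt.join()
```

In the real runtime, the "fire when dependencies are done" check is what lets computation overlap communication: a task whose remote input is still in flight simply stays in the waiting list while other ready tasks keep the CPU cores and GPUs busy.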