A scalable framework for heterogeneous GPU-based clusters

ABSTRACT
GPU-based heterogeneous clusters continue to draw attention from vendors and HPC users due to their high energy efficiency and greatly improved single-node computational performance. However, little parallel software exists that can efficiently utilize all CPU cores and all GPUs on such heterogeneous systems. On a heterogeneous cluster, the performance of a GPU (or a compute node) grows at a much faster rate than the performance of the PCI-Express connection (or the interconnection network), so communication eventually becomes the bottleneck of the entire system. To overcome this bottleneck, we developed a multi-level partitioning and distribution method that guarantees a near-optimal communication volume. We have also extended heterogeneous tile algorithms to work on distributed-memory GPU clusters. Our main idea is to execute a serial program that generates hybrid-size tasks, and to follow a dataflow programming model to fire the tasks on different compute nodes. We then devised a distributed dynamic-scheduling runtime system to schedule tasks and transfer data between hybrid CPU-GPU compute nodes transparently. The runtime system employs a novel distributed task-assignment protocol that resolves data dependencies between tasks without coordination between processing units. The runtime system on each node consists of a number of CPU compute threads, a number of GPU compute threads, a task-generation thread, an MPI communication thread, and a CUDA communication thread. By overlapping computation and communication through dynamic scheduling, we attain a high performance of 75 TFlops for Cholesky factorization on the heterogeneous Keeneland system using 100 nodes, each with twelve CPU cores and three GPUs. Moreover, our framework also attains high performance on distributed-memory clusters without GPUs and on shared-memory multi-GPU systems.
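The node-local half of the runtime described above can be sketched as a dataflow scheduler in which compute threads fire a task only once all of its input dependencies have completed. The sketch below is a hypothetical Python illustration (the actual system is an MPI/CUDA runtime; all names here are invented); it omits the MPI and CUDA communication threads and the distributed task-assignment protocol, and shows only the dependency-driven task firing.

```python
# Hypothetical sketch of a node-local dataflow scheduler: a task-generation
# thread submits tasks with named dependencies; worker ("compute") threads
# fire a task once every task it depends on has finished.
import threading
import queue

class Runtime:
    def __init__(self, n_workers=4):
        self.ready = queue.Queue()   # tasks whose dependencies are all satisfied
        self.waiting = []            # tasks still missing at least one input
        self.done = set()            # names of finished tasks
        self.results = {}            # name -> output of the task's action
        self.pending = 0             # submitted but not yet finished
        self.lock = threading.Lock()
        self.idle = threading.Event()
        self.workers = [threading.Thread(target=self._worker)
                        for _ in range(n_workers)]
        for w in self.workers:
            w.start()

    def submit(self, name, deps, action):
        # Called by the task-generation thread as it walks the serial program.
        with self.lock:
            self.pending += 1
            self.idle.clear()
            task = (name, set(deps), action)
            if task[1] <= self.done:
                self.ready.put(task)
            else:
                self.waiting.append(task)

    def _worker(self):
        while True:
            task = self.ready.get()
            if task is None:          # shutdown sentinel
                return
            name, _, action = task
            out = action()            # run the (CPU or GPU) kernel
            with self.lock:
                self.results[name] = out
                self.done.add(name)
                self.pending -= 1
                # Fire any waiting task whose inputs just became ready.
                still_waiting = []
                for t in self.waiting:
                    if t[1] <= self.done:
                        self.ready.put(t)
                    else:
                        still_waiting.append(t)
                self.waiting = still_waiting
                if self.pending == 0:
                    self.idle.set()

    def join(self):
        # Wait for the task graph to drain, then stop the workers.
        self.idle.wait()
        for _ in self.workers:
            self.ready.put(None)
        for w in self.workers:
            w.join()

# Demo: a tiny three-task DAG in which C consumes the outputs of A and B.
rt = Runtime(n_workers=3)
rt.submit("A", [], lambda: 1)
rt.submit("B", [], lambda: 2)
rt.submit("C", ["A", "B"], lambda: rt.results["A"] + rt.results["B"])
rt.join()
```

In the real runtime, the "fire when dependencies are done" check is what lets computation overlap communication: a task whose remote input is still in flight simply stays in the waiting list while other ready tasks keep the CPU cores and GPUs busy.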