DOI: 10.1145/2312005.2312025

Research Article

A scalable framework for heterogeneous GPU-based clusters

Published: 25 June 2012

ABSTRACT

GPU-based heterogeneous clusters continue to draw attention from vendors and HPC users due to their high energy efficiency and greatly improved single-node computational performance. However, little parallel software exists that can efficiently utilize all CPU cores and all GPUs of such a system. On a heterogeneous cluster, the performance of a GPU (or a compute node) increases at a much faster rate than the performance of the PCI-Express connection (or the interconnection network), so communication eventually becomes the bottleneck of the entire system. To overcome this bottleneck, we developed a multi-level partitioning and distribution method that guarantees a near-optimal communication volume, and we extended heterogeneous tile algorithms to work on distributed-memory GPU clusters. Our main idea is to execute a serial program that generates hybrid-size tasks, and to follow a dataflow programming model to fire the tasks on different compute nodes. We also devised a distributed dynamic-scheduling runtime system that schedules tasks and transfers data between hybrid CPU-GPU compute nodes transparently. The runtime system employs a novel distributed task-assignment protocol that resolves data dependencies between tasks without coordination between processing units. On each node, the runtime consists of a number of CPU compute threads, a number of GPU compute threads, a task-generation thread, an MPI communication thread, and a CUDA communication thread. By overlapping computation and communication through dynamic scheduling, we attain 75 TFlops for Cholesky factorization on the heterogeneous Keeneland system using 100 nodes, each with twelve CPU cores and three GPUs. Moreover, our framework also attains high performance on distributed-memory clusters without GPUs and on shared-memory multi-GPU systems.
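The abstract's central idea, a serial control program that enumerates tasks which a runtime then fires in dataflow order across nodes, can be illustrated with a small sketch. The C program below (an illustrative sketch, not the authors' implementation) walks the standard tiled Cholesky loop nest and prints the resulting task stream; the `owner` map, the `fire` stub, and the tile/node counts are assumptions introduced here for the demonstration, and the convention that a task runs on the node owning its output tile is one common way to assign work without coordination, in the spirit of the paper's distributed task-assignment protocol.

```c
#include <stdio.h>

#define T 4   /* tiles per matrix dimension (assumed, small for the demo) */
#define P 3   /* number of compute nodes (assumed) */

/* Hypothetical block-cyclic owner map: the node that owns output tile
 * (m, n) executes the task that writes it, so every node can run this
 * same serial generator and keep only its own tasks -- no coordination
 * messages are needed to assign work. */
static int owner(int m, int n) { return (m + n) % P; }

/* Stand-in for handing a task to a runtime's ready queue. */
static void fire(int node, const char *kernel, int m, int n, int k) {
    printf("node %d: %s on tile(%d,%d), step %d\n", node, kernel, m, n, k);
}

int main(void) {
    /* Standard right-looking tiled Cholesky loop nest: the control
     * program only enumerates tasks; a runtime would track each tile's
     * read/write dependencies and overlap communication with compute. */
    for (int k = 0; k < T; k++) {
        fire(owner(k, k), "POTRF", k, k, k);
        for (int m = k + 1; m < T; m++)
            fire(owner(m, k), "TRSM", m, k, k);
        for (int m = k + 1; m < T; m++) {
            fire(owner(m, m), "SYRK", m, m, k);
            for (int n = k + 1; n < m; n++)
                fire(owner(m, n), "GEMM", m, n, k);
        }
    }
    return 0;
}
```

Running the sketch prints each task together with the node that would execute it. In a runtime like the one the abstract describes, each fired task would additionally carry its input-tile dependencies, and dedicated MPI and CUDA communication threads would move tiles between hosts and devices while the CPU and GPU compute threads drain the ready queue.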


Published in

SPAA '12: Proceedings of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures, June 2012, 348 pages
ISBN: 9781450312134
DOI: 10.1145/2312005

          Copyright © 2012 ACM


Publisher

Association for Computing Machinery, New York, NY, United States



