ABSTRACT
Computational accelerators, such as manycore NVIDIA GPUs, Intel Xeon Phi and FPGAs, are becoming common in workstations, servers and supercomputers for scientific and engineering applications. Efficiently exploiting the massive parallelism these accelerators provide requires the design and implementation of productive programming models.
In this paper, we explore support for multiple accelerators in high-level programming models. We design novel language extensions to OpenMP to support offloading data and computation regions to multiple accelerators (devices). These extensions allow for distributing data and computation among a list of devices via easy-to-use interfaces, including specifying the distribution of multi-dimensional arrays and declaring shared data regions among accelerators. Computation distribution is realized by partitioning a loop iteration space among accelerators. We implement mechanisms to marshal/unmarshal and to move data of non-contiguous array subregions and shared regions between accelerators without involving CPUs. We design reduction techniques that work across multiple accelerators. Combined compiler and runtime support manages multiple GPUs using asynchronous operations and threading mechanisms. We implement our solutions for NVIDIA GPUs and demonstrate their effectiveness for performance improvement through example OpenMP codes.