ABSTRACT
OpenMP provides high-level parallel abstractions for programming heterogeneous systems built around hardware accelerators. An active area of research is characterizing the performance that can be expected from even the simplest combinations of directives, and how it compares to versions manually implemented and tuned for a specific hardware accelerator. In this paper we analyze the performance of our implementation of the OpenMP 4.0 constructs on an NVIDIA GPU. For the performance analysis we use LULESH, a complex proxy application provided by the Department of Energy as part of the CORAL benchmark suite.
NVIDIA provides CUDA as the native programming model for its GPUs. We compare the performance of an OpenMP 4.0 version of LULESH, derived from a pre-existing OpenMP implementation, with a functionally equivalent CUDA implementation. Alongside the performance analysis we also present the tuning steps required to obtain good performance when porting existing applications to a new accelerator architecture. Based on the performance characteristics of the application, we present an extension to the compiler code-synthesis process for combined OpenMP 4.0 offloading directives. The results obtained with our OpenMP compilation toolchain show performance within as little as 10% of native CUDA C/C++ for application kernels with low register counts.
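To make the comparison concrete, the following is a minimal sketch, not taken from the LULESH source, of the two programming styles compared in the paper: a loop offloaded with a combined OpenMP 4.0 directive, and the same loop written as a native CUDA kernel. The function names (daxpy_omp, daxpy_cuda) and launch parameters are hypothetical placeholders.

```cpp
// OpenMP 4.0: a single combined offloading construct maps the data to the
// device and distributes the loop iterations across teams of GPU threads.
void daxpy_omp(double *y, const double *x, double a, int n)
{
  #pragma omp target teams distribute parallel for \
          map(tofrom: y[0:n]) map(to: x[0:n])
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}

// Native CUDA: the programmer writes the kernel and chooses the launch
// geometry (block and grid sizes) explicitly.
__global__ void daxpy_kernel(double *y, const double *x, double a, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a * x[i] + y[i];
}

void daxpy_cuda(double *d_y, const double *d_x, double a, int n)
{
  const int threads = 256;                      // threads per block (tunable)
  const int blocks = (n + threads - 1) / threads;
  daxpy_kernel<<<blocks, threads>>>(d_y, d_x, a, n);
}
```

In the OpenMP version the compiler and runtime choose how iterations map onto teams and threads, which is exactly the code-synthesis decision the paper's extension addresses; in the CUDA version that mapping is fixed by hand.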