DOI: 10.1145/2832087.2832089
research-article

Performance analysis of OpenMP on a GPU using a CORAL proxy application

Published: 15 November 2015

ABSTRACT

OpenMP provides high-level parallel abstractions for programming heterogeneous systems based on acceleration technology. Active areas of research are looking to characterize the performance that can be expected from even the simplest combinations of directives and how they compare to versions manually implemented and tuned to a specific hardware accelerator. In this paper we analyze the performance of our implementation of the OpenMP 4.0 constructs on an NVIDIA GPU. For performance analysis we use LULESH, a complex proxy application provided by the Department of Energy as part of the CORAL benchmark suite.

NVIDIA provides CUDA as a native programming model for GPUs. We compare the performance of an OpenMP 4.0 version of LULESH, obtained from a pre-existing OpenMP implementation, with a functionally equivalent CUDA implementation. Alongside our performance analysis we also present the tuning steps required to obtain good performance when porting existing applications to a new accelerator architecture. Based on the analysis of the performance characteristics of our application we present an extension to the compiler code-synthesis process for combined OpenMP 4.0 offloading directives. The results obtained using our OpenMP compilation toolchain show performance that, in the best case, comes within 10% of native CUDA C/C++ for application kernels with low register counts.
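For readers unfamiliar with the term, the following is a minimal sketch of what a combined OpenMP 4.0 offloading directive looks like on a simple loop. The function name `scale_add` and the kernel itself are hypothetical illustrations and are not taken from LULESH or from the paper's implementation; the sketch only assumes the standard `target teams distribute parallel for` combined construct and `map` clauses from the OpenMP 4.0 specification.

```c
/* Hypothetical example kernel (not from LULESH): y[i] = a * x[i] + y[i]. */
void scale_add(int n, double a, const double *x, double *y)
{
    /* Combined OpenMP 4.0 offloading directive: one pragma offloads the
     * loop to the device (target), creates a league of teams (teams),
     * splits iterations across teams (distribute), and splits each
     * team's share across its threads (parallel for). The map clauses
     * copy x to the device and copy y to the device and back. */
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

When compiled without OpenMP support the pragma is ignored and the loop runs sequentially on the host, so the function produces the same result either way; whether it actually runs on a GPU depends on the compiler and runtime in use.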


Published in

PMBS '15: Proceedings of the 6th International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computing Systems
November 2015
105 pages
ISBN: 9781450340090
DOI: 10.1145/2832087

Copyright © 2015 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

PMBS '15 paper acceptance rate: 9 of 22 submissions, 41%. Overall acceptance rate: 9 of 22 submissions, 41%.
