ABSTRACT
OpenMP is a shared-memory programming model that supports the offloading of target regions to accelerators such as NVIDIA GPUs. The implementation in Clang/LLVM aims to deliver a generic GPU compilation toolchain that supports both the native CUDA C/C++ and the OpenMP device offloading models. There are situations where the semantics of OpenMP and those of CUDA diverge. One such example is the policy for implicitly handling local variables. In CUDA, local variables are implicitly mapped to thread-local memory and thus become private to a CUDA thread. In OpenMP, because the model allows nesting regions executed by different numbers of threads, local variables need to be implicitly shared among the threads of a contention group.
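To make the divergence concrete, consider the following minimal OpenMP offloading sketch (illustrative only, not taken from the paper; the variable names are ours). A local variable declared in the sequential part of a target region is read by every thread of a nested parallel region, so OpenMP semantics require it to be shared among those threads, whereas a naive CUDA-style lowering would privatize it into each thread's local memory:

```c
#include <stdio.h>

int main(void) {
  int result = 0;
  #pragma omp target map(tofrom: result)
  {
    /* Executed sequentially by the team's initial thread:
       'scale' is a local variable of the target region. */
    int scale = 4;

    /* All threads of the nested parallel region read 'scale'.
       OpenMP requires one shared instance, visible to the whole
       contention group, not a per-thread private copy. */
    #pragma omp parallel for reduction(+: result)
    for (int i = 0; i < 128; ++i)
      result += i * scale;
  }
  printf("%d\n", result); /* 4 * (127*128/2) = 32512 */
  return 0;
}
```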
In this paper we present a redesign of the OpenMP device data-sharing infrastructure in the Clang/LLVM toolchain, the component responsible for the implicit sharing of local variables. The new infrastructure lowers implicitly shared variables to the shared memory of the GPU.
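The hypothetical CUDA fragment below sketches the core idea, under the simplifying assumption of one statically allocated shared-memory slot per implicitly shared variable (the kernel name `target_region` and the slot `scale_shared` are ours, and the generated code produced by the actual Clang/LLVM infrastructure is more general). The essence is that the team's master thread writes the variable into `__shared__` memory, where all threads of the team can see it:

```cuda
__global__ void target_region(int *result) {
  // One statically allocated shared-memory slot for the implicitly
  // shared local; every thread of the team sees the same instance.
  __shared__ int scale_shared;

  if (threadIdx.x == 0) {
    // Sequential part: the team's master thread writes the "local"
    // variable into shared memory instead of a private register
    // or stack slot.
    scale_shared = 4;
  }
  __syncthreads(); // make the master's write visible to all threads

  // Parallel part: all threads read the single shared instance.
  atomicAdd(result, (int)threadIdx.x * scale_shared);
}

// Usage sketch: target_region<<<1, 128>>>(d_result);
```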
We measure the amount of shared memory used by our scheme in cases that involve scalar variables and statically allocated arrays. The evaluation is carried out by offloading to NVIDIA K40 and P100 GPUs. For scalar variables the pressure on shared memory is relatively low, under 26% of shared-memory utilization on the K40, and does not negatively impact occupancy; the limiting occupancy factor in that case is register pressure. The data-sharing scheme offers users a simple memory model for controlling the implicit allocation of device shared memory.