DOI: 10.1145/2983990.2984032

Portable inter-workgroup barrier synchronisation for GPUs

Published: 19 October 2016

Abstract

Despite the growing popularity of GPGPU programming, there is not yet a portable and formally-specified barrier that one can use to synchronise across workgroups. Moreover, the occupancy-bound execution model of GPUs breaks assumptions inherent in traditional software execution barriers, exposing them to deadlock. We present an occupancy discovery protocol that dynamically discovers a safe estimate of the occupancy for a given GPU and kernel, allowing for a starvation-free (and hence, deadlock-free) inter-workgroup barrier by restricting the number of workgroups according to this estimate. We implement this idea by adapting an existing, previously non-portable, GPU inter-workgroup barrier to use OpenCL 2.0 atomic operations, and prove that the barrier meets its natural specification in terms of synchronisation.
We assess the portability of our approach over eight GPUs spanning four vendors, comparing the performance of our method against alternative methods. Our key findings include: (1) the recall of our discovery protocol is nearly 100%; (2) runtime comparisons vary substantially across GPUs and applications; and (3) our method provides portable and safe inter-workgroup synchronisation across the applications we study.
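
The occupancy discovery idea in the abstract can be illustrated with a small host-side simulation. This is a minimal sketch under stated assumptions, not the paper's OpenCL implementation: Python threads stand in for workgroups, a semaphore stands in for occupancy-bound scheduling, and a lock with a condition variable stands in for OpenCL 2.0 atomic operations. All names (`DiscoveryState`, `discover`, `CounterBarrier`, `run_kernel`) are hypothetical.

```python
import threading


class DiscoveryState:
    """Shared state for the (assumed) polling protocol."""

    def __init__(self):
        self.mutex = threading.Lock()
        self.poll_open = True
        self.count = 0


def discover(state):
    """Return a participant id, or None if the poll was already closed."""
    with state.mutex:
        if not state.poll_open:
            return None          # arrived too late: do not participate
        my_id = state.count      # this workgroup is known to be running
        state.count += 1
    with state.mutex:
        state.poll_open = False  # first participant to get here closes the poll
    return my_id


class CounterBarrier:
    """Reusable counting barrier for n participants."""

    def __init__(self):
        self.cond = threading.Condition()
        self.arrived = 0
        self.generation = 0

    def wait(self, n):
        with self.cond:
            gen = self.generation
            self.arrived += 1
            if self.arrived == n:
                # Last arrival: reset and release everyone in this round.
                self.arrived = 0
                self.generation += 1
                self.cond.notify_all()
            else:
                while gen == self.generation:
                    self.cond.wait()


def run_kernel(num_workgroups, occupancy, rounds=3):
    """Simulate a launch; return (discovered count, ids that finished)."""
    slots = threading.Semaphore(occupancy)  # at most `occupancy` resident
    state = DiscoveryState()
    barrier = CounterBarrier()
    finished = []
    finished_lock = threading.Lock()

    def workgroup():
        with slots:  # a workgroup keeps its slot until it finishes
            my_id = discover(state)
            if my_id is None:
                return           # non-participant exits immediately
            n = state.count      # final value: the poll is closed by now
            for _ in range(rounds):
                barrier.wait(n)  # only co-resident workgroups synchronise
            with finished_lock:
                finished.append(my_id)

    threads = [threading.Thread(target=workgroup) for _ in range(num_workgroups)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state.count, sorted(finished)


if __name__ == "__main__":
    count, finished = run_kernel(num_workgroups=16, occupancy=4)
    print("discovered:", count, "finished:", finished)
```

In this sketch every participant holds its slot from before the poll closes until after its last barrier, so the discovered count cannot exceed the simulated occupancy and the barrier cannot wait on a workgroup that was never scheduled. The estimate is safe rather than exact: it may be well below the true occupancy, depending on how the threads interleave.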


Published In

OOPSLA 2016: Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications
October 2016
915 pages
ISBN:9781450344449
DOI:10.1145/2983990
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPU
  2. OpenCL
  3. barrier
  4. portability
  5. synchronisation

Qualifiers

  • Research-article

Conference

SPLASH '16

Acceptance Rates

Overall Acceptance Rate 268 of 1,244 submissions, 22%
