Portable inter-workgroup barrier synchronisation for GPUs

Authors: Tyler Sorensen, Alastair F. Donaldson, Ganesh Gopalakrishnan, Zvonimir Rakamarić

OOPSLA 2016: Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications

Pages 39 - 58

https://doi.org/10.1145/2983990.2984032

Published: 19 October 2016

Abstract

Despite the growing popularity of GPGPU programming, there is not yet a portable and formally-specified barrier that one can use to synchronise across workgroups. Moreover, the occupancy-bound execution model of GPUs breaks assumptions inherent in traditional software execution barriers, exposing them to deadlock. We present an occupancy discovery protocol that dynamically discovers a safe estimate of the occupancy for a given GPU and kernel, allowing for a starvation-free (and hence, deadlock-free) inter-workgroup barrier by restricting the number of workgroups according to this estimate. We implement this idea by adapting an existing, previously non-portable, GPU inter-workgroup barrier to use OpenCL 2.0 atomic operations, and prove that the barrier meets its natural specification in terms of synchronisation.

We assess the portability of our approach over eight GPUs spanning four vendors, comparing the performance of our method against alternative methods. Our key findings include: (1) the recall of our discovery protocol is nearly 100%; (2) runtime comparisons vary substantially across GPUs and applications; and (3) our method provides portable and safe inter-workgroup synchronisation across the applications we study.
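
To make the occupancy discovery idea from the abstract more concrete, here is a minimal simulation in Python. It is a sketch under stated assumptions, not the paper's OpenCL implementation. It assumes a poll-style protocol in which workgroups that become resident while a poll is open take a participant id, the first participant to finish joining closes the poll, and late arrivals exit without participating. The names `run_discovery`, `slots`, `state` and the condition-variable barrier are hypothetical, and Python threads merely stand in for GPU workgroups.

```python
import threading


def run_discovery(launched, occupancy):
    """Simulate occupancy discovery followed by one barrier round.

    Hypothetical model: `slots` mimics the occupancy bound (at most
    `occupancy` workgroups are resident at once), and a resident
    workgroup keeps its slot until it exits.
    """
    slots = threading.Semaphore(occupancy)
    mutex = threading.Lock()
    barrier_cv = threading.Condition(mutex)
    state = {"open": True, "count": 0, "arrived": 0}
    passed = []  # ids of participants that got through the barrier

    def workgroup():
        with slots:  # the workgroup becomes resident
            with mutex:
                if not state["open"]:
                    return  # arrived after the poll closed: do not participate
                pid = state["count"]
                state["count"] += 1
            with mutex:
                state["open"] = False  # from here on, count is final
            # Barrier restricted to the discovered participants.
            with barrier_cv:
                state["arrived"] += 1
                barrier_cv.notify_all()
                barrier_cv.wait_for(
                    lambda: state["arrived"] == state["count"])
                passed.append(pid)

    threads = [threading.Thread(target=workgroup) for _ in range(launched)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["count"], sorted(passed)


if __name__ == "__main__":
    count, ids = run_discovery(launched=16, occupancy=4)
    print(f"discovered {count} participating workgroups: {ids}")
```

In this model no slot is released before the poll closes, so the discovered count can never exceed the simulated occupancy, and the barrier only ever waits on workgroups that are already resident. The count may still be lower than the true occupancy, which is consistent with the abstract's description of a safe estimate.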

Index Terms

Portable inter-workgroup barrier synchronisation for GPUs
1. Computing methodologies
  1. Concurrent computing methodologies
    1. Concurrent programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Concurrent programming languages

Published In

OOPSLA 2016: Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications

October 2016

915 pages

ISBN:9781450344449

DOI:10.1145/2983990

General Chair: Eelco Visser, Delft University of Technology, Netherlands

Program Chair: Yannis Smaragdakis, University of Athens, Greece

ACM SIGPLAN Notices Volume 51, Issue 10
OOPSLA '16
October 2016
915 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/3022671
Editor: Matthew Fluet

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

In-Cooperation

SIGAda: ACM Special Interest Group on Ada Programming Language

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SPLASH '16

Sponsor:

SIGPLAN

SPLASH '16: Conference on Systems, Programming, Languages, and Applications: Software for Humanity

November 2 - 4, 2016

Amsterdam, Netherlands

Acceptance Rates

Overall Acceptance Rate 268 of 1,244 submissions, 22%
