research-article

Public Access

Effective resource management for enhancing performance of 2D and 3D stencils on GPUs

Authors:

Prashant Singh Rawat,

Mahesh Ravishankar,

Louis-Noël Pouchet,

P. SadayappanAuthors Info & Claims

GPGPU '16: Proceedings of the 9th Annual Workshop on General Purpose Processing using Graphics Processing Unit

Pages 92 - 102

https://doi.org/10.1145/2884045.2884047

Published: 12 March 2016 Publication History

Abstract

GPUs are an attractive target for data parallel stencil computations prevalent in scientific computing and image processing applications. Many tiling schemes, such as overlapped tiling and split tiling, have been proposed in past to improve the performance of stencil computations. While effective for 2D stencils, these techniques do not achieve the desired improvements for 3D stencils due to the hardware constraints of GPU.

A major challenge in optimizing stencil computations is to effectively utilize all resources available on the GPU. In this paper we develop a tiling strategy that makes better use of resources like shared memory and register file available on the hardware. We present a systematic methodology to reason about which strategy should be employed for a given stencil and also discuss implementation choices that have a significant effect on the achieved performance. Applying these techniques to various 2D and 3D stencils gives a performance improvement of 200-400% over existing tools that target such computations.

References

[1]

M. Christen, O. Schenk, and H. Burkhart. PATUS: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures. In Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium, IPDPS '11, pages 676--687. IEEE Computer Society, 2011.

Digital Library

[2]

K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and K. Yelick. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC '08, pages 4:1--4:12. IEEE Press, 2008.

Digital Library

[3]

T. Grosser, A. Cohen, J. Holewinski, P. Sadayappan, and S. Verdoolaege. Hybrid hexagonal/classical tiling for GPUs. In Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO '14, pages 66:66--66:75. ACM, 2014.

Digital Library

[4]

T. Grosser, A. Cohen, P. H. J. Kelly, J. Ramanujam, P. Sadayappan, and S. Verdoolaege. Split tiling for GPUs: Automatic parallelization using trapezoidal tiles. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, GPGPU-6, pages 24--31. ACM, 2013.

Digital Library

[5]

J. Holewinski, L.-N. Pouchet, and P. Sadayappan. High-performance code generation for stencil computations on GPU architectures. In Proceedings of the 26th ACM International Conference on Supercomputing, ICS '12, pages 311--320. ACM, 2012.

Digital Library

[6]

S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. In Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '07, pages 235--244. ACM, 2007.

Digital Library

[7]

P. Micikevicius. 3D finite difference computation on GPUs using CUDA. In Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-2, pages 79--84. ACM, 2009.

Digital Library

[8]

R. T. Mullapudi, V. Vasista, and U. Bondhugula. Polymage: Automatic optimization for image processing pipelines. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '15, pages 429--443. ACM, 2015.

Digital Library

[9]

A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey. 3.5-D blocking optimization for stencil computations on modern CPUs and GPUs. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC '10, pages 1--13. IEEE Computer Society, 2010.

Digital Library

[10]

J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '13, pages 519--530. ACM, 2013.

Digital Library

[11]

M. Ravishankar, J. Holewinski, and V. Grover. Forma: A DSL for image processing applications to target GPUs and multi-core CPUs. In Proceedings of the 8th Workshop on General Purpose Processing Using GPUs, GPGPU-8, pages 109--120. ACM, 2015.

Digital Library

[12]

M. Ravishankar, P. Micikevicius, and V. Grover. Fusing convolution kernels through tiling. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, ARRAY 2015, pages 43--48. ACM, 2015.

Digital Library

[13]

S. Verdoolaege, J. C. Juega, A. Cohen, J. I. Gómez, C. Tenllado, and F. Catthoor. Polyhedral parallel code generation for CUDA. ACM TACO, 9(4):54:1--54:23, Jan. 2013.

Digital Library

[14]

V. Volkov and J. W. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC '08, pages 31:1--31:11. IEEE Press, 2008.

Digital Library

[15]

Y. Zhang and F. Mueller. Auto-generation and auto-tuning of 3D stencil codes on GPU clusters. In Proceedings of the Tenth International Symposium on Code Generation and Optimization, CGO '12, pages 155--164. ACM, 2012.

Digital Library

Cited By

Chen YLi KWang YBai DWang LMa LYuan LZhang YCao TYang MLee IChabbi MSteuwer M(2024)ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor CoresProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638476(333-347)Online publication date: 2-Mar-2024
https://dl.acm.org/doi/10.1145/3627535.3638476
Zhang LWahib MChen PMeng JWang XEndo TMatsuoka SGallivan KNikolopoulos DBeivide RGallopoulos E(2023)Revisiting Temporal Blocking Stencil OptimizationsProceedings of the 37th International Conference on Supercomputing10.1145/3577193.3593716(251-263)Online publication date: 21-Jun-2023
https://dl.acm.org/doi/10.1145/3577193.3593716
Zhang LWahib MChen PMeng JWang XEndo TMatsuoka SGallivan KNikolopoulos DBeivide RGallopoulos E(2023)PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU ApplicationsProceedings of the 37th International Conference on Supercomputing10.1145/3577193.3593705(167-179)Online publication date: 21-Jun-2023
https://dl.acm.org/doi/10.1145/3577193.3593705
Show More Cited By

Index Terms

Effective resource management for enhancing performance of 2D and 3D stencils on GPUs
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Recommendations

A coordinated tiling and batching framework for efficient GEMM on GPUs
PPoPP '19: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming

General matrix multiplication (GEMM) plays a paramount role in a broad range of domains such as deep learning, scientific computing, and image processing. The primary optimization method is to partition the matrix into many tiles and exploit the ...
Stencil computations on heterogeneous platforms for the Jacobi method: GPUs versus Cell BE

We are witnessing the consolidation of the heterogeneous computing in parallel computing with architectures such as Cell Broadband Engine (Cell BE) or Graphics Processing Units (GPUs) which are present in a myriad of developments for high performance ...
Accelerating Single Iteration Performance of CUDA-Based 3D Reaction---Diffusion Simulations

The most commonly used approach for solving reaction---diffusion systems relies upon stencil computations. Although stencil computations feature low compute intensity, they place high demands on memory bandwidth. Fortunately, GPU computing allows for ...

Comments

Information & Contributors

Information

Published In

cover image ACM Other conferences

GPGPU '16: Proceedings of the 9th Annual Workshop on General Purpose Processing using Graphics Processing Unit

March 2016

107 pages

ISBN:9781450341950

DOI:10.1145/2884045

Conference Chairs:
David Kaeli,
John Cavazos

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 March 2016

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

Conference

PPoPP '16

PPoPP '16: 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

March 12, 2016

Barcelona, Spain

Acceptance Rates

GPGPU '16 Paper Acceptance Rate 9 of 23 submissions, 39%;

Overall Acceptance Rate 57 of 129 submissions, 44%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

16
Total Citations
View Citations
753
Total Downloads

Downloads (Last 12 months)114
Downloads (Last 6 weeks)15

Reflects downloads up to 18 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Chen YLi KWang YBai DWang LMa LYuan LZhang YCao TYang MLee IChabbi MSteuwer M(2024)ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor CoresProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638476(333-347)Online publication date: 2-Mar-2024
https://dl.acm.org/doi/10.1145/3627535.3638476
Zhang LWahib MChen PMeng JWang XEndo TMatsuoka SGallivan KNikolopoulos DBeivide RGallopoulos E(2023)Revisiting Temporal Blocking Stencil OptimizationsProceedings of the 37th International Conference on Supercomputing10.1145/3577193.3593716(251-263)Online publication date: 21-Jun-2023
https://dl.acm.org/doi/10.1145/3577193.3593716
Zhang LWahib MChen PMeng JWang XEndo TMatsuoka SGallivan KNikolopoulos DBeivide RGallopoulos E(2023)PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU ApplicationsProceedings of the 37th International Conference on Supercomputing10.1145/3577193.3593705(167-179)Online publication date: 21-Jun-2023
https://dl.acm.org/doi/10.1145/3577193.3593705
Liu XLiu YYang HLiao JLi MLuan ZQian DRauchwerger LCameron KNikolopoulos DPnevmatikatos D(2022)Toward accelerated stencil computation by adapting tensor core unit on GPUProceedings of the 36th ACM International Conference on Supercomputing10.1145/3524059.3532392(1-12)Online publication date: 28-Jun-2022
https://dl.acm.org/doi/10.1145/3524059.3532392
Anjum OAlmasri Mde Gonzalo SHwu W(2022)An efficient GPU implementation and scaling for higher-order 3D stencilsInformation Sciences: an International Journal10.1016/j.ins.2021.11.042586:C(326-343)Online publication date: 1-Mar-2022
https://dl.acm.org/doi/10.1016/j.ins.2021.11.042
Sun QLiu YYang HJiang ZLiu XDun MLuan ZQian D(2021)csTuner: Scalable Auto-tuning Framework for Complex Stencil Computation on GPUs2021 IEEE International Conference on Cluster Computing (CLUSTER)10.1109/Cluster48925.2021.00037(192-203)Online publication date: Sep-2021
https://doi.org/10.1109/Cluster48925.2021.00037
Date PPotok T(2021)Adiabatic quantum linear regressionScientific Reports10.1038/s41598-021-01445-611:1Online publication date: 9-Nov-2021
https://doi.org/10.1038/s41598-021-01445-6
Rasch ASchulze RGorlatch S(2019)Generating Portable High-Performance Code via Multi-Dimensional HomomorphismsProceedings of the International Conference on Parallel Architectures and Compilation Techniques10.1109/PACT.2019.00035(353-368)Online publication date: 23-Sep-2019
https://dl.acm.org/doi/10.1109/PACT.2019.00035
Rawat PVaidya MSukumaran-Rajam ARountev APouchet LSadayappan P(2019)On Optimizing Complex Stencils on GPUs2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS.2019.00073(641-652)Online publication date: May-2019
https://doi.org/10.1109/IPDPS.2019.00073
Anjum OSimon GHidayetoglu MHwu W(2019)An Efficient GPU Implementation Technique for Higher-Order 3D Stencils2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS)10.1109/HPCC/SmartCity/DSS.2019.00086(552-561)Online publication date: Aug-2019
https://doi.org/10.1109/HPCC/SmartCity/DSS.2019.00086
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Table of Contents