tutorial

Hybrid Hexagonal/Classical Tiling for GPUs

Authors:

Tobias Grosser,

Justin Holewinski,

Sven VerdoolaegeAuthors Info & Claims

CGO '14: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization

Pages 66 - 75

https://doi.org/10.1145/2581122.2544160

Published: 15 February 2014 Publication History

Abstract

Time-tiling is necessary for the efficient execution of iterative stencil computations. Classical hyper-rectangular tiles cannot be used due to the combination of backward and forward dependences along space dimensions. Existing techniques trade temporal data reuse for inefficiencies in other areas, such as load imbalance, redundant computations, or increased control flow overhead, therefore making it challenging for use with GPUs.

We propose a time-tiling method for iterative stencil computations on GPUs. Our method does not involve redundant computations. It favors coalesced global-memory accesses, data reuse in local/shared-memory or cache, avoidance of thread divergence, and concurrency, combining hexagonal tile shapes along the time and one spatial dimension with classical tiling along the other spatial dimensions. Hexagonal tiles expose multi-level parallelism as well as data reuse. Experimental results demonstrate significant performance improvements over existing stencil compilers.

References

[1]

M. Amini, B. Creusillet, S. Even, R. Keryell, O. Goubier, S. Guelton, J. O. McMahon, F.-X. Pasquier, G. Péan, P. Villalon, et al. Par4All: From convex array regions to heterogeneous computing. In IMPACT, 2012.

[2]

V. Bandishti, I. Pananilath, and U. Bondhugula. Tiling stencil computations to maximize parallelism. In Supercomputing, page 40. IEEE Computer Society Press, 2012.

Digital Library

[3]

U. Bondhugula, J. Ramanujam, and et al. PLuTo: A practical and fully automatic polyhedral program optimization system. In PLDI, 2008.

Digital Library

[4]

M. Christen, O. Schenk, and H. Burkhart. Patus: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures. In IPDPS, 2011.

Digital Library

[5]

P. Feautrier. Dataflow analysis of array and scalar references. International Journal of Parallel Programming, 20(1):23--53, 1991.

Digital Library

[6]

P. Feautrier. The Data Parallel Programming Model, volume 1132 of LNCS, chapter Automatic Parallelization in the Polytope Model, pages 79--100. Springer, 1996.

Digital Library

[7]

T. Grosser, A. Cohen, P. H. Kelly, J. Ramanujam, P. Sadayappan, and S. Verdoolaege. Split tiling for GPUs: automatic parallelization using trapezoidal tiles. In GPGPU-6, pages 24--31. ACM, 2013.

Digital Library

[8]

T. Grosser, A. Grösslinger, and C. Lengauer. Polly -- performing polyhedral optimizations on a low-level intermediate representation. Parallel Processing Letters, 22(04):1250010, 2012.

[9]

T. Grosser, S. Verdoolaege, A. Cohen, and P. Sadayappan. The relation between diamond tiling and hexagonal tiling. In 1st Int. Workshop on High-Performance Stencil Computations (HiStencils 2014), Vienna, Austria, Jan. 2014.

[10]

T. Henretty, R. Veras, F. Franchetti, L.-N. Pouchet, J. Ramanujam, and P. Sadayappan. A stencil compiler for short-vector simd architectures. In ICS. ACM, 2013.

Digital Library

[11]

J. Holewinski, L.-N. Pouchet, and P. Sadayappan. High-performance code generation for stencil computations on GPU architectures. In ICS, 2012.

Digital Library

[12]

F. Irigoin and R. Triolet. Supernode partitioning. In POPL, pages 319--328, San Diego, CA, Jan. 1988.

Digital Library

[13]

S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. In PLDI, pages 235--244, 2007.

Digital Library

[14]

A. Leung, N. Vasilache, B. Meister, M. Baskaran, D. Wohlford, C. Bastoul, and R. Lethin. A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction. In GPGPU-3, 2010.

Digital Library

[15]

J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In ACM SIGPLAN Conference on Programming Language Design and Implementation, Seattle, WA, June 2013.

Digital Library

[16]

M. Ravishankar, J. Eisenlohr, L.-N. Pouchet, J. Ramanujam, A. Rountev, and P. Sadayappan. Code generation for parallel execution of a class of irregular loops on distributed memory systems. In Supercomputing, pages 1--11, 2012.

Digital Library

[17]

G. Smith. Numerical Solution of Partial Differential Equations: Finite Difference Methods. Oxford University Press, 2004.

[18]

A. Taflove. Computational electrodynamics: The Finite-difference time-domain method. Artech House, 1995.

[19]

Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson. The Pochoir stencil compiler. In SPAA, pages 117--128. ACM, 2011.

Digital Library

[20]

N. Vasilache, C. Bastoul, and A. Cohen. Polyhedral code generation in the real world. In Proceedings of the International Conf. on Compiler Construction (ETAPS CC), Vienna, Austria, Mar. 2006. Springer.

Digital Library

[21]

N. Vasilache, B. Meister, M. Baskaran, and R. Lethin. Joint scheduling and layout optimization to enable multi-level vectorization. In IMPACT, Paris, France, Jan. 2012.

[22]

S. Verdoolaege. isl: An integer set library for the polyhedral model. In Mathematical Software--ICMS 2010, pages 299--302. Springer, 2010.

Digital Library

[23]

S. Verdoolaege, J. Carlos Juega, A. Cohen, J. Ignacio Gómez, C. Tenllado, and F. Catthoor. Polyhedral parallel code generation for CUDA. ACM TACO, 9(4):54, 2013.

Digital Library

[24]

S. Verdoolaege and T. Grosser. Polyhedral extraction tool. In IMPACT, 2012.

[25]

S. Verdoolaege, R. Seghir, K. Beyls, V. Loechner, and M. Bruynooghe. Counting integer points in parametric polytopes using Barvinok's rational functions. Algorithmica, 48(1):37--66, June 2007.

Digital Library

[26]

M. Wolfe. High Performance Compilers for Parallel Computing. Addison Wesley, 1996.

Digital Library

Cited By

Plotnitskii PBeaurepaire LQu LAkbudak KLtaief HKeyes DEvans KSchenk O(2024)Leveraging the High Bandwidth of Last-Level Cache for HPC Seismic Imaging ApplicationsProceedings of the Platform for Advanced Scientific Computing Conference10.1145/3659914.3659936(1-13)Online publication date: 3-Jun-2024
https://dl.acm.org/doi/10.1145/3659914.3659936
Dolp RHanika JDachsbacher C(2024)A Fast GPU Schedule For À-Trous Wavelet-Based DenoisersProceedings of the ACM on Computer Graphics and Interactive Techniques10.1145/36512997:1(1-18)Online publication date: 13-May-2024
https://dl.acm.org/doi/10.1145/3651299
Sun QLiu YYang HJiang ZLuan ZQian D(2024)Adaptive Auto-Tuning Framework for Global Exploration of Stencil Optimization on GPUsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2023.332563035:1(20-33)Online publication date: Jan-2024
https://doi.org/10.1109/TPDS.2023.3325630
Show More Cited By

Index Terms

Hybrid Hexagonal/Classical Tiling for GPUs
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Recommendations

Split tiling for GPUs: automatic parallelization using trapezoidal tiles
GPGPU-6: Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units

Tiling is a key technique to enhance data reuse. For computations structured as one sequential outer "time" loop enclosing a set of parallel inner loops, tiling only the parallel inner loops may not enable enough data reuse in the cache. Tiling the ...
Hybrid Hexagonal/Classical Tiling for GPUs
CGO '14: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization

Time-tiling is necessary for the efficient execution of iterative stencil computations. Classical hyper-rectangular tiles cannot be used due to the combination of backward and forward dependences along space dimensions. Existing techniques trade ...
Best-effort semantic document search on GPUs
GPGPU-3: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units

Semantic indexing is a popular technique used to access and organize large amounts of unstructured text data. We describe an optimized implementation of semantic indexing and document search on manycore GPU platforms. We observed that a parallel ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

CGO '14: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization

February 2014

328 pages

ISBN:9781450326704

DOI:10.1145/2581122

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 February 2014

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Tutorial
Research
Refereed limited

Conference

CGO '14

Sponsor:

CGO '14: 12th Annual IEEE/ACM International Symposium on Code Generation and Optimization

February 15 - 19, 2014

FL, Orlando, USA

Acceptance Rates

CGO '14 Paper Acceptance Rate 29 of 100 submissions, 29%;

Overall Acceptance Rate 312 of 1,061 submissions, 29%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

64
Total Citations
View Citations
694
Total Downloads

Downloads (Last 12 months)20
Downloads (Last 6 weeks)1

Reflects downloads up to 19 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Plotnitskii PBeaurepaire LQu LAkbudak KLtaief HKeyes DEvans KSchenk O(2024)Leveraging the High Bandwidth of Last-Level Cache for HPC Seismic Imaging ApplicationsProceedings of the Platform for Advanced Scientific Computing Conference10.1145/3659914.3659936(1-13)Online publication date: 3-Jun-2024
https://dl.acm.org/doi/10.1145/3659914.3659936
Dolp RHanika JDachsbacher C(2024)A Fast GPU Schedule For À-Trous Wavelet-Based DenoisersProceedings of the ACM on Computer Graphics and Interactive Techniques10.1145/36512997:1(1-18)Online publication date: 13-May-2024
https://dl.acm.org/doi/10.1145/3651299
Sun QLiu YYang HJiang ZLuan ZQian D(2024)Adaptive Auto-Tuning Framework for Global Exploration of Stencil Optimization on GPUsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2023.332563035:1(20-33)Online publication date: Jan-2024
https://doi.org/10.1109/TPDS.2023.3325630
Haichour ABenfriha K(2024)Empowering Real-Time IoT Applications: A Brief Review on Leveraging GPU Acceleration for Latency ReductionInternet of Things. 7th IFIPIoT 2024 International IFIP WG 5.5 Workshops10.1007/978-3-031-82065-6_8(107-120)Online publication date: 29-Dec-2024
https://doi.org/10.1007/978-3-031-82065-6_8
Zhao JXu JDi PNie WHu JYi YYang SGeng ZZhang RLi BGan ZJin X(2023)Modeling the Interplay between Loop Tiling and Fusion in Optimizing Compilers Using Affine RelationsACM Transactions on Computer Systems10.1145/363530541:1-4(1-45)Online publication date: 1-Dec-2023
https://dl.acm.org/doi/10.1145/3635305
Qu LAbdelkhalak RLtaief HSaid IKeyes D(2022)Exploiting temporal data reuse and asynchrony in the reverse time migrationThe International Journal of High Performance Computing Applications10.1177/1094342022112852937:2(132-150)Online publication date: 3-Oct-2022
https://doi.org/10.1177/10943420221128529
Bielecki WPalkowski M(2021)Space-Time Loop Tiling for Dynamic Programming CodesElectronics10.3390/electronics1018223310:18(2233)Online publication date: 12-Sep-2021
https://doi.org/10.3390/electronics10182233
Gysi TMüller CZinenko OHerhut SDavis EWicky TFuhrer OHoefler TGrosser T(2021)Domain-Specific Multi-Level IR Rewriting for GPUACM Transactions on Architecture and Code Optimization10.1145/346903018:4(1-23)Online publication date: 3-Sep-2021
https://dl.acm.org/doi/10.1145/3469030
Jangda AGuha ASarkar VKim H(2020)Model-Based Warp Overlapped Tiling for Image Processing Programs on GPUsProceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques10.1145/3410463.3414649(317-328)Online publication date: 30-Sep-2020
https://dl.acm.org/doi/10.1145/3410463.3414649
Korch MWerner T(2020)An in‐depth introduction of multi‐workgroup tiling for improving the locality of explicit one‐step methods for ODE systems with limited access distance on GPUsConcurrency and Computation: Practice and Experience10.1002/cpe.601633:11Online publication date: 22-Sep-2020
https://doi.org/10.1002/cpe.6016
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten