skip to main content
10.1145/2581122.2544160acmconferencesArticle/Chapter ViewAbstractPublication PagescgoConference Proceedingsconference-collections
tutorial

Hybrid Hexagonal/Classical Tiling for GPUs

Published: 15 February 2014 Publication History

Abstract

Time-tiling is necessary for the efficient execution of iterative stencil computations. Classical hyper-rectangular tiles cannot be used due to the combination of backward and forward dependences along space dimensions. Existing techniques trade temporal data reuse for inefficiencies in other areas, such as load imbalance, redundant computations, or increased control flow overhead, therefore making it challenging for use with GPUs.
We propose a time-tiling method for iterative stencil computations on GPUs. Our method does not involve redundant computations. It favors coalesced global-memory accesses, data reuse in local/shared-memory or cache, avoidance of thread divergence, and concurrency, combining hexagonal tile shapes along the time and one spatial dimension with classical tiling along the other spatial dimensions. Hexagonal tiles expose multi-level parallelism as well as data reuse. Experimental results demonstrate significant performance improvements over existing stencil compilers.

References

[1]
M. Amini, B. Creusillet, S. Even, R. Keryell, O. Goubier, S. Guelton, J. O. McMahon, F.-X. Pasquier, G. Péan, P. Villalon, et al. Par4All: From convex array regions to heterogeneous computing. In IMPACT, 2012.
[2]
V. Bandishti, I. Pananilath, and U. Bondhugula. Tiling stencil computations to maximize parallelism. In Supercomputing, page 40. IEEE Computer Society Press, 2012.
[3]
U. Bondhugula, J. Ramanujam, and et al. PLuTo: A practical and fully automatic polyhedral program optimization system. In PLDI, 2008.
[4]
M. Christen, O. Schenk, and H. Burkhart. Patus: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures. In IPDPS, 2011.
[5]
P. Feautrier. Dataflow analysis of array and scalar references. International Journal of Parallel Programming, 20(1):23--53, 1991.
[6]
P. Feautrier. The Data Parallel Programming Model, volume 1132 of LNCS, chapter Automatic Parallelization in the Polytope Model, pages 79--100. Springer, 1996.
[7]
T. Grosser, A. Cohen, P. H. Kelly, J. Ramanujam, P. Sadayappan, and S. Verdoolaege. Split tiling for GPUs: automatic parallelization using trapezoidal tiles. In GPGPU-6, pages 24--31. ACM, 2013.
[8]
T. Grosser, A. Grösslinger, and C. Lengauer. Polly -- performing polyhedral optimizations on a low-level intermediate representation. Parallel Processing Letters, 22(04):1250010, 2012.
[9]
T. Grosser, S. Verdoolaege, A. Cohen, and P. Sadayappan. The relation between diamond tiling and hexagonal tiling. In 1st Int. Workshop on High-Performance Stencil Computations (HiStencils 2014), Vienna, Austria, Jan. 2014.
[10]
T. Henretty, R. Veras, F. Franchetti, L.-N. Pouchet, J. Ramanujam, and P. Sadayappan. A stencil compiler for short-vector simd architectures. In ICS. ACM, 2013.
[11]
J. Holewinski, L.-N. Pouchet, and P. Sadayappan. High-performance code generation for stencil computations on GPU architectures. In ICS, 2012.
[12]
F. Irigoin and R. Triolet. Supernode partitioning. In POPL, pages 319--328, San Diego, CA, Jan. 1988.
[13]
S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. In PLDI, pages 235--244, 2007.
[14]
A. Leung, N. Vasilache, B. Meister, M. Baskaran, D. Wohlford, C. Bastoul, and R. Lethin. A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction. In GPGPU-3, 2010.
[15]
J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In ACM SIGPLAN Conference on Programming Language Design and Implementation, Seattle, WA, June 2013.
[16]
M. Ravishankar, J. Eisenlohr, L.-N. Pouchet, J. Ramanujam, A. Rountev, and P. Sadayappan. Code generation for parallel execution of a class of irregular loops on distributed memory systems. In Supercomputing, pages 1--11, 2012.
[17]
G. Smith. Numerical Solution of Partial Differential Equations: Finite Difference Methods. Oxford University Press, 2004.
[18]
A. Taflove. Computational electrodynamics: The Finite-difference time-domain method. Artech House, 1995.
[19]
Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson. The Pochoir stencil compiler. In SPAA, pages 117--128. ACM, 2011.
[20]
N. Vasilache, C. Bastoul, and A. Cohen. Polyhedral code generation in the real world. In Proceedings of the International Conf. on Compiler Construction (ETAPS CC), Vienna, Austria, Mar. 2006. Springer.
[21]
N. Vasilache, B. Meister, M. Baskaran, and R. Lethin. Joint scheduling and layout optimization to enable multi-level vectorization. In IMPACT, Paris, France, Jan. 2012.
[22]
S. Verdoolaege. isl: An integer set library for the polyhedral model. In Mathematical Software--ICMS 2010, pages 299--302. Springer, 2010.
[23]
S. Verdoolaege, J. Carlos Juega, A. Cohen, J. Ignacio Gómez, C. Tenllado, and F. Catthoor. Polyhedral parallel code generation for CUDA. ACM TACO, 9(4):54, 2013.
[24]
S. Verdoolaege and T. Grosser. Polyhedral extraction tool. In IMPACT, 2012.
[25]
S. Verdoolaege, R. Seghir, K. Beyls, V. Loechner, and M. Bruynooghe. Counting integer points in parametric polytopes using Barvinok's rational functions. Algorithmica, 48(1):37--66, June 2007.
[26]
M. Wolfe. High Performance Compilers for Parallel Computing. Addison Wesley, 1996.

Cited By

View all
  • (2024)Leveraging the High Bandwidth of Last-Level Cache for HPC Seismic Imaging ApplicationsProceedings of the Platform for Advanced Scientific Computing Conference10.1145/3659914.3659936(1-13)Online publication date: 3-Jun-2024
  • (2024)A Fast GPU Schedule For À-Trous Wavelet-Based DenoisersProceedings of the ACM on Computer Graphics and Interactive Techniques10.1145/36512997:1(1-18)Online publication date: 13-May-2024
  • (2024)Adaptive Auto-Tuning Framework for Global Exploration of Stencil Optimization on GPUsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2023.332563035:1(20-33)Online publication date: Jan-2024
  • Show More Cited By

Index Terms

  1. Hybrid Hexagonal/Classical Tiling for GPUs

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    CGO '14: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
    February 2014
    328 pages
    ISBN:9781450326704
    DOI:10.1145/2581122
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 February 2014

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. CUDA
    2. GPGPU
    3. Polyhedral compilation
    4. code generation
    5. loop transformations
    6. stencil
    7. time tiling

    Qualifiers

    • Tutorial
    • Research
    • Refereed limited

    Conference

    CGO '14

    Acceptance Rates

    CGO '14 Paper Acceptance Rate 29 of 100 submissions, 29%;
    Overall Acceptance Rate 312 of 1,061 submissions, 29%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)20
    • Downloads (Last 6 weeks)1
    Reflects downloads up to 19 Feb 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Leveraging the High Bandwidth of Last-Level Cache for HPC Seismic Imaging ApplicationsProceedings of the Platform for Advanced Scientific Computing Conference10.1145/3659914.3659936(1-13)Online publication date: 3-Jun-2024
    • (2024)A Fast GPU Schedule For À-Trous Wavelet-Based DenoisersProceedings of the ACM on Computer Graphics and Interactive Techniques10.1145/36512997:1(1-18)Online publication date: 13-May-2024
    • (2024)Adaptive Auto-Tuning Framework for Global Exploration of Stencil Optimization on GPUsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2023.332563035:1(20-33)Online publication date: Jan-2024
    • (2024)Empowering Real-Time IoT Applications: A Brief Review on Leveraging GPU Acceleration for Latency ReductionInternet of Things. 7th IFIPIoT 2024 International IFIP WG 5.5 Workshops10.1007/978-3-031-82065-6_8(107-120)Online publication date: 29-Dec-2024
    • (2023)Modeling the Interplay between Loop Tiling and Fusion in Optimizing Compilers Using Affine RelationsACM Transactions on Computer Systems10.1145/363530541:1-4(1-45)Online publication date: 1-Dec-2023
    • (2022)Exploiting temporal data reuse and asynchrony in the reverse time migrationThe International Journal of High Performance Computing Applications10.1177/1094342022112852937:2(132-150)Online publication date: 3-Oct-2022
    • (2021)Space-Time Loop Tiling for Dynamic Programming CodesElectronics10.3390/electronics1018223310:18(2233)Online publication date: 12-Sep-2021
    • (2021)Domain-Specific Multi-Level IR Rewriting for GPUACM Transactions on Architecture and Code Optimization10.1145/346903018:4(1-23)Online publication date: 3-Sep-2021
    • (2020)Model-Based Warp Overlapped Tiling for Image Processing Programs on GPUsProceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques10.1145/3410463.3414649(317-328)Online publication date: 30-Sep-2020
    • (2020)An in‐depth introduction of multi‐workgroup tiling for improving the locality of explicit one‐step methods for ODE systems with limited access distance on GPUsConcurrency and Computation: Practice and Experience10.1002/cpe.601633:11Online publication date: 22-Sep-2020
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media