Split tiling for GPUs: automatic parallelization using trapezoidal tiles

Authors:
Tobias Grosser (École Normale Supérieure)
Albert Cohen (École Normale Supérieure)
Paul H. J. Kelly (Imperial College London)
J. Ramanujam (Louisiana State University)
P. Sadayappan (Ohio State University)
Sven Verdoolaege (École Normale Supérieure)

GPGPU-6: Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, March 2013, Pages 24–31. https://doi.org/10.1145/2458523.2458526

Published: 16 March 2013

ABSTRACT

Tiling is a key technique to enhance data reuse. For computations structured as one sequential outer "time" loop enclosing a set of parallel inner loops, tiling only the parallel inner loops may not enable enough data reuse in the cache. Tiling the inner loops along with the outer time loop enhances data locality but may require other transformations like loop skewing that inhibit inter-tile parallelism.
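
As a concrete picture of the loop structure described above, here is a minimal Python sketch of a 1D three-point stencil: one sequential "time" loop enclosing an inner loop whose iterations are independent of each other. The function name `jacobi_1d`, the averaging update, and the fixed boundary cells are illustrative assumptions and are not taken from the paper.

```python
def jacobi_1d(initial, steps):
    """Untiled 1D three-point stencil (illustrative sketch).

    The outer loop over time must run sequentially. The inner loop over
    space only reads `cur` and only writes `nxt`, so its iterations are
    independent and could run in parallel.
    """
    cur = list(initial)
    for _ in range(steps):                   # sequential "time" loop
        nxt = list(cur)                      # boundary cells stay fixed
        for i in range(1, len(cur) - 1):     # parallel inner loop
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur = nxt
    return cur
```

Each time step sweeps the whole array once, so when the array is larger than the cache, a value loaded in one step has typically been evicted before the next step needs it. That is the reuse that tiling along the time dimension tries to recover.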

One approach to tiling that enhances data locality without inhibiting inter-tile parallelism is split tiling, where tiles are subdivided into a sequence of trapezoidal computation steps. In this paper, we develop an approach to generate split tiled code for GPUs in the PPCG polyhedral code generator. We propose a generic algorithm to calculate index-set splitting that enables us to perform tiling for locality and synchronization avoidance, while simultaneously maintaining parallelism, without the need for skewing or redundant computations. Our algorithm performs split tiling for an arbitrary number of dimensions and without the need to construct any large integer linear program. The method and its implementation are evaluated on standard stencil kernels and compared with a state-of-the-art polyhedral compiler and with a domain-specific stencil compiler, both targeting CUDA GPUs.
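
To make the idea of trapezoidal computation steps more tangible, the following Python sketch applies split tiling to a 1D three-point stencil. It is a hand-written illustration under simplifying assumptions, not the algorithm implemented in PPCG: the names `update`, `reference`, and `split_tiled`, the parameters `width` and `height`, the full space-time table, and the serial loops standing in for parallel tiles are all hypothetical.

```python
def update(left, mid, right):
    """Three-point averaging stencil (illustrative choice)."""
    return (left + mid + right) / 3.0


def reference(initial, steps):
    """Untiled sweep, used only to check the tiled version."""
    cur = list(initial)
    for _ in range(steps):
        nxt = list(cur)
        for i in range(1, len(cur) - 1):
            nxt[i] = update(cur[i - 1], cur[i], cur[i + 1])
        cur = nxt
    return cur


def split_tiled(initial, steps, width=8, height=4):
    """1D split tiling: shrinking trapezoids first, then the gaps between them."""
    assert width >= 2 * height, "tiles must be wide enough to shrink"
    n = len(initial)
    # Full space-time table, kept for clarity only. Boundary cells are
    # fixed for all time; interior cells start as None so that reading a
    # value before it is computed would raise an error.
    grid = [list(initial)]
    for _ in range(steps):
        row = [None] * n
        row[0], row[-1] = initial[0], initial[-1]
        grid.append(row)

    def compute(t, lo, hi):
        # Fill time level t + 1 for the interior points in [lo, hi).
        for i in range(max(lo, 1), min(hi, n - 1)):
            grid[t + 1][i] = update(grid[t][i - 1], grid[t][i], grid[t][i + 1])

    tiles = -(-n // width)  # ceil(n / width)
    for t0 in range(0, steps, height):       # sequential loop over time bands
        h = min(height, steps - t0)
        # Phase 1: upright trapezoids, one per tile, shrinking by one
        # point on each side per time step.
        for k in range(tiles):
            for s in range(h):
                compute(t0 + s, k * width + s, (k + 1) * width - s)
        # Phase 2: inverted trapezoids centred on the tile borders,
        # growing by one point on each side per time step.
        for k in range(tiles + 1):
            for s in range(h):
                compute(t0 + s, k * width - s, k * width + s)
    return grid[steps]
```

Within each phase, the iterations of the loop over `k` write disjoint cells and read only values produced earlier, so in this sketch they could be distributed across parallel workers with a single synchronization between the two phases. Every point is computed exactly once, and the loops are not skewed. Calling `split_tiled(data, steps)` should return the same list as `reference(data, steps)`.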

Index Terms

Software and its engineering → Software notations and tools → Compilers

Published in
GPGPU-6: Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
March 2013
156 pages
ISBN:9781450320177
DOI:10.1145/2458523
Editors: John Cavazos (University of Delaware), Xiang Gong, David Kaeli
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
CUDA
GPGPU
code generation
compilers
index set splitting
loop transformations
polyhedral model
stencil
time tiling
Qualifiers
- research-article
Acceptance Rates
GPGPU-6 Paper Acceptance Rate: 15 of 37 submissions, 41%
Overall Acceptance Rate: 57 of 129 submissions, 44%