ABSTRACT
The recent advent of compute-intensive GPU architecture has allowed application developers to explore high-order 3D stencils for better computational accuracy. A common optimization strategy for such stencils is to expose sufficient data reuse by means such as loop unrolling, with the expectation of register-level reuse. However, the resulting code is often highly constrained by register pressure. While current state-of-the-art register allocators are satisfactory for most applications, they are unable to effectively manage register pressure for such complex high-order stencils, resulting in sub-optimal code with a large number of register spills. In this paper, we develop a statement reordering framework that models stencil computations as a DAG of trees with shared leaves, and adapts an optimal scheduling algorithm for minimizing register usage for expression trees. The effectiveness of the approach is demonstrated through experimental results on a range of stencils extracted from application codes.
Supplemental Material
Available for Download
The is the software accompanying the paper "Register Optimizations for Stencils on GPUs" that appears in PPoPP'18. The software proposes a source-to-source register reordering framework that can help alleviate register pressure for high-order stencils on GPUs.The package contains: a. the source code for the reordering framework b. the examples used in the paper in the examples/ directory c. scripts for code installation and benchmarking.
- A. Aho, M. Lam, R. Sethi, and J. Ullman. 2007. Compilers: Principles, Techniques, and Tools (2nd ed). Pearson. Google ScholarDigital Library
- A. V. Aho and S. C. Johnson. 1975. Optimal Code Generation for Expression Trees. In Proceedings of Seventh Annual ACM Symposium on Theory of Computing (STOC '75). ACM, New York, NY, USA, 207--217. Google ScholarDigital Library
- A. V. Aho, S. C. Johnson, and J. D. Ullman. 1977. Code Generation for Expressions with Common Subexpressions. J. ACM 24, 1 (Jan. 1977), 146--160. Google ScholarDigital Library
- A. W. Appel and K. J. Supowit. 1987. Generalization of the Sethi-Ullman Algorithm for Register Allocation. Softw. Pract. Exper. 17, 6 (June 1987), 417--421. Google ScholarDigital Library
- Vinayaka Bandishti, Irshad Pananilath, and Uday Bondhugula. 2012. Tiling Stencil Computations to Maximize Parallelism. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '12). IEEE Computer Society Press, Los Alamitos, CA, USA, Article 40, 11 pages. Google ScholarDigital Library
- P. Basu, M. Hall, S. Williams, B. V. Straalen, L. Oliker, and P. Colella. 2015. Compiler-Directed Transformation for Higher-Order Stencils. In Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International. 313--323. Google ScholarDigital Library
- David A. Berson, Rajiv Gupta, and Mary Lou Soffa. 1999. Integrated Instruction Scheduling and Register Allocation Techniques. In Proceedings of the 11th International Workshop on Languages and Compilers for Parallel Computing (LCPC '98). Springer-Verlag, London, UK, UK, 247--262. Google ScholarDigital Library
- Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A Practical Automatic Polyhedral Parallelizer and Locality Optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08). ACM, New York, NY, USA, 101--113. Google ScholarDigital Library
- Preston Briggs, Keith D. Cooper, and Linda Torczon. 1992. Rematerialization. In Proceedings of the ACM SIGPLAN 1992 Conference on Programming Language Design and Implementation (PLDI '92). ACM, New York, NY, USA, 311--321. Google ScholarDigital Library
- Preston Briggs, Keith D. Cooper, and Linda Torczon. 1994. Improvements to Graph Coloring Register Allocation. ACM Trans. Program. Lang. Syst. 16, 3 (May 1994), 428--455. Google ScholarDigital Library
- G. J. Chaitin. 1982. Register Allocation & Spilling via Graph Coloring. In Proceedings of the 1982 SIGPLAN Symposium on Compiler Construction (SIGPLAN '82). ACM, New York, NY, USA, 98--105. Google ScholarDigital Library
- J. M. Codina, J. Sanchez, and A. Gonzalez. 2001. A unified modulo scheduling and register allocation technique for clustered processors. In Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques. 175--184. Google ScholarDigital Library
- Q. Colombet, B. Boissinot, P. Brisk, S. Hack, and F. Rastello. 2011. Graph-coloring and treescan register allocation using repairing. In 2011 Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES). 45--54. Google ScholarDigital Library
- Raúl De La Cruz, Mauricio Araya-Polo, and José María Cela. 2010. Introducing the Semi-stencil Algorithm. In Proceedings of the 8th International Conference on Parallel Processing and Applied Mathematics: Part I (PPAM'09). Springer-Verlag, Berlin, Heidelberg, 496--506. Google ScholarDigital Library
- Steven J. Deitz, Bradford L. Chamberlain, and Lawrence Snyder. 2001. Eliminating Redundancies in Sum-of-product Array Computations. In Proceedings of the 15th International Conference on Supercomputing (ICS '01). ACM, New York, NY, USA, 65--77. Google ScholarDigital Library
- Lukasz Domagala, Duco van Amstel, Fabrice Rastello, and P. Sadayappan. 2016. Register Allocation and Promotion Through Combined Instruction Scheduling and Loop Unrolling. In Proceedings of the 25th International Conference on Compiler Construction (CC 2016). ACM, New York, NY, USA, 143--151. Google ScholarDigital Library
- ExaCT 2013. ExaCT: Center for Exascale Simulation of Combustion in Turbulence: Proxy App Software. https://exactcodesign.org/proxy-app-software/. (2013).Google Scholar
- M. Frigo and S. G. Johnson. 2005. The Design and Implementation of FFTW3. Proc. IEEE 93, 2 (Feb 2005), 216--231.Google ScholarCross Ref
- Kazushige Goto and Robert A. van de Geijn. 2008. Anatomy of High-performance Matrix Multiplication. ACM Trans. Math. Softw. 34, 3, Article 12 (May 2008), 25 pages. Google ScholarDigital Library
- Ramaswamy Govindarajan, H. Yang, Chihong Zhang, José N. Amaral, and Guang R. Gao. 2001. Minimum Register Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs. In Proceedings of the 15th International Parallel & Distributed Processing Symposium (IPDPS '01). IEEE Computer Society, Washington, DC, USA, 26--33. Google ScholarDigital Library
- Tobias Grosser, Albert Cohen, Justin Holewinski, P. Sadayappan, and Sven Verdoolaege. 2014. Hybrid Hexagonal/Classical Tiling for GPUs. In Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO '14). ACM, Article 66, 10 pages. Google ScholarDigital Library
- Tobias Gysi, Tobias Grosser, and Torsten Hoefler. 2015. MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures. In Proceedings of the 29th ACM on International Conference on Supercomputing (ICS '15). ACM, 177--186. Google ScholarDigital Library
- Mary Hall, Jacqueline Chame, Chun Chen, Jaewook Shin, Gabe Rudy, and Malik Murtaza Khan. 2010. Loop Transformation Recipes for Code Generation and Auto-tuning. In Proceedings of the 22Nd International Conference on Languages and Compilers for Parallel Computing (LCPC'09). Springer-Verlag, Berlin, Heidelberg, 50--64. Google ScholarDigital Library
- Ari B. Hayes, Lingda Li, Daniel Chavarría-Miranda, Shuaiwen Leon Song, and Eddy Z. Zhang. 2016. Orion: A Framework for GPU Occupancy Tuning. In Proceedings of the 17th International Middleware Conference (Middleware '16). ACM, New York, NY, USA, 18:1--18:13. Google ScholarDigital Library
- Tom Henretty, Richard Veras, Franz Franchetti, Louis-Noël Pouchet, J. Ramanujam, and P. Sadayappan. 2013. A Stencil Compiler for Short-vector SIMD Architectures. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing (ICS '13). ACM, 13--24. Google ScholarDigital Library
- HPGMG 2016. High-Performance Geometric Multigrid. https://hpgmg.org/. (2016).Google Scholar
- Hong Jia-Wei and H. T. Kung. 1981. I/O Complexity: The Red-blue Pebble Game. In Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing (STOC '81). ACM, New York, NY, USA, 326--333. Google ScholarDigital Library
- Mengyao Jin, Haohuan Fu, Zihong Lv, and Guangwen Yang. 2016. Libra: An Automated Code Generation and Tuning Framework for Register-limited Stencils on GPUs. In Proceedings of the ACM International Conference on Computing Frontiers (CF '16). ACM, New York, NY, USA, 92--99. Google ScholarDigital Library
- David Ryan Koes and Seth Copen Goldstein. 2006. A Global Progressive Register Allocator. In Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '06). ACM, New York, NY, USA, 204--215. Google ScholarDigital Library
- Stefan Kral, Franz Franchetti, Juergen Lorenz, Christoph W. Ueberhuber, and Peter Wurzinger. 2004. FFT Compiler Techniques. In Compiler Construction: 13th International Conference, CC 2004. Springer Berlin Heidelberg, 217--231.Google Scholar
- Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization (CGO '04). IEEE Computer Society, Washington, DC, USA, 75-. Google ScholarDigital Library
- A. Li, S. L. Song, A. Kumar, E. Z. Zhang, D. Chavarría-Miranda, and H. Corporaal. 2016. Critical points based register-concurrency auto-tuning for GPUs. In 2016 Design, Automation Test in Europe Conference Exhibition (DATE). 1273--1278. Google ScholarDigital Library
- Guei-Yuan Lueh, Thomas Gross, and Ali-Reza Adl-Tabatabai. 2000. Fusion-based Register Allocation. ACM Trans. Program. Lang. Syst. 22, 3 (May 2000), 431--470. Google ScholarDigital Library
- Hanspeter Mössenböck and Michael Pfeiffer. 2002. Linear Scan Register Allocation in the Context of SSA Form and Register Constraints. Springer Berlin Heidelberg, 229--246. Google ScholarDigital Library
- Rajeev Motwani, Krishna V. Palem, Vivek Sarkar, and Salem Reyen. 1995. Combining Register Allocation and Instruction Scheduling. Technical Report. Stanford, CA, USA. Google ScholarDigital Library
- Ravi Teja Mullapudi, Vinay Vasista, and Uday Bondhugula. 2015. Poly-Mage: Automatic Optimization for Image Processing Pipelines. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15). ACM, 429--443. Google ScholarDigital Library
- C. Norris and L. L. Pollock. 1993. A scheduler-sensitive global register allocator. In Supercomputing '93. Proceedings. 804--813. Google ScholarDigital Library
- NVCC 2017. NVIDIA CUDA Compiler Driver NVCC. docs.nvidia.com/cuda/cuda-compiler-driver-nvcc. (2017).Google Scholar
- NVprof 2017. NVIDIA Profiler. http://docs.nvidia.com/cuda/profiler-users-guide. (2017).Google Scholar
- Shlomit S. Pinter. 1993. Register Allocation with Instruction Scheduling. In Proceedings of the ACM SIGPLAN 1993 Conference on Programming Language Design and Implementation (PLDI '93). ACM, New York, NY, USA, 248--257. Google ScholarDigital Library
- Massimiliano Poletto and Vivek Sarkar. 1999. Linear Scan Register Allocation. ACM Trans. Program. Lang. Syst. 21, 5 (Sept. 1999), 895--913. Google ScholarDigital Library
- Fernando Magno Quintão Pereira and Jens Palsberg. 2008. Register Allocation by Puzzle Solving. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08). ACM, New York, NY, USA, 216--226. Google ScholarDigital Library
- Mahesh Ravishankar, Justin Holewinski, and Vinod Grover. 2015. Forma: A DSL for Image Processing Applications to Target GPUs and Multi-core CPUs. In Proc. 8th Workshop on General Purpose Processing Using GPUs. 109--120. Google ScholarDigital Library
- Prashant Singh Rawat, Changwan Hong, Mahesh Ravishankar, Vinod Grover, Louis-Noel Pouchet, Atanas Rountev, and P. Sadayappan. 2016. Resource Conscious Reuse-Driven Tiling for GPUs. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation (PACT '16). ACM, 99--111. Google ScholarDigital Library
- Hongbo Rong. 2009. Tree Register Allocation. In Proceedings of the 42Nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 42). ACM, New York, NY, USA, 67--77. Google ScholarDigital Library
- Vivek Sarkar and Rajkishore Barik. 2007. Extended Linear Scan: An Alternate Foundation for Global Register Allocation. In Proceedings of the 16th International Conference on Compiler Construction (CC'07). Springer-Verlag, Berlin, Heidelberg, 141--155. Google ScholarDigital Library
- Ravi Sethi and J. D. Ullman. 1970. The Generation of Optimal Code for Arithmetic Expressions. J. ACM 17, 4 (Oct. 1970), 715--728. Google ScholarDigital Library
- Michael D. Smith, Norman Ramsey, and Glenn Holloway. 2004. A Generalized Algorithm for Graph-coloring Register Allocation. In Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation (PLDI '04). ACM, New York, NY, USA, 277--288. Google ScholarDigital Library
- Richard M. Stallman and GCC Developer Community. 2009. Using The GNU Compiler Collection: A GNU Manual For GCC Version 4.3.3. CreateSpace, Paramount, CA. Google ScholarDigital Library
- Kevin Stock, Martin Kong, Tobias Grosser, Louis-Noël Pouchet, Fabrice Rastello, J. Ramanujam, and P. Sadayappan. 2014. A Framework for Enhancing Data Reuse via Associative Reordering. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '14). ACM, New York, NY, USA, 65--76. Google ScholarDigital Library
- SW4 2014. Seismic Wave Modelling (SW4) - Computational Infrastructure for Geodynamics. https://geodynamics.org/cig/software/sw4/. (2014).Google Scholar
- Sid Touati and Christine Eisenbeis. 2004. Early Periodic Register Allocation on ILP Processors. Parallel Processing Letters 14, 2 (June 2004), 287--313.Google ScholarCross Ref
- Swapneela Unkule, Christopher Shaltz, and Apan Qasem. 2012. Automatic Restructuring of GPU Kernels for Exploiting Inter-thread Data Locality. In Proceedings of the 21st International Conference on Compiler Construction (CC'12). Springer-Verlag, Berlin, Heidelberg, 21--40. Google ScholarDigital Library
- Mohamed Wahib and Naoya Maruyama. 2014. Scalable Kernel Fusion for Memory-bound GPU Applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '14). IEEE Press, 191--202. Google ScholarDigital Library
- Mohamed Wahib and Naoya Maruyama. 2015. Automated GPU Kernel Transformations in Large-Scale Production Stencil Applications. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '15). ACM, New York, NY, USA, 259--270. Google ScholarDigital Library
- Jian Wang, Andreas Krall, M. Anton Ertl, and Christine Eisenbeis. 1994. Software Pipelining with Register Allocation and Spilling. In Proceedings of the 27th Annual International Symposium on Microarchitecture (MICRO 27). ACM, New York, NY, USA, 95--99. Google ScholarDigital Library
- Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, and Robert Hundt. 2016. gpucc: An Open-source GPGPU Compiler. In Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO '16). 105--116. Google ScholarDigital Library
- Xiaolong Xie, Yun Liang, Xiuhong Li, Yudong Wu, Guangyu Sun, Tao Wang, and Dongrui Fan. 2015. Enabling Coordinated Register Allocation and Thread-level Parallelism Optimization for GPUs. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). ACM, New York, NY, USA, 395--406. Google ScholarDigital Library
- Jingling Xue. 1997. On Tiling as a Loop Transformation. Parallel Processing Letters 07, 04 (1997), 409--424.Google ScholarCross Ref
Index Terms
- Register optimizations for stencils on GPUs
Recommendations
Register optimizations for stencils on GPUs
PPoPP '18The recent advent of compute-intensive GPU architecture has allowed application developers to explore high-order 3D stencils for better computational accuracy. A common optimization strategy for such stencils is to expose sufficient data reuse by means ...
Performance evaluation of OpenMP's target construct on GPUs-exploring compiler optimisations
OpenMP is a directive-based shared memory parallel programming model and has been widely used for many years. From OpenMP 4.0 onwards, GPU platforms are supported by extending OpenMP's high-level parallel abstractions with accelerator programming. This ...
Comments