skip to main content
10.1145/3178487.3178500acmconferencesArticle/Chapter ViewAbstractPublication PagesppoppConference Proceedingsconference-collections

Register optimizations for stencils on GPUs

Published:10 February 2018Publication History

ABSTRACT

The recent advent of compute-intensive GPU architecture has allowed application developers to explore high-order 3D stencils for better computational accuracy. A common optimization strategy for such stencils is to expose sufficient data reuse by means such as loop unrolling, with the expectation of register-level reuse. However, the resulting code is often highly constrained by register pressure. While current state-of-the-art register allocators are satisfactory for most applications, they are unable to effectively manage register pressure for such complex high-order stencils, resulting in sub-optimal code with a large number of register spills. In this paper, we develop a statement reordering framework that models stencil computations as a DAG of trees with shared leaves, and adapts an optimal scheduling algorithm for minimizing register usage for expression trees. The effectiveness of the approach is demonstrated through experimental results on a range of stencils extracted from application codes.

Skip Supplemental Material Section

Supplemental Material

References

  1. A. Aho, M. Lam, R. Sethi, and J. Ullman. 2007. Compilers: Principles, Techniques, and Tools (2nd ed). Pearson. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. A. V. Aho and S. C. Johnson. 1975. Optimal Code Generation for Expression Trees. In Proceedings of Seventh Annual ACM Symposium on Theory of Computing (STOC '75). ACM, New York, NY, USA, 207--217. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. A. V. Aho, S. C. Johnson, and J. D. Ullman. 1977. Code Generation for Expressions with Common Subexpressions. J. ACM 24, 1 (Jan. 1977), 146--160. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. A. W. Appel and K. J. Supowit. 1987. Generalization of the Sethi-Ullman Algorithm for Register Allocation. Softw. Pract. Exper. 17, 6 (June 1987), 417--421. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Vinayaka Bandishti, Irshad Pananilath, and Uday Bondhugula. 2012. Tiling Stencil Computations to Maximize Parallelism. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '12). IEEE Computer Society Press, Los Alamitos, CA, USA, Article 40, 11 pages. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. P. Basu, M. Hall, S. Williams, B. V. Straalen, L. Oliker, and P. Colella. 2015. Compiler-Directed Transformation for Higher-Order Stencils. In Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International. 313--323. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. David A. Berson, Rajiv Gupta, and Mary Lou Soffa. 1999. Integrated Instruction Scheduling and Register Allocation Techniques. In Proceedings of the 11th International Workshop on Languages and Compilers for Parallel Computing (LCPC '98). Springer-Verlag, London, UK, UK, 247--262. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A Practical Automatic Polyhedral Parallelizer and Locality Optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08). ACM, New York, NY, USA, 101--113. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Preston Briggs, Keith D. Cooper, and Linda Torczon. 1992. Rematerialization. In Proceedings of the ACM SIGPLAN 1992 Conference on Programming Language Design and Implementation (PLDI '92). ACM, New York, NY, USA, 311--321. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Preston Briggs, Keith D. Cooper, and Linda Torczon. 1994. Improvements to Graph Coloring Register Allocation. ACM Trans. Program. Lang. Syst. 16, 3 (May 1994), 428--455. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. G. J. Chaitin. 1982. Register Allocation & Spilling via Graph Coloring. In Proceedings of the 1982 SIGPLAN Symposium on Compiler Construction (SIGPLAN '82). ACM, New York, NY, USA, 98--105. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. J. M. Codina, J. Sanchez, and A. Gonzalez. 2001. A unified modulo scheduling and register allocation technique for clustered processors. In Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques. 175--184. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Q. Colombet, B. Boissinot, P. Brisk, S. Hack, and F. Rastello. 2011. Graph-coloring and treescan register allocation using repairing. In 2011 Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES). 45--54. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Raúl De La Cruz, Mauricio Araya-Polo, and José María Cela. 2010. Introducing the Semi-stencil Algorithm. In Proceedings of the 8th International Conference on Parallel Processing and Applied Mathematics: Part I (PPAM'09). Springer-Verlag, Berlin, Heidelberg, 496--506. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Steven J. Deitz, Bradford L. Chamberlain, and Lawrence Snyder. 2001. Eliminating Redundancies in Sum-of-product Array Computations. In Proceedings of the 15th International Conference on Supercomputing (ICS '01). ACM, New York, NY, USA, 65--77. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Lukasz Domagala, Duco van Amstel, Fabrice Rastello, and P. Sadayappan. 2016. Register Allocation and Promotion Through Combined Instruction Scheduling and Loop Unrolling. In Proceedings of the 25th International Conference on Compiler Construction (CC 2016). ACM, New York, NY, USA, 143--151. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. ExaCT 2013. ExaCT: Center for Exascale Simulation of Combustion in Turbulence: Proxy App Software. https://exactcodesign.org/proxy-app-software/. (2013).Google ScholarGoogle Scholar
  18. M. Frigo and S. G. Johnson. 2005. The Design and Implementation of FFTW3. Proc. IEEE 93, 2 (Feb 2005), 216--231.Google ScholarGoogle ScholarCross RefCross Ref
  19. Kazushige Goto and Robert A. van de Geijn. 2008. Anatomy of High-performance Matrix Multiplication. ACM Trans. Math. Softw. 34, 3, Article 12 (May 2008), 25 pages. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Ramaswamy Govindarajan, H. Yang, Chihong Zhang, José N. Amaral, and Guang R. Gao. 2001. Minimum Register Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs. In Proceedings of the 15th International Parallel & Distributed Processing Symposium (IPDPS '01). IEEE Computer Society, Washington, DC, USA, 26--33. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Tobias Grosser, Albert Cohen, Justin Holewinski, P. Sadayappan, and Sven Verdoolaege. 2014. Hybrid Hexagonal/Classical Tiling for GPUs. In Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO '14). ACM, Article 66, 10 pages. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Tobias Gysi, Tobias Grosser, and Torsten Hoefler. 2015. MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures. In Proceedings of the 29th ACM on International Conference on Supercomputing (ICS '15). ACM, 177--186. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Mary Hall, Jacqueline Chame, Chun Chen, Jaewook Shin, Gabe Rudy, and Malik Murtaza Khan. 2010. Loop Transformation Recipes for Code Generation and Auto-tuning. In Proceedings of the 22Nd International Conference on Languages and Compilers for Parallel Computing (LCPC'09). Springer-Verlag, Berlin, Heidelberg, 50--64. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Ari B. Hayes, Lingda Li, Daniel Chavarría-Miranda, Shuaiwen Leon Song, and Eddy Z. Zhang. 2016. Orion: A Framework for GPU Occupancy Tuning. In Proceedings of the 17th International Middleware Conference (Middleware '16). ACM, New York, NY, USA, 18:1--18:13. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Tom Henretty, Richard Veras, Franz Franchetti, Louis-Noël Pouchet, J. Ramanujam, and P. Sadayappan. 2013. A Stencil Compiler for Short-vector SIMD Architectures. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing (ICS '13). ACM, 13--24. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. HPGMG 2016. High-Performance Geometric Multigrid. https://hpgmg.org/. (2016).Google ScholarGoogle Scholar
  27. Hong Jia-Wei and H. T. Kung. 1981. I/O Complexity: The Red-blue Pebble Game. In Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing (STOC '81). ACM, New York, NY, USA, 326--333. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Mengyao Jin, Haohuan Fu, Zihong Lv, and Guangwen Yang. 2016. Libra: An Automated Code Generation and Tuning Framework for Register-limited Stencils on GPUs. In Proceedings of the ACM International Conference on Computing Frontiers (CF '16). ACM, New York, NY, USA, 92--99. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. David Ryan Koes and Seth Copen Goldstein. 2006. A Global Progressive Register Allocator. In Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '06). ACM, New York, NY, USA, 204--215. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Stefan Kral, Franz Franchetti, Juergen Lorenz, Christoph W. Ueberhuber, and Peter Wurzinger. 2004. FFT Compiler Techniques. In Compiler Construction: 13th International Conference, CC 2004. Springer Berlin Heidelberg, 217--231.Google ScholarGoogle Scholar
  31. Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization (CGO '04). IEEE Computer Society, Washington, DC, USA, 75-. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. A. Li, S. L. Song, A. Kumar, E. Z. Zhang, D. Chavarría-Miranda, and H. Corporaal. 2016. Critical points based register-concurrency auto-tuning for GPUs. In 2016 Design, Automation Test in Europe Conference Exhibition (DATE). 1273--1278. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Guei-Yuan Lueh, Thomas Gross, and Ali-Reza Adl-Tabatabai. 2000. Fusion-based Register Allocation. ACM Trans. Program. Lang. Syst. 22, 3 (May 2000), 431--470. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Hanspeter Mössenböck and Michael Pfeiffer. 2002. Linear Scan Register Allocation in the Context of SSA Form and Register Constraints. Springer Berlin Heidelberg, 229--246. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Rajeev Motwani, Krishna V. Palem, Vivek Sarkar, and Salem Reyen. 1995. Combining Register Allocation and Instruction Scheduling. Technical Report. Stanford, CA, USA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Ravi Teja Mullapudi, Vinay Vasista, and Uday Bondhugula. 2015. Poly-Mage: Automatic Optimization for Image Processing Pipelines. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15). ACM, 429--443. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. C. Norris and L. L. Pollock. 1993. A scheduler-sensitive global register allocator. In Supercomputing '93. Proceedings. 804--813. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. NVCC 2017. NVIDIA CUDA Compiler Driver NVCC. docs.nvidia.com/cuda/cuda-compiler-driver-nvcc. (2017).Google ScholarGoogle Scholar
  39. NVprof 2017. NVIDIA Profiler. http://docs.nvidia.com/cuda/profiler-users-guide. (2017).Google ScholarGoogle Scholar
  40. Shlomit S. Pinter. 1993. Register Allocation with Instruction Scheduling. In Proceedings of the ACM SIGPLAN 1993 Conference on Programming Language Design and Implementation (PLDI '93). ACM, New York, NY, USA, 248--257. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Massimiliano Poletto and Vivek Sarkar. 1999. Linear Scan Register Allocation. ACM Trans. Program. Lang. Syst. 21, 5 (Sept. 1999), 895--913. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Fernando Magno Quintão Pereira and Jens Palsberg. 2008. Register Allocation by Puzzle Solving. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08). ACM, New York, NY, USA, 216--226. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Mahesh Ravishankar, Justin Holewinski, and Vinod Grover. 2015. Forma: A DSL for Image Processing Applications to Target GPUs and Multi-core CPUs. In Proc. 8th Workshop on General Purpose Processing Using GPUs. 109--120. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Prashant Singh Rawat, Changwan Hong, Mahesh Ravishankar, Vinod Grover, Louis-Noel Pouchet, Atanas Rountev, and P. Sadayappan. 2016. Resource Conscious Reuse-Driven Tiling for GPUs. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation (PACT '16). ACM, 99--111. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Hongbo Rong. 2009. Tree Register Allocation. In Proceedings of the 42Nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 42). ACM, New York, NY, USA, 67--77. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Vivek Sarkar and Rajkishore Barik. 2007. Extended Linear Scan: An Alternate Foundation for Global Register Allocation. In Proceedings of the 16th International Conference on Compiler Construction (CC'07). Springer-Verlag, Berlin, Heidelberg, 141--155. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Ravi Sethi and J. D. Ullman. 1970. The Generation of Optimal Code for Arithmetic Expressions. J. ACM 17, 4 (Oct. 1970), 715--728. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Michael D. Smith, Norman Ramsey, and Glenn Holloway. 2004. A Generalized Algorithm for Graph-coloring Register Allocation. In Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation (PLDI '04). ACM, New York, NY, USA, 277--288. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Richard M. Stallman and GCC Developer Community. 2009. Using The GNU Compiler Collection: A GNU Manual For GCC Version 4.3.3. CreateSpace, Paramount, CA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Kevin Stock, Martin Kong, Tobias Grosser, Louis-Noël Pouchet, Fabrice Rastello, J. Ramanujam, and P. Sadayappan. 2014. A Framework for Enhancing Data Reuse via Associative Reordering. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '14). ACM, New York, NY, USA, 65--76. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. SW4 2014. Seismic Wave Modelling (SW4) - Computational Infrastructure for Geodynamics. https://geodynamics.org/cig/software/sw4/. (2014).Google ScholarGoogle Scholar
  52. Sid Touati and Christine Eisenbeis. 2004. Early Periodic Register Allocation on ILP Processors. Parallel Processing Letters 14, 2 (June 2004), 287--313.Google ScholarGoogle ScholarCross RefCross Ref
  53. Swapneela Unkule, Christopher Shaltz, and Apan Qasem. 2012. Automatic Restructuring of GPU Kernels for Exploiting Inter-thread Data Locality. In Proceedings of the 21st International Conference on Compiler Construction (CC'12). Springer-Verlag, Berlin, Heidelberg, 21--40. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Mohamed Wahib and Naoya Maruyama. 2014. Scalable Kernel Fusion for Memory-bound GPU Applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '14). IEEE Press, 191--202. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. Mohamed Wahib and Naoya Maruyama. 2015. Automated GPU Kernel Transformations in Large-Scale Production Stencil Applications. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '15). ACM, New York, NY, USA, 259--270. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. Jian Wang, Andreas Krall, M. Anton Ertl, and Christine Eisenbeis. 1994. Software Pipelining with Register Allocation and Spilling. In Proceedings of the 27th Annual International Symposium on Microarchitecture (MICRO 27). ACM, New York, NY, USA, 95--99. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, and Robert Hundt. 2016. gpucc: An Open-source GPGPU Compiler. In Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO '16). 105--116. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. Xiaolong Xie, Yun Liang, Xiuhong Li, Yudong Wu, Guangyu Sun, Tao Wang, and Dongrui Fan. 2015. Enabling Coordinated Register Allocation and Thread-level Parallelism Optimization for GPUs. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). ACM, New York, NY, USA, 395--406. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. Jingling Xue. 1997. On Tiling as a Loop Transformation. Parallel Processing Letters 07, 04 (1997), 409--424.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Register optimizations for stencils on GPUs

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
      February 2018
      442 pages
      ISBN:9781450349826
      DOI:10.1145/3178487
      • cover image ACM SIGPLAN Notices
        ACM SIGPLAN Notices  Volume 53, Issue 1
        PPoPP '18
        January 2018
        426 pages
        ISSN:0362-1340
        EISSN:1558-1160
        DOI:10.1145/3200691
        Issue’s Table of Contents

      Copyright © 2018 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 10 February 2018

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate230of1,014submissions,23%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader