ABSTRACT
Register allocation is a well-studied problem, but surprisingly little work has been published on assigning registers for GPU architectures. In this paper we present the register allocator in the production compiler for Intel HD and Iris Graphics. Intel GPUs feature a large byte-addressable register file organized into banks, an expressive instruction set that supports variable SIMD-sizes and divergent control flow, and high spill overhead due to relatively long memory latencies. These distinctive characteristics impose challenges for register allocation, as input programs may have arbitrarily-sized variables, partial updates, and complex control flow. Not only should the allocator make a program spill-free, but it must also reduce the number of register bank conflicts and anti-dependencies. Since compilation occurs in a JIT environment, the allocator also needs to incur little overhead.
To manage compilation overhead, our register allocation framework adopts a hybrid approach that separates the assignment of local and global variables. Several extensions are introduced to the traditional graph-coloring algorithm to support variables with different sizes and to accurately model liveness under divergent branches. Different assignment polices are applied to exploit the trade-offs between minimizing register usage and avoiding bank conflicts and anti-dependencies. Experimental results show our framework produces very few spilling kernels and can improve RA JIT time by up to 4x over pure graph-coloring. Our round-robin and bank-conflict-reduction assignment policies can also achieve up to 20% runtime improvements.
- 2K Games. 2013. Bioshock Infinite. (2013). https://bioshockinfinite.ghoststorygames.com/.Google Scholar
- Advanced Micro Devices. 2016. AMD GCN3 ISA Architecture Manual. (2016). http://gpuopen.com/compute-product/amd-gcn3-isa-architecture-manual/.Google Scholar
- Apple. 2017. Apple Metal 2. (2017). https://developer.apple.com/metal/.Google Scholar
- Blizzard Entertainment. 2016. World of Warcraft: Legion. (2016). https://worldofwarcraft.com/.Google Scholar
- David Blythe. 2006. The direct3d 10 system. In ACM Transactions on Graphics (TOG), Vol. 25. ACM, 724-734. Google ScholarDigital Library
- Matthias Braun and Sebastian Hack. 2009. Register Spilling and Live-Range Splitting for SSA-Form Programs. In Proceedings of the 18th International Conference on Compiler Construction: Held As Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2009 (CC '09). Springer-Verlag, Berlin, Heidelberg, 174-189. Google ScholarDigital Library
- Preston Briggs, Keith D Cooper, and Linda Torczon. 1994. Improvements to graph coloring register allocation. ACM Transactions on Programming Languages and Systems (TOPLAS) 16, 3 (1994), 428-455. Google ScholarDigital Library
- John Cavazos, J. Eliot B. Moss, and Michael F. P. O'Boyle. 2006. Hybrid Optimizations: Which Optimization Algorithm to Use?. In Proceedings of the 15th International Conference on Compiler Construction (CC'06). Springer-Verlag, Berlin, Heidelberg, 124-138. Google ScholarDigital Library
- Gregory Chaitin. 2004. Register allocation and spilling via graph coloring. Acm Sigplan Notices 39, 4 (2004), 66-74. Google ScholarDigital Library
- Fred C Chow and John L Hennessy. 1990. The priority-based coloring approach to register allocation. ACM Transactions on Programming Languages and Systems (TOPLAS) 12, 4 (1990), 501-536. Google ScholarDigital Library
- Quentin Colombet, Benoit Boissinot, Philip Brisk, Sebastian Hack, and Fabrice Rastello. 2011. Graph-coloring and Treescan Register Allocation Using Repairing. In Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES '11). ACM, New York, NY, USA, 45-54. Google ScholarDigital Library
- Gregory Diamos, Benjamin Ashbaugh, Subramaniam Maiyuran, Andrew Kerr, Haicheng Wu, and Sudhakar Yalamanchili. 2011. SIMD re-convergence at thread frontiers. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 477-488. Google ScholarDigital Library
- Ahmed ElTantawy, Jessica Wenjie Ma, Mike O'Connor, and Tor M Aamodt. 2014. A scalable multi-path microarchitecture for efficient GPU control flow. In High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on. IEEE, 248-259.Google ScholarCross Ref
- Wilson WL Fung, Ivan Sham, George Yuan, and Tor M Aamodt. 2007. Dynamic warp formation and scheduling for efficient GPU control flow. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 407-420. Google ScholarDigital Library
- Futuremark. 2017. 3DMark. (2017). https://www.futuremark.com/benchmarks/3dmark.Google Scholar
- Futuremark. 2017. VRMark - the virtual reality benchmark. (2017). https://www.futuremark.com/benchmarks/vrmark.Google Scholar
- James R Goodman and W-C Hsu. 1988. Code scheduling and register allocation in large basic blocks. In Proceedings of the 2nd international conference on Supercomputing. ACM, 442-452. Google ScholarDigital Library
- Sebastian Hack, Daniel Grund, and Gerhard Goos. 2006. Register allocation for programs in SSA-form. CC 6 (2006), 247-262. Google ScholarDigital Library
- Tianyi David Han and Tarek S Abdelrahman. 2011. Reducing branch divergence in GPU programs. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units. ACM, 3. Google ScholarDigital Library
- Ari B Hayes and Eddy Z Zhang. 2014. Unified on-chip memory allocation for SIMT architecture. In Proceedings of the 28th ACM international conference on Supercomputing. ACM, 293-302. Google ScholarDigital Library
- Intel 2017. Intel Graphics for Linux documentation. (2017). https://01.org/linuxgraphics/documentation.Google Scholar
- Intel 2017. Intel Processor Graphics. (2017). https://software.intel.com/en-us/articles/intel-graphics-developers-guides.Google Scholar
- Chi keung Luk. 2015. MICRO48-Tutorial on Intel Processor Graphics: Architecture and Programming. (2015). https://software.intel.com/en-us/blogs/2015/08/27/micro48-tutorial-on-intel-processor-graphics-architecture-and-programming.Google Scholar
- Kishonti Ltd. 2017. CompuBench - performance benchmark for various compute APIs. (2017). https://compubench.com/.Google Scholar
- Kishonti Ltd. 2017. GFXBench. (2017). https://gfxbench.com/benchmark.jsp.Google Scholar
- Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097-1105. Google ScholarDigital Library
- Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization (CGO '04). IEEE Computer Society, Washington, DC, USA, 75-. http://dl.acm.org/citation.cfm?id=977395.977673 Google ScholarDigital Library
- Guei-Yuan Lueh, Thomas Gross, and Ali-Reza Adl-Tabatabai. 2000. Fusion-based Register Allocation. ACM Trans. Program. Lang. Syst. 22, 3 (May 2000), 431-470. Google ScholarDigital Library
- Sparsh Mittal. 2017. A survey of techniques for architecting and managing GPU register file. IEEE Transactions on Parallel and Distributed Systems 28, 1 (2017), 16-28. Google ScholarDigital Library
- Rajeev Motwani, Krishna V Palem, Vivek Sarkar, and Salem Reyen. 1995. Combining register allocation and instruction scheduling. Courant Institute, New York University (1995).Google Scholar
- Brian R Nickerson. 1990. Graph coloring register allocation for processors with multi-register operands. In ACM SIGPLAN Notices, Vol. 25. ACM, 40-52. Google ScholarDigital Library
- Nvidia 2016. Pascal Architecture Whitepaper. (2016). https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf.Google Scholar
- Fernando Magno Quintao Pereira and Jens Palsberg. 2005. Register allocation via coloring of chordal graphs. In APLAS, Vol. 5. Springer, 315-329. Google ScholarDigital Library
- Shlomit S Pinter. 1993. Register allocation with instruction scheduling. In ACM SIGPLAN Notices, Vol. 28. ACM, 248-257. Google ScholarDigital Library
- Massimiliano Poletto and Vivek Sarkar. 1999. Linear scan register allocation. ACM Transactions on Programming Languages and Systems (TOPLAS) 21, 5 (1999), 895-913. Google ScholarDigital Library
- Fernando Magno Quintão Pereira and Jens Palsberg. 2008. Register Allocation by Puzzle Solving. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08). ACM, New York, NY, USA, 216-226. Google ScholarDigital Library
- Minsoo Rhu and Mattan Erez. 2013. The dual-path execution model for efficient GPU control flow. In High Performance Computer Architecture (HPCA2013), 2013 IEEE 19th International Symposium on. IEEE, 591-602. Google ScholarDigital Library
- Johan Runeson and Sven-Olof Nyström. 2003. Retargetable graph-coloring register allocation for irregular architectures. In SCOPES. Springer, 240-254.Google Scholar
- Diogo Sampaio, Rafael Martins de Souza, Sylvain Collange, and Fernando Magno Quintão Pereira. 2014. Divergence Analysis. ACM Trans. Program. Lang. Syst. 35, 4, Article 13 (Jan. 2014), 36 pages.Google Scholar
- Dave Shreiner, Graham Sellers, John Kessenich, and Bill Licea-Kane. 2013. OpenGL programming guide: The Official guide to learning OpenGL, version 4.3. Addison-Wesley. Google ScholarDigital Library
- SiSoftware. 2016. Sandra Platinum. (2016). http://www.sisoftware.net.Google Scholar
- Michael D. Smith, Norman Ramsey, and Glenn Holloway. 2004. A Generalized Algorithm for Graph-coloring Register Allocation. In Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation (PLDI '04). ACM, New York, NY, USA, 277-288. Google ScholarDigital Library
- John E Stone, David Gohara, and Guochun Shi. 2010. OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in science & engineering 12, 3 (2010), 66-73. Google ScholarDigital Library
- The Khronos Group. 2016. Vulkan Overview. (2016). https://www.khronos.org/vulkan/.Google Scholar
- Xiaolong Xie, Yun Liang, Xiuhong Li, Yudong Wu, Guangyu Sun, Tao Wang, and Dongrui Fan. 2015. Enabling coordinated register allocation and thread-level parallelism optimization for GPUs. In Proceedings of the 48th International Symposium on Microarchitecture. ACM, 395-406. Google ScholarDigital Library
Index Terms
- Register allocation for Intel processor graphics
Recommendations
Differential register allocation
PLDI '05: Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementationMicro-architecture designers are very cautious about expanding the number of architected registers (also the register field), because increasing the register field adds to the code size, raises I-cache and memory pressure, complicates processor ...
Differential register allocation
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementationMicro-architecture designers are very cautious about expanding the number of architected registers (also the register field), because increasing the register field adds to the code size, raises I-cache and memory pressure, complicates processor ...
Bytewise Register Allocation
SCOPES '15: Proceedings of the 18th International Workshop on Software and Compilers for Embedded SystemsTraditionally, variables have been considered as atoms by register allocation: Each variable was to be placed in one register, or spilt (placed in main memory) or rematerialized (recalculated as needed). Some flexibility arose from what would be ...
Comments