Abstract
Growing interest in graphics processing units has brought renewed attention to the Single Instruction Multiple Data (SIMD) execution model. SIMD machines give application developers tremendous computational power; however, programming them is still challenging. In particular, developers must deal with memory and control-flow divergences. These phenomena stem from a condition that we call data divergence, which occurs whenever two processing elements (PEs) see the same variable name holding different values. This article introduces divergence analysis, a static analysis that discovers data divergences. This analysis, currently deployed in an industrial quality compiler, is useful in several ways: it improves the translation of SIMD code to non-SIMD CPUs, it helps developers to manually improve their SIMD applications, and it also guides the automatic optimization of SIMD programs. We demonstrate this last point by introducing the notion of a divergence-aware register spiller. This spiller uses information from our analysis to either rematerialize or share common data between PEs. As a testimony of its effectiveness, we have tested it on a suite of 395 CUDA kernels from well-known benchmarks. The divergence-aware spiller produces GPU code that is 26.21% faster than the code produced by the register allocator used in the baseline compiler.
- Abel, N. E., Budnik, P. P., Kuck, D. J., Muraoka, Y., Northcote, R. S., and Wilhelmson, R. B. 1969. TRANQUIL: A language for an array processing computer. In Proceedings of AFIPS. ACM, 57--73. Google ScholarDigital Library
- Aho, A. V., Lam, M. S., Sethi, R., and Ullman, J. D. 2006. Compilers: Principles, Techniques, and Tools 2nd Ed. Addison Wesley. Google ScholarDigital Library
- Aiken, A. and Gay, D. 1998. Barrier inference. In Proceedings of POPL. ACM, 342--354. Google ScholarDigital Library
- Appel, A. W. 1998. SSA is functional programming. SIGPLAN Not. 33, 4, 17--20. Google ScholarDigital Library
- Baghsorkhi, S. S., Delahaye, M., Patel, S. J., Gropp, W. D., and Hwu, W.-M. W. 2010. An adaptive performance modeling tool for GPU architectures. In Proceedings of PPoPP. ACM, 105--114. Google ScholarDigital Library
- Belady, L. A. 1966. A study of replacement algorithms for a virtual storage computer. IBM Syst. J. 5, 2, 78--101. Google ScholarDigital Library
- Blelloch, G. and Chatterjee, S. 1990. Vcode: A data-parallel intermediate language. In Proceedings of FMPC. ACM, 471--480.Google Scholar
- Boudier, P. and Sellers, G. 2011. Memory system on Fusion APUs. In Proceedings of the AMD Fusion Developer Summit. AMD.Google Scholar
- Bougé, L. and Levaire, J.-L. 1992. Control structures for data-parallel SIMD languages: Semantics and implementation. Future Gen. Comput. Syst. 8, 4, 363--378. Google ScholarDigital Library
- Bouknight, W., Denenberg, S. A., McIntyre, D. E., Randall, J. M., Sameh, A. H., and Slotnick, D. L. 1972. The Illiac IV system. IEEE 60, 4, 369--388.Google ScholarCross Ref
- Briggs, P., Cooper, K. D., and Torczon, L. 1992. Rematerialization. In Proceedings of PLDI. ACM, 311--321. Google ScholarDigital Library
- Brockmann, K. and Wanka, R. 1997. Efficient oblivious parallel sorting on the MasPar MP-1. Vol. ICSS. 1, 200. Google ScholarDigital Library
- Budimlic, Z., Cooper, K. D., Harvey, T. J., Kennedy, K., Oberg, T. S., and Reeves, S. W. 2002. Fast copy coalescing and live-range identification. In Proceedings of PLDI. ACM, 25--32. Google ScholarDigital Library
- Carrillo, S., Siegel, J., and Li, X. 2009. A control-structure splitting optimization for GPGPU. In Proceedings of the 6th ACM Conference Computing Frontiers. ACM, 147--150. Google ScholarDigital Library
- Cederman, D. and Tsigas, P. 2009. GPU-quicksort: A practical quicksort algorithm for graphics processors. J. Exp. Algorithmics 14, 1, 4--24. Google ScholarDigital Library
- Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J. W., Lee, S.-H., and Skadron, K. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC). IEEE, 44--54. Google ScholarDigital Library
- Choi, J.-D., Cytron, R., and Ferrante, J. 1991. Automatic construction of sparse data flow evaluation graphs. In Proceedings of POPL. ACM, 55--66. Google ScholarDigital Library
- Collange, S., Defour, D., and Zhang, Y. 2009. Dynamic detection of uniform and affine vectors in GPGPU computations. In Proceedings of the Parallel Processing Workshop (HPPC). Lecture Notes in computer science, vol. 6043, Springer, 46--55. Google ScholarDigital Library
- Cousot, P. and Halbwachs, N. 1978. Automatic discovery of linear restraints among variables of a program. In Proceedings of POPL. ACM, 84--96. Google ScholarDigital Library
- Coutinho, B., Sampaio, D., Pereira, F. M. Q., and Meira, W. 2011. Divergence analysis and optimizations. In Proceedings of International Conference on Parallel Architecture and Compilation Techniques (PACT). IEEE, 320--329. Google ScholarDigital Library
- Coutinho, B., Sampaio, D., Pereira, F. M. Q., and Meira, W. 2013. Profiling divergences in GPU applications. Concur. Comput. Pract. Exp. 25, 6, 775--789.Google ScholarCross Ref
- Cunningham, D., Bordawekar, R., and Saraswat, V. 2011. GPU programming in a high level language: Compiling ×10 to CUDA. In Proceedings of the ACM SIGPLAN X10 Workshop. ACM, 8:1--8:10. Google ScholarDigital Library
- Cytron, R., Ferrante, J., Rosen, B. K., Wegman, M. N., and Zadeck, F. K. 1991. Efficiently computing static single assignment form and the control dependence graph. ACM Trans. Program. Lang. Syst. 13, 4, 451--490. Google ScholarDigital Library
- Darema, F., George, D. A., Norton, V. A., and Pfister, G. F. 1988. A single-program-multiple-data computational model for EPEX/FORTRAN. Parallel Comput. 7, 1, 11--24.Google ScholarCross Ref
- Diamos, G., Kerr, A., Yalamanchili, S., and Clark, N. 2010. Ocelot, a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems. In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques (PACT). IEEE, 354--364. Google ScholarDigital Library
- Dubach, C., Cheng, P., Rabbah, R., Bacon, D. F., and Fink, S. J. 2012. Compiling a high-level language for GPUS: (via language support for architectures and compilers). In Proceedings of ACM SIGPLAN PLDI. ACM, 1--12. Google ScholarDigital Library
- Farrell, C. A. and Kieronska, D. H. 1996. Formal specification of parallel SIMD execution. Theo. Comp. Science 169, 1, 39--65. Google ScholarDigital Library
- Ferrante, J., Ottenstein, K. J., and Warren, J. D. 1987. The program dependence graph and its use in optimization. ACM Trans. Program. Lang. Syst. 9, 3, 319--349. Google ScholarDigital Library
- Flynn, M. 1972. Some computer organizations and their effectiveness. IEEE Trans. Comput. 21, 9, 948--960. Google ScholarDigital Library
- Fung, W. W. L., Sham, I., Yuan, G., and Aamodt, T. M. 2007. Dynamic warp formation and scheduling for efficient GPU control flow. In Proceedings of IEEE/ACM MICRO. IEEE, 407--420. Google ScholarDigital Library
- Garland, M. and Kirk, D. B. 2010. Understanding throughput-oriented architectures. Comm. ACM 53, 58--66. Google ScholarDigital Library
- Gou, C. and Gaydadjiev, G. 2013. Addressing GPU on-chip shared memory bank conflicts using elastic pipeline. Int. J. Parallel Program. 41, 3, 400--429.Google ScholarCross Ref
- Grover, V., Joannes, B., Aarts, M., and Murphy, M. 2009. Variance analysis for translating CUDA code for execution by a general purpose processor. U.S. Patent 2009/0259997, Filed March 31, 2009, and issued October 15, 2009.Google Scholar
- Hack, S. and Goos, G. 2006. Optimal register allocation for SSA-form programs in polynomial time. Inf. Process. Lett. 98, 4, 150--155. Google ScholarDigital Library
- Han, T. D. and Abdelrahman, T. S. 2011. Reducing branch divergence in GPU programs. In Proceedings of GPGPU-4. ACM, 3:1--3:8. Google ScholarDigital Library
- Jang, B., Schaa, D., Mistry, P., and Kaeli, D. 2010. Static memory access pattern analysis on a massively parallel GPU. In Proceedings of PLDI. ACM.Google Scholar
- Karrenberg, R. and Hack, S. 2011. Whole-function vectorization. In Proceedings of the 9th IEEE/ACM CGO. IEEE, 141--150. Google ScholarDigital Library
- Karrenberg, R. and Hack, S. 2012. Improving performance of OpenCL on CPUS. In Proceedings of CC. Springer, 1--20. Google ScholarDigital Library
- Keryell, R., Materat, P., and Paris, N. 1991. POMP, or how to design a massively parallel machine with small developments. In Proceedings of Parallel Architectures and Languages Europe (PARLE). Lecture Note in Computer Science, vol. 505, Springer, 83--100.Google ScholarCross Ref
- Kung, S.-Y., Arun, K. S., Gal-Ezer, R. J., and Bhaskar Rao, D. V. 1982. Wavefront array processor: Language, architecture, and applications. IEEE Trans. Comput. 31, 1054--1066. Google ScholarDigital Library
- Lashgar, A. and Baniasadi, A. 2011. Performance in GPU architectures: Potentials and distances. In Proceedings of the 9th Workshop on Duplicating, Deconstructing, and Debunking (WDDD). IEEE, 75--81.Google Scholar
- Lawrie, D. H., Layman, T., Baer, D., and Randal, J. M. 1975. GLYPNIR-a programming language for ILLIAC IV. Comm. ACM 18, 3, 157--164. Google ScholarDigital Library
- Lee, S., Min, S.-J., and Eigenmann, R. 2009. OpenMP to GPGPU: A compiler framework for automatic translation and optimization. In Proceedings of ACM SIGPLAN PPoPP. ACM, 101--110. Google ScholarDigital Library
- Lee, Y., Avizienis, R., Bishara, A., Xia, R., Lockhart, D., Batten, C., and Asanovic, K. 2011. Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators. In Proceedings of the 38th International Symposium on Computer Architecture (ISCA). ACM, 129--140. Google ScholarDigital Library
- Lee, Y., Krashinsky, R., Grover, V., Keckler, S. W., and Asanovic, K. 2013. Convergence and scalarization for data-parallel architectures. In Proceedings of the IEEE/ACM CGO. ACM, 1--11. Google ScholarDigital Library
- Leiba, R., Hack, S., and Wald, I. 2012. Extending a C-like language for portable SIMD programming. In Proceedings of ACM SIGPLAN PPOPP. ACM, 65--74. Google ScholarDigital Library
- Meng, J., Tarjan, D., and Skadron, K. 2010. Dynamic warp subdivision for integrated branch and memory divergence tolerance. In Proceedings of the International Symposium on Computer Architecture (ISCA). ACM, 235--246. Google ScholarDigital Library
- Miné, A. 2006. The octagon abstract domain. Higher Order Symbol. Comput. 19, 31--100. Google ScholarDigital Library
- Mu, S., Zhang, X., Zhang, N., Lu, J., Deng, Y. S., and Zhang, S. 2010. IP routing processing with graphic processors. In Proceedings of DATE. IEEE, 93--98. Google ScholarDigital Library
- Nickolls, J. and Dally, W. J. 2010. The GPU computing era. IEEE Micro 30, 2, 56--69. Google ScholarDigital Library
- Nickolls, J. and Kirk, D. 2009. Graphics and Computing GPUs. Computer Organization and Design, (Patterson and Hennessy) 4th Ed. Elsevier, Chapter A, A.1--A.77.Google Scholar
- Nielson, F., Nielson, H. R., and Hankin, C. 2005. Principles of Program Analysis. Springer. Google ScholarDigital Library
- Ottenstein, K. J., Ballance, R. A., and MacCabe, A. B. 1990. The program dependence Web: A representation supporting control-, data-, and demand-driven interpretation of imperative languages. In Proceedings of ACM SIGPLAN PLDI. ACM, 257--271. Google ScholarDigital Library
- Pereira, F. M. Q. 2011. Dirmap's Blog. http://divmap.wordpress.com/.Google Scholar
- Perrot, R. H. 1979. A language for array and vector processors. ACM Trans. Program. Lang. Syst. 1, 177--195. Google ScholarDigital Library
- Pharr, M. and Mark, W. R. 2012. ISPC: A SPMD compiler for high-performance CPU programming. In Proceedings of Innorative Parallel Computing (InPar).Google Scholar
- Poletto, M. and Sarkar, V. 1999. Linear scan register allocation. ACM Trans. Program. Lang. Syst. 21, 5, 895--913. Google ScholarDigital Library
- Prabhu, T., Ramalingam, S., Might, M., and Hall, M. 2011. EigenCFA: Accelerating flow analysis with GPUs. In Proceedings of POPL. ACM. Google ScholarDigital Library
- Ryoo, S., Rodrigues, C. I., Baghsorkhi, S. S., Stone, S. S., Kirk, D. B., and mei W. Hwu, W. 2008. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In Proceedings of ACM SIGPLAN PPoPP. ACM, 73--82. Google ScholarDigital Library
- Saha, B., Zhou, X., Chen, H., Gao, Y., Yan, S., Rajagopalan, M., Fang, J., Zhang, P., Ronen, R., and Mendelson, A. 2009. Programming model for a heterogeneous ×86 platform. In Proceedings of ACM SIGPLAN PLDI. ACM, 431--440. Google ScholarDigital Library
- Samadi, M., Hormati, A., Mehrara, M., and Mahlke, S. 2012. Adaptive input-aware compilation for graphics engines. In Proceedings of ACM SIGPLAN PLDI. ACM. Google ScholarDigital Library
- Sampaio, D., Martins, R., Collange, S., and Pereira, F. M. Q. 2012a. Divergence analysis with affine constraints. In Proceedings of the 24th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). IEEE, 137--146. Google ScholarDigital Library
- Sampaio, D. N., Gedeon, E., Pereira, F. M. Q., and Collange, S. 2012b. Spill code placement for simd machines. In Proceedings of the 16th Brazilian Symposium on Language Programmings (SBLP). Lecture Notes on Computer Science, vol. 7554, Springer, 12--26. Google ScholarDigital Library
- Sandes, E. F. O. and de Melo, A. C. M. 2010. CUDALIGN: Using GPU to accelerate the comparison of megabase genomic sequences. In Proceedings of ACM SIGPLAN PPoPP. ACM, 137--146. Google ScholarDigital Library
- Scholz, B., Zhang, C., and Cifuentes, C. 2008. User-input dependence analysis via graph reachability. Tech. rep., Sun, Inc. Google ScholarDigital Library
- Stratton, J. A., Grover, V., Marathe, J., Aarts, B., Murphy, M., Hu, Z., and Hwu, W.-m. W. 2010. Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs. In Proceedings of IEEE/ACM CGO. IEEE, 111--119. Google ScholarDigital Library
- Stratton, J. A., Rodrigues, C., Sun, I.-J., Obeid, N., Chang, L.-W., Anssari, N., Liu, G. D., and mei W. Hwu, W. 2012. The parboil report. Tech. rep., University of Illinois at Urbana-Champaign.Google Scholar
- Tu, P. and Padua, D. 1995. Efficient building and placing of gating functions. In Proceedings of ACM SIGPLAN PLDI. ACM, 47--55. Google ScholarDigital Library
- Weiser, M. 1981. Program slicing. In Proceedings of ICSE. IEEE, 439--449. Google ScholarDigital Library
- Yang, Y., Xiang, P., Kong, J., and Zhou, H. 2010. A GPGPU compiler for memory optimization and parallelism management. In Proceedings of ACM SIGPLAN PLDI. ACM, 86--97. Google ScholarDigital Library
- Zhang, E. Z., Jiang, Y., Guo, Z., and Shen, X. 2010. Streamlining GPU applications on the fly: Thread divergence elimination through runtime thread-data remapping. In Proceedings of the 24th ACM International Conference on Supercomputing (ICS). ACM, 115--126. Google ScholarDigital Library
- Zhang, E. Z., Jiang, Y., Guo, Z., Tian, K., and Shen, X. 2011. On-the-fly elimination of dynamic irregularities for GPU computing. In Proceedings of ASPLOS. ACM, 369--380. Google ScholarDigital Library
- Zhang, Y. and Owens, J. D. 2011. A quantitative performance analysis model for GPU architectures. In Proceedings of the IEEE 17th International Symposium on High Performance Computer Architecture (HPCA). ACM, 382--393. Google ScholarDigital Library
Index Terms
- Divergence analysis
Recommendations
Pointer-Based Divergence Analysis for OpenCL 2.0 Programs
A modern GPU is designed with many large thread groups to achieve a high throughput and performance. Within these groups, the threads are grouped into fixed-size SIMD batches in which the same instruction is applied to vectors of data in a lockstep. This ...
Thread scheduling and memory coalescing for dynamic vectorization of SPMD workloads
Highlights- We propose a new thread synchronization heuristic called Min-SP/PC.
- Min-SP/PC ...
AbstractSimultaneous Multi-Threading (SMT) is a hardware model in which different threads share the same processing unit. This model is a compromise between high parallelism and low hardware cost. Minimal Multi-Threading (MMT) is one ...
Divergence Analysis with Affine Constraints
SBAC-PAD '12: Proceedings of the 2012 IEEE 24th International Symposium on Computer Architecture and High Performance ComputingThe rising popularity of graphics processing units is bringing renewed interest in code optimization techniques for SIMD processors. Many of these optimizations rely on divergence analyses, which classify variables as uniform, if they have the same ...
Comments