skip to main content
research-article
Open Access

Divergence analysis

Published:03 January 2014Publication History
Skip Abstract Section

Abstract

Growing interest in graphics processing units has brought renewed attention to the Single Instruction Multiple Data (SIMD) execution model. SIMD machines give application developers tremendous computational power; however, programming them is still challenging. In particular, developers must deal with memory and control-flow divergences. These phenomena stem from a condition that we call data divergence, which occurs whenever two processing elements (PEs) see the same variable name holding different values. This article introduces divergence analysis, a static analysis that discovers data divergences. This analysis, currently deployed in an industrial quality compiler, is useful in several ways: it improves the translation of SIMD code to non-SIMD CPUs, it helps developers to manually improve their SIMD applications, and it also guides the automatic optimization of SIMD programs. We demonstrate this last point by introducing the notion of a divergence-aware register spiller. This spiller uses information from our analysis to either rematerialize or share common data between PEs. As a testimony of its effectiveness, we have tested it on a suite of 395 CUDA kernels from well-known benchmarks. The divergence-aware spiller produces GPU code that is 26.21% faster than the code produced by the register allocator used in the baseline compiler.

References

  1. Abel, N. E., Budnik, P. P., Kuck, D. J., Muraoka, Y., Northcote, R. S., and Wilhelmson, R. B. 1969. TRANQUIL: A language for an array processing computer. In Proceedings of AFIPS. ACM, 57--73. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Aho, A. V., Lam, M. S., Sethi, R., and Ullman, J. D. 2006. Compilers: Principles, Techniques, and Tools 2nd Ed. Addison Wesley. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Aiken, A. and Gay, D. 1998. Barrier inference. In Proceedings of POPL. ACM, 342--354. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Appel, A. W. 1998. SSA is functional programming. SIGPLAN Not. 33, 4, 17--20. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Baghsorkhi, S. S., Delahaye, M., Patel, S. J., Gropp, W. D., and Hwu, W.-M. W. 2010. An adaptive performance modeling tool for GPU architectures. In Proceedings of PPoPP. ACM, 105--114. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Belady, L. A. 1966. A study of replacement algorithms for a virtual storage computer. IBM Syst. J. 5, 2, 78--101. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Blelloch, G. and Chatterjee, S. 1990. Vcode: A data-parallel intermediate language. In Proceedings of FMPC. ACM, 471--480.Google ScholarGoogle Scholar
  8. Boudier, P. and Sellers, G. 2011. Memory system on Fusion APUs. In Proceedings of the AMD Fusion Developer Summit. AMD.Google ScholarGoogle Scholar
  9. Bougé, L. and Levaire, J.-L. 1992. Control structures for data-parallel SIMD languages: Semantics and implementation. Future Gen. Comput. Syst. 8, 4, 363--378. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Bouknight, W., Denenberg, S. A., McIntyre, D. E., Randall, J. M., Sameh, A. H., and Slotnick, D. L. 1972. The Illiac IV system. IEEE 60, 4, 369--388.Google ScholarGoogle ScholarCross RefCross Ref
  11. Briggs, P., Cooper, K. D., and Torczon, L. 1992. Rematerialization. In Proceedings of PLDI. ACM, 311--321. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Brockmann, K. and Wanka, R. 1997. Efficient oblivious parallel sorting on the MasPar MP-1. Vol. ICSS. 1, 200. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Budimlic, Z., Cooper, K. D., Harvey, T. J., Kennedy, K., Oberg, T. S., and Reeves, S. W. 2002. Fast copy coalescing and live-range identification. In Proceedings of PLDI. ACM, 25--32. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Carrillo, S., Siegel, J., and Li, X. 2009. A control-structure splitting optimization for GPGPU. In Proceedings of the 6th ACM Conference Computing Frontiers. ACM, 147--150. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Cederman, D. and Tsigas, P. 2009. GPU-quicksort: A practical quicksort algorithm for graphics processors. J. Exp. Algorithmics 14, 1, 4--24. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J. W., Lee, S.-H., and Skadron, K. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC). IEEE, 44--54. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Choi, J.-D., Cytron, R., and Ferrante, J. 1991. Automatic construction of sparse data flow evaluation graphs. In Proceedings of POPL. ACM, 55--66. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Collange, S., Defour, D., and Zhang, Y. 2009. Dynamic detection of uniform and affine vectors in GPGPU computations. In Proceedings of the Parallel Processing Workshop (HPPC). Lecture Notes in computer science, vol. 6043, Springer, 46--55. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Cousot, P. and Halbwachs, N. 1978. Automatic discovery of linear restraints among variables of a program. In Proceedings of POPL. ACM, 84--96. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Coutinho, B., Sampaio, D., Pereira, F. M. Q., and Meira, W. 2011. Divergence analysis and optimizations. In Proceedings of International Conference on Parallel Architecture and Compilation Techniques (PACT). IEEE, 320--329. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Coutinho, B., Sampaio, D., Pereira, F. M. Q., and Meira, W. 2013. Profiling divergences in GPU applications. Concur. Comput. Pract. Exp. 25, 6, 775--789.Google ScholarGoogle ScholarCross RefCross Ref
  22. Cunningham, D., Bordawekar, R., and Saraswat, V. 2011. GPU programming in a high level language: Compiling ×10 to CUDA. In Proceedings of the ACM SIGPLAN X10 Workshop. ACM, 8:1--8:10. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Cytron, R., Ferrante, J., Rosen, B. K., Wegman, M. N., and Zadeck, F. K. 1991. Efficiently computing static single assignment form and the control dependence graph. ACM Trans. Program. Lang. Syst. 13, 4, 451--490. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Darema, F., George, D. A., Norton, V. A., and Pfister, G. F. 1988. A single-program-multiple-data computational model for EPEX/FORTRAN. Parallel Comput. 7, 1, 11--24.Google ScholarGoogle ScholarCross RefCross Ref
  25. Diamos, G., Kerr, A., Yalamanchili, S., and Clark, N. 2010. Ocelot, a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems. In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques (PACT). IEEE, 354--364. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Dubach, C., Cheng, P., Rabbah, R., Bacon, D. F., and Fink, S. J. 2012. Compiling a high-level language for GPUS: (via language support for architectures and compilers). In Proceedings of ACM SIGPLAN PLDI. ACM, 1--12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Farrell, C. A. and Kieronska, D. H. 1996. Formal specification of parallel SIMD execution. Theo. Comp. Science 169, 1, 39--65. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Ferrante, J., Ottenstein, K. J., and Warren, J. D. 1987. The program dependence graph and its use in optimization. ACM Trans. Program. Lang. Syst. 9, 3, 319--349. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Flynn, M. 1972. Some computer organizations and their effectiveness. IEEE Trans. Comput. 21, 9, 948--960. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Fung, W. W. L., Sham, I., Yuan, G., and Aamodt, T. M. 2007. Dynamic warp formation and scheduling for efficient GPU control flow. In Proceedings of IEEE/ACM MICRO. IEEE, 407--420. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Garland, M. and Kirk, D. B. 2010. Understanding throughput-oriented architectures. Comm. ACM 53, 58--66. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Gou, C. and Gaydadjiev, G. 2013. Addressing GPU on-chip shared memory bank conflicts using elastic pipeline. Int. J. Parallel Program. 41, 3, 400--429.Google ScholarGoogle ScholarCross RefCross Ref
  33. Grover, V., Joannes, B., Aarts, M., and Murphy, M. 2009. Variance analysis for translating CUDA code for execution by a general purpose processor. U.S. Patent 2009/0259997, Filed March 31, 2009, and issued October 15, 2009.Google ScholarGoogle Scholar
  34. Hack, S. and Goos, G. 2006. Optimal register allocation for SSA-form programs in polynomial time. Inf. Process. Lett. 98, 4, 150--155. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Han, T. D. and Abdelrahman, T. S. 2011. Reducing branch divergence in GPU programs. In Proceedings of GPGPU-4. ACM, 3:1--3:8. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Jang, B., Schaa, D., Mistry, P., and Kaeli, D. 2010. Static memory access pattern analysis on a massively parallel GPU. In Proceedings of PLDI. ACM.Google ScholarGoogle Scholar
  37. Karrenberg, R. and Hack, S. 2011. Whole-function vectorization. In Proceedings of the 9th IEEE/ACM CGO. IEEE, 141--150. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Karrenberg, R. and Hack, S. 2012. Improving performance of OpenCL on CPUS. In Proceedings of CC. Springer, 1--20. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Keryell, R., Materat, P., and Paris, N. 1991. POMP, or how to design a massively parallel machine with small developments. In Proceedings of Parallel Architectures and Languages Europe (PARLE). Lecture Note in Computer Science, vol. 505, Springer, 83--100.Google ScholarGoogle ScholarCross RefCross Ref
  40. Kung, S.-Y., Arun, K. S., Gal-Ezer, R. J., and Bhaskar Rao, D. V. 1982. Wavefront array processor: Language, architecture, and applications. IEEE Trans. Comput. 31, 1054--1066. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Lashgar, A. and Baniasadi, A. 2011. Performance in GPU architectures: Potentials and distances. In Proceedings of the 9th Workshop on Duplicating, Deconstructing, and Debunking (WDDD). IEEE, 75--81.Google ScholarGoogle Scholar
  42. Lawrie, D. H., Layman, T., Baer, D., and Randal, J. M. 1975. GLYPNIR-a programming language for ILLIAC IV. Comm. ACM 18, 3, 157--164. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Lee, S., Min, S.-J., and Eigenmann, R. 2009. OpenMP to GPGPU: A compiler framework for automatic translation and optimization. In Proceedings of ACM SIGPLAN PPoPP. ACM, 101--110. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Lee, Y., Avizienis, R., Bishara, A., Xia, R., Lockhart, D., Batten, C., and Asanovic, K. 2011. Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators. In Proceedings of the 38th International Symposium on Computer Architecture (ISCA). ACM, 129--140. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Lee, Y., Krashinsky, R., Grover, V., Keckler, S. W., and Asanovic, K. 2013. Convergence and scalarization for data-parallel architectures. In Proceedings of the IEEE/ACM CGO. ACM, 1--11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Leiba, R., Hack, S., and Wald, I. 2012. Extending a C-like language for portable SIMD programming. In Proceedings of ACM SIGPLAN PPOPP. ACM, 65--74. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Meng, J., Tarjan, D., and Skadron, K. 2010. Dynamic warp subdivision for integrated branch and memory divergence tolerance. In Proceedings of the International Symposium on Computer Architecture (ISCA). ACM, 235--246. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Miné, A. 2006. The octagon abstract domain. Higher Order Symbol. Comput. 19, 31--100. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Mu, S., Zhang, X., Zhang, N., Lu, J., Deng, Y. S., and Zhang, S. 2010. IP routing processing with graphic processors. In Proceedings of DATE. IEEE, 93--98. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Nickolls, J. and Dally, W. J. 2010. The GPU computing era. IEEE Micro 30, 2, 56--69. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Nickolls, J. and Kirk, D. 2009. Graphics and Computing GPUs. Computer Organization and Design, (Patterson and Hennessy) 4th Ed. Elsevier, Chapter A, A.1--A.77.Google ScholarGoogle Scholar
  52. Nielson, F., Nielson, H. R., and Hankin, C. 2005. Principles of Program Analysis. Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Ottenstein, K. J., Ballance, R. A., and MacCabe, A. B. 1990. The program dependence Web: A representation supporting control-, data-, and demand-driven interpretation of imperative languages. In Proceedings of ACM SIGPLAN PLDI. ACM, 257--271. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Pereira, F. M. Q. 2011. Dirmap's Blog. http://divmap.wordpress.com/.Google ScholarGoogle Scholar
  55. Perrot, R. H. 1979. A language for array and vector processors. ACM Trans. Program. Lang. Syst. 1, 177--195. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. Pharr, M. and Mark, W. R. 2012. ISPC: A SPMD compiler for high-performance CPU programming. In Proceedings of Innorative Parallel Computing (InPar).Google ScholarGoogle Scholar
  57. Poletto, M. and Sarkar, V. 1999. Linear scan register allocation. ACM Trans. Program. Lang. Syst. 21, 5, 895--913. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. Prabhu, T., Ramalingam, S., Might, M., and Hall, M. 2011. EigenCFA: Accelerating flow analysis with GPUs. In Proceedings of POPL. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. Ryoo, S., Rodrigues, C. I., Baghsorkhi, S. S., Stone, S. S., Kirk, D. B., and mei W. Hwu, W. 2008. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In Proceedings of ACM SIGPLAN PPoPP. ACM, 73--82. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. Saha, B., Zhou, X., Chen, H., Gao, Y., Yan, S., Rajagopalan, M., Fang, J., Zhang, P., Ronen, R., and Mendelson, A. 2009. Programming model for a heterogeneous ×86 platform. In Proceedings of ACM SIGPLAN PLDI. ACM, 431--440. Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. Samadi, M., Hormati, A., Mehrara, M., and Mahlke, S. 2012. Adaptive input-aware compilation for graphics engines. In Proceedings of ACM SIGPLAN PLDI. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. Sampaio, D., Martins, R., Collange, S., and Pereira, F. M. Q. 2012a. Divergence analysis with affine constraints. In Proceedings of the 24th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). IEEE, 137--146. Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. Sampaio, D. N., Gedeon, E., Pereira, F. M. Q., and Collange, S. 2012b. Spill code placement for simd machines. In Proceedings of the 16th Brazilian Symposium on Language Programmings (SBLP). Lecture Notes on Computer Science, vol. 7554, Springer, 12--26. Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. Sandes, E. F. O. and de Melo, A. C. M. 2010. CUDALIGN: Using GPU to accelerate the comparison of megabase genomic sequences. In Proceedings of ACM SIGPLAN PPoPP. ACM, 137--146. Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. Scholz, B., Zhang, C., and Cifuentes, C. 2008. User-input dependence analysis via graph reachability. Tech. rep., Sun, Inc. Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. Stratton, J. A., Grover, V., Marathe, J., Aarts, B., Murphy, M., Hu, Z., and Hwu, W.-m. W. 2010. Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs. In Proceedings of IEEE/ACM CGO. IEEE, 111--119. Google ScholarGoogle ScholarDigital LibraryDigital Library
  67. Stratton, J. A., Rodrigues, C., Sun, I.-J., Obeid, N., Chang, L.-W., Anssari, N., Liu, G. D., and mei W. Hwu, W. 2012. The parboil report. Tech. rep., University of Illinois at Urbana-Champaign.Google ScholarGoogle Scholar
  68. Tu, P. and Padua, D. 1995. Efficient building and placing of gating functions. In Proceedings of ACM SIGPLAN PLDI. ACM, 47--55. Google ScholarGoogle ScholarDigital LibraryDigital Library
  69. Weiser, M. 1981. Program slicing. In Proceedings of ICSE. IEEE, 439--449. Google ScholarGoogle ScholarDigital LibraryDigital Library
  70. Yang, Y., Xiang, P., Kong, J., and Zhou, H. 2010. A GPGPU compiler for memory optimization and parallelism management. In Proceedings of ACM SIGPLAN PLDI. ACM, 86--97. Google ScholarGoogle ScholarDigital LibraryDigital Library
  71. Zhang, E. Z., Jiang, Y., Guo, Z., and Shen, X. 2010. Streamlining GPU applications on the fly: Thread divergence elimination through runtime thread-data remapping. In Proceedings of the 24th ACM International Conference on Supercomputing (ICS). ACM, 115--126. Google ScholarGoogle ScholarDigital LibraryDigital Library
  72. Zhang, E. Z., Jiang, Y., Guo, Z., Tian, K., and Shen, X. 2011. On-the-fly elimination of dynamic irregularities for GPU computing. In Proceedings of ASPLOS. ACM, 369--380. Google ScholarGoogle ScholarDigital LibraryDigital Library
  73. Zhang, Y. and Owens, J. D. 2011. A quantitative performance analysis model for GPU architectures. In Proceedings of the IEEE 17th International Symposium on High Performance Computer Architecture (HPCA). ACM, 382--393. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Divergence analysis

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image ACM Transactions on Programming Languages and Systems
      ACM Transactions on Programming Languages and Systems  Volume 35, Issue 4
      December 2013
      169 pages
      ISSN:0164-0925
      EISSN:1558-4593
      DOI:10.1145/2560142
      Issue’s Table of Contents

      Copyright © 2014 ACM

      © 2014 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 3 January 2014
      • Accepted: 1 August 2013
      • Revised: 1 June 2013
      • Received: 1 December 2012
      Published in toplas Volume 35, Issue 4

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article
      • Research
      • Refereed

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader