research-article

Open Access

Divergence analysis

Authors:
Diogo Sampaio

Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
View Profile

,
Rafael Martins de Souza

Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
View Profile

,
Caroline Collange

INRIA, Rennes Cedex, France

INRIA, Rennes Cedex, France
View Profile

,
Fernando Magno Quintão Pereira

Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
View Profile

ACM Transactions on Programming Languages and Systems Volume 35 Issue 4Article No.: 13pp 1–36https://doi.org/10.1145/2523815

Published:03 January 2014Publication History

ACM Transactions on Programming Languages and Systems

Abstract

Growing interest in graphics processing units has brought renewed attention to the Single Instruction Multiple Data (SIMD) execution model. SIMD machines give application developers tremendous computational power; however, programming them is still challenging. In particular, developers must deal with memory and control-flow divergences. These phenomena stem from a condition that we call data divergence, which occurs whenever two processing elements (PEs) see the same variable name holding different values. This article introduces divergence analysis, a static analysis that discovers data divergences. This analysis, currently deployed in an industrial quality compiler, is useful in several ways: it improves the translation of SIMD code to non-SIMD CPUs, it helps developers to manually improve their SIMD applications, and it also guides the automatic optimization of SIMD programs. We demonstrate this last point by introducing the notion of a divergence-aware register spiller. This spiller uses information from our analysis to either rematerialize or share common data between PEs. As a testimony of its effectiveness, we have tested it on a suite of 395 CUDA kernels from well-known benchmarks. The divergence-aware spiller produces GPU code that is 26.21% faster than the code produced by the register allocator used in the baseline compiler.

References

Abel, N. E., Budnik, P. P., Kuck, D. J., Muraoka, Y., Northcote, R. S., and Wilhelmson, R. B. 1969. TRANQUIL: A language for an array processing computer. In Proceedings of AFIPS. ACM, 57--73. Google ScholarDigital Library
Aho, A. V., Lam, M. S., Sethi, R., and Ullman, J. D. 2006. Compilers: Principles, Techniques, and Tools 2nd Ed. Addison Wesley. Google ScholarDigital Library
Aiken, A. and Gay, D. 1998. Barrier inference. In Proceedings of POPL. ACM, 342--354. Google ScholarDigital Library
Appel, A. W. 1998. SSA is functional programming. SIGPLAN Not. 33, 4, 17--20. Google ScholarDigital Library
Baghsorkhi, S. S., Delahaye, M., Patel, S. J., Gropp, W. D., and Hwu, W.-M. W. 2010. An adaptive performance modeling tool for GPU architectures. In Proceedings of PPoPP. ACM, 105--114. Google ScholarDigital Library
Belady, L. A. 1966. A study of replacement algorithms for a virtual storage computer. IBM Syst. J. 5, 2, 78--101. Google ScholarDigital Library
Blelloch, G. and Chatterjee, S. 1990. Vcode: A data-parallel intermediate language. In Proceedings of FMPC. ACM, 471--480.Google Scholar
Boudier, P. and Sellers, G. 2011. Memory system on Fusion APUs. In Proceedings of the AMD Fusion Developer Summit. AMD.Google Scholar
Bougé, L. and Levaire, J.-L. 1992. Control structures for data-parallel SIMD languages: Semantics and implementation. Future Gen. Comput. Syst. 8, 4, 363--378. Google ScholarDigital Library
Bouknight, W., Denenberg, S. A., McIntyre, D. E., Randall, J. M., Sameh, A. H., and Slotnick, D. L. 1972. The Illiac IV system. IEEE 60, 4, 369--388.Google ScholarCross Ref
Briggs, P., Cooper, K. D., and Torczon, L. 1992. Rematerialization. In Proceedings of PLDI. ACM, 311--321. Google ScholarDigital Library
Brockmann, K. and Wanka, R. 1997. Efficient oblivious parallel sorting on the MasPar MP-1. Vol. ICSS. 1, 200. Google ScholarDigital Library
Budimlic, Z., Cooper, K. D., Harvey, T. J., Kennedy, K., Oberg, T. S., and Reeves, S. W. 2002. Fast copy coalescing and live-range identification. In Proceedings of PLDI. ACM, 25--32. Google ScholarDigital Library
Carrillo, S., Siegel, J., and Li, X. 2009. A control-structure splitting optimization for GPGPU. In Proceedings of the 6th ACM Conference Computing Frontiers. ACM, 147--150. Google ScholarDigital Library
Cederman, D. and Tsigas, P. 2009. GPU-quicksort: A practical quicksort algorithm for graphics processors. J. Exp. Algorithmics 14, 1, 4--24. Google ScholarDigital Library
Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J. W., Lee, S.-H., and Skadron, K. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC). IEEE, 44--54. Google ScholarDigital Library
Choi, J.-D., Cytron, R., and Ferrante, J. 1991. Automatic construction of sparse data flow evaluation graphs. In Proceedings of POPL. ACM, 55--66. Google ScholarDigital Library
Collange, S., Defour, D., and Zhang, Y. 2009. Dynamic detection of uniform and affine vectors in GPGPU computations. In Proceedings of the Parallel Processing Workshop (HPPC). Lecture Notes in computer science, vol. 6043, Springer, 46--55. Google ScholarDigital Library
Cousot, P. and Halbwachs, N. 1978. Automatic discovery of linear restraints among variables of a program. In Proceedings of POPL. ACM, 84--96. Google ScholarDigital Library
Coutinho, B., Sampaio, D., Pereira, F. M. Q., and Meira, W. 2011. Divergence analysis and optimizations. In Proceedings of International Conference on Parallel Architecture and Compilation Techniques (PACT). IEEE, 320--329. Google ScholarDigital Library
Coutinho, B., Sampaio, D., Pereira, F. M. Q., and Meira, W. 2013. Profiling divergences in GPU applications. Concur. Comput. Pract. Exp. 25, 6, 775--789.Google ScholarCross Ref
Cunningham, D., Bordawekar, R., and Saraswat, V. 2011. GPU programming in a high level language: Compiling ×10 to CUDA. In Proceedings of the ACM SIGPLAN X10 Workshop. ACM, 8:1--8:10. Google ScholarDigital Library
Cytron, R., Ferrante, J., Rosen, B. K., Wegman, M. N., and Zadeck, F. K. 1991. Efficiently computing static single assignment form and the control dependence graph. ACM Trans. Program. Lang. Syst. 13, 4, 451--490. Google ScholarDigital Library
Darema, F., George, D. A., Norton, V. A., and Pfister, G. F. 1988. A single-program-multiple-data computational model for EPEX/FORTRAN. Parallel Comput. 7, 1, 11--24.Google ScholarCross Ref
Diamos, G., Kerr, A., Yalamanchili, S., and Clark, N. 2010. Ocelot, a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems. In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques (PACT). IEEE, 354--364. Google ScholarDigital Library
Dubach, C., Cheng, P., Rabbah, R., Bacon, D. F., and Fink, S. J. 2012. Compiling a high-level language for GPUS: (via language support for architectures and compilers). In Proceedings of ACM SIGPLAN PLDI. ACM, 1--12. Google ScholarDigital Library
Farrell, C. A. and Kieronska, D. H. 1996. Formal specification of parallel SIMD execution. Theo. Comp. Science 169, 1, 39--65. Google ScholarDigital Library
Ferrante, J., Ottenstein, K. J., and Warren, J. D. 1987. The program dependence graph and its use in optimization. ACM Trans. Program. Lang. Syst. 9, 3, 319--349. Google ScholarDigital Library
Flynn, M. 1972. Some computer organizations and their effectiveness. IEEE Trans. Comput. 21, 9, 948--960. Google ScholarDigital Library
Fung, W. W. L., Sham, I., Yuan, G., and Aamodt, T. M. 2007. Dynamic warp formation and scheduling for efficient GPU control flow. In Proceedings of IEEE/ACM MICRO. IEEE, 407--420. Google ScholarDigital Library
Garland, M. and Kirk, D. B. 2010. Understanding throughput-oriented architectures. Comm. ACM 53, 58--66. Google ScholarDigital Library
Gou, C. and Gaydadjiev, G. 2013. Addressing GPU on-chip shared memory bank conflicts using elastic pipeline. Int. J. Parallel Program. 41, 3, 400--429.Google ScholarCross Ref
Grover, V., Joannes, B., Aarts, M., and Murphy, M. 2009. Variance analysis for translating CUDA code for execution by a general purpose processor. U.S. Patent 2009/0259997, Filed March 31, 2009, and issued October 15, 2009.Google Scholar
Hack, S. and Goos, G. 2006. Optimal register allocation for SSA-form programs in polynomial time. Inf. Process. Lett. 98, 4, 150--155. Google ScholarDigital Library
Han, T. D. and Abdelrahman, T. S. 2011. Reducing branch divergence in GPU programs. In Proceedings of GPGPU-4. ACM, 3:1--3:8. Google ScholarDigital Library
Jang, B., Schaa, D., Mistry, P., and Kaeli, D. 2010. Static memory access pattern analysis on a massively parallel GPU. In Proceedings of PLDI. ACM.Google Scholar
Karrenberg, R. and Hack, S. 2011. Whole-function vectorization. In Proceedings of the 9th IEEE/ACM CGO. IEEE, 141--150. Google ScholarDigital Library
Karrenberg, R. and Hack, S. 2012. Improving performance of OpenCL on CPUS. In Proceedings of CC. Springer, 1--20. Google ScholarDigital Library
Keryell, R., Materat, P., and Paris, N. 1991. POMP, or how to design a massively parallel machine with small developments. In Proceedings of Parallel Architectures and Languages Europe (PARLE). Lecture Note in Computer Science, vol. 505, Springer, 83--100.Google ScholarCross Ref
Kung, S.-Y., Arun, K. S., Gal-Ezer, R. J., and Bhaskar Rao, D. V. 1982. Wavefront array processor: Language, architecture, and applications. IEEE Trans. Comput. 31, 1054--1066. Google ScholarDigital Library
Lashgar, A. and Baniasadi, A. 2011. Performance in GPU architectures: Potentials and distances. In Proceedings of the 9th Workshop on Duplicating, Deconstructing, and Debunking (WDDD). IEEE, 75--81.Google Scholar
Lawrie, D. H., Layman, T., Baer, D., and Randal, J. M. 1975. GLYPNIR-a programming language for ILLIAC IV. Comm. ACM 18, 3, 157--164. Google ScholarDigital Library
Lee, S., Min, S.-J., and Eigenmann, R. 2009. OpenMP to GPGPU: A compiler framework for automatic translation and optimization. In Proceedings of ACM SIGPLAN PPoPP. ACM, 101--110. Google ScholarDigital Library
Lee, Y., Avizienis, R., Bishara, A., Xia, R., Lockhart, D., Batten, C., and Asanovic, K. 2011. Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators. In Proceedings of the 38th International Symposium on Computer Architecture (ISCA). ACM, 129--140. Google ScholarDigital Library
Lee, Y., Krashinsky, R., Grover, V., Keckler, S. W., and Asanovic, K. 2013. Convergence and scalarization for data-parallel architectures. In Proceedings of the IEEE/ACM CGO. ACM, 1--11. Google ScholarDigital Library
Leiba, R., Hack, S., and Wald, I. 2012. Extending a C-like language for portable SIMD programming. In Proceedings of ACM SIGPLAN PPOPP. ACM, 65--74. Google ScholarDigital Library
Meng, J., Tarjan, D., and Skadron, K. 2010. Dynamic warp subdivision for integrated branch and memory divergence tolerance. In Proceedings of the International Symposium on Computer Architecture (ISCA). ACM, 235--246. Google ScholarDigital Library
Miné, A. 2006. The octagon abstract domain. Higher Order Symbol. Comput. 19, 31--100. Google ScholarDigital Library
Mu, S., Zhang, X., Zhang, N., Lu, J., Deng, Y. S., and Zhang, S. 2010. IP routing processing with graphic processors. In Proceedings of DATE. IEEE, 93--98. Google ScholarDigital Library
Nickolls, J. and Dally, W. J. 2010. The GPU computing era. IEEE Micro 30, 2, 56--69. Google ScholarDigital Library
Nickolls, J. and Kirk, D. 2009. Graphics and Computing GPUs. Computer Organization and Design, (Patterson and Hennessy) 4th Ed. Elsevier, Chapter A, A.1--A.77.Google Scholar
Nielson, F., Nielson, H. R., and Hankin, C. 2005. Principles of Program Analysis. Springer. Google ScholarDigital Library
Ottenstein, K. J., Ballance, R. A., and MacCabe, A. B. 1990. The program dependence Web: A representation supporting control-, data-, and demand-driven interpretation of imperative languages. In Proceedings of ACM SIGPLAN PLDI. ACM, 257--271. Google ScholarDigital Library
Pereira, F. M. Q. 2011. Dirmap's Blog. http://divmap.wordpress.com/.Google Scholar
Perrot, R. H. 1979. A language for array and vector processors. ACM Trans. Program. Lang. Syst. 1, 177--195. Google ScholarDigital Library
Pharr, M. and Mark, W. R. 2012. ISPC: A SPMD compiler for high-performance CPU programming. In Proceedings of Innorative Parallel Computing (InPar).Google Scholar
Poletto, M. and Sarkar, V. 1999. Linear scan register allocation. ACM Trans. Program. Lang. Syst. 21, 5, 895--913. Google ScholarDigital Library
Prabhu, T., Ramalingam, S., Might, M., and Hall, M. 2011. EigenCFA: Accelerating flow analysis with GPUs. In Proceedings of POPL. ACM. Google ScholarDigital Library
Ryoo, S., Rodrigues, C. I., Baghsorkhi, S. S., Stone, S. S., Kirk, D. B., and mei W. Hwu, W. 2008. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In Proceedings of ACM SIGPLAN PPoPP. ACM, 73--82. Google ScholarDigital Library
Saha, B., Zhou, X., Chen, H., Gao, Y., Yan, S., Rajagopalan, M., Fang, J., Zhang, P., Ronen, R., and Mendelson, A. 2009. Programming model for a heterogeneous ×86 platform. In Proceedings of ACM SIGPLAN PLDI. ACM, 431--440. Google ScholarDigital Library
Samadi, M., Hormati, A., Mehrara, M., and Mahlke, S. 2012. Adaptive input-aware compilation for graphics engines. In Proceedings of ACM SIGPLAN PLDI. ACM. Google ScholarDigital Library
Sampaio, D., Martins, R., Collange, S., and Pereira, F. M. Q. 2012a. Divergence analysis with affine constraints. In Proceedings of the 24th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). IEEE, 137--146. Google ScholarDigital Library
Sampaio, D. N., Gedeon, E., Pereira, F. M. Q., and Collange, S. 2012b. Spill code placement for simd machines. In Proceedings of the 16th Brazilian Symposium on Language Programmings (SBLP). Lecture Notes on Computer Science, vol. 7554, Springer, 12--26. Google ScholarDigital Library
Sandes, E. F. O. and de Melo, A. C. M. 2010. CUDALIGN: Using GPU to accelerate the comparison of megabase genomic sequences. In Proceedings of ACM SIGPLAN PPoPP. ACM, 137--146. Google ScholarDigital Library
Scholz, B., Zhang, C., and Cifuentes, C. 2008. User-input dependence analysis via graph reachability. Tech. rep., Sun, Inc. Google ScholarDigital Library
Stratton, J. A., Grover, V., Marathe, J., Aarts, B., Murphy, M., Hu, Z., and Hwu, W.-m. W. 2010. Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs. In Proceedings of IEEE/ACM CGO. IEEE, 111--119. Google ScholarDigital Library
Stratton, J. A., Rodrigues, C., Sun, I.-J., Obeid, N., Chang, L.-W., Anssari, N., Liu, G. D., and mei W. Hwu, W. 2012. The parboil report. Tech. rep., University of Illinois at Urbana-Champaign.Google Scholar
Tu, P. and Padua, D. 1995. Efficient building and placing of gating functions. In Proceedings of ACM SIGPLAN PLDI. ACM, 47--55. Google ScholarDigital Library
Weiser, M. 1981. Program slicing. In Proceedings of ICSE. IEEE, 439--449. Google ScholarDigital Library
Yang, Y., Xiang, P., Kong, J., and Zhou, H. 2010. A GPGPU compiler for memory optimization and parallelism management. In Proceedings of ACM SIGPLAN PLDI. ACM, 86--97. Google ScholarDigital Library
Zhang, E. Z., Jiang, Y., Guo, Z., and Shen, X. 2010. Streamlining GPU applications on the fly: Thread divergence elimination through runtime thread-data remapping. In Proceedings of the 24th ACM International Conference on Supercomputing (ICS). ACM, 115--126. Google ScholarDigital Library
Zhang, E. Z., Jiang, Y., Guo, Z., Tian, K., and Shen, X. 2011. On-the-fly elimination of dynamic irregularities for GPU computing. In Proceedings of ASPLOS. ACM, 369--380. Google ScholarDigital Library
Zhang, Y. and Owens, J. D. 2011. A quantitative performance analysis model for GPU architectures. In Proceedings of the IEEE 17th International Symposium on High Performance Computer Architecture (HPCA). ACM, 382--393. Google ScholarDigital Library

Index Terms

Divergence analysis
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Recommendations

Pointer-Based Divergence Analysis for OpenCL 2.0 Programs
A modern GPU is designed with many large thread groups to achieve a high throughput and performance. Within these groups, the threads are grouped into fixed-size SIMD batches in which the same instruction is applied to vectors of data in a lockstep. This ...
Read More
Thread scheduling and memory coalescing for dynamic vectorization of SPMD workloads
Highlights
- We propose a new thread synchronization heuristic called Min-SP/PC.
- Min-SP/PC ...
Abstract
Simultaneous Multi-Threading (SMT) is a hardware model in which different threads share the same processing unit. This model is a compromise between high parallelism and low hardware cost. Minimal Multi-Threading (MMT) is one ...
Read More
Divergence Analysis with Affine Constraints
SBAC-PAD '12: Proceedings of the 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing

The rising popularity of graphics processing units is bringing renewed interest in code optimization techniques for SIMD processors. Many of these optimizations rely on divergence analyses, which classify variables as uniform, if they have the same ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Programming Languages and Systems Volume 35, Issue 4
December 2013
169 pages
ISSN:0164-0925
EISSN:1558-4593
DOI:10.1145/2560142
Issue’s Table of Contents

Copyright © 2014 ACM
© 2014 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 3 January 2014
- Accepted: 1 August 2013
- Revised: 1 June 2013
- Received: 1 December 2012
Published in toplas Volume 35, Issue 4

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
SIMD
Static program analysis
divergence analysis
graphics processing units
high performance
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 10
  Total Citations
  View Citations
- 905
  Total Downloads
- Downloads (Last 12 months)171
- Downloads (Last 6 weeks)18
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Divergence analysis

ACM Transactions on Programming Languages and Systems

Abstract

References

Cited By

Index Terms

Recommendations

Pointer-Based Divergence Analysis for OpenCL 2.0 Programs

Thread scheduling and memory coalescing for dynamic vectorization of SPMD workloads

Divergence Analysis with Affine Constraints

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Divergence analysis

ACM Transactions on Programming Languages and Systems

Abstract

References

Cited By

Index Terms

Recommendations

Pointer-Based Divergence Analysis for OpenCL 2.0 Programs

Thread scheduling and memory coalescing for dynamic vectorization of SPMD workloads

Divergence Analysis with Affine Constraints

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media