The Impact of the SIMD Width on Control-Flow and Memory Divergence

Authors: Ralf Karrenberg, Sebastian Hack

ACM Transactions on Architecture and Code Optimization (TACO), Volume 11, Issue 4

Article No.: 54, Pages 1 - 25

https://doi.org/10.1145/2687355

Published: 09 January 2015

Abstract

Power consumption is a prevalent issue in current and future computing systems. SIMD processors amortize the power consumption of managing the instruction stream by executing the same instruction in parallel on multiple data. Consequently, the SIMD width has steadily increased in recent years, and it may well continue to do so. In this article, we experimentally study the influence of the SIMD width on the execution of data-parallel programs. We investigate how an increasing SIMD width (up to 1024) influences control-flow divergence and memory-access divergence, and how well techniques to mitigate them work at larger SIMD widths. We perform our study on 76 OpenCL applications and show that one group of programs scales well up to SIMD width 1024, whereas another group increasingly suffers from control-flow divergence. For the latter programs, thread regrouping techniques may become increasingly important at larger SIMD widths. We show what average speedups can be expected when increasing the SIMD width. For example, when switching from scalar execution to SIMD width 64, one can expect a speedup of 60.11, which increases to 62.46 when using thread regrouping. We also analyze the frequency of regular (uniform, consecutive) memory access patterns and observe a monotonic decrease of regular memory accesses from 82.6% at SIMD width 4 to 43.1% at SIMD width 1024.
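
To make the two kinds of divergence named in the abstract concrete, the following Python sketch measures them on a toy trace. It is an illustration only, not the measurement infrastructure used in the article: all function names are hypothetical, threads are assumed to be grouped in consecutive order, and "consecutive" is assumed to mean that neighboring threads access addresses exactly one element (here 4 bytes) apart.

```python
from typing import List, Sequence


def groups(values: Sequence[int], width: int) -> List[Sequence[int]]:
    """Split per-thread values into SIMD groups of `width` neighboring threads."""
    return [values[i:i + width] for i in range(0, len(values), width)]


def diverges(branch_taken: Sequence[int]) -> bool:
    """A group diverges when its threads disagree on a branch outcome."""
    return len(set(branch_taken)) > 1


def classify_access(addresses: Sequence[int], elem_size: int = 4) -> str:
    """Label one group's memory access as uniform, consecutive, or irregular."""
    if len(set(addresses)) == 1:
        return "uniform"  # every thread touches the same address
    if all(b - a == elem_size for a, b in zip(addresses, addresses[1:])):
        return "consecutive"  # assumed definition: stride of one element
    return "irregular"


def divergent_fraction(branch_taken: Sequence[int], width: int) -> float:
    """Fraction of SIMD groups whose threads take different branch directions."""
    gs = groups(branch_taken, width)
    return sum(diverges(g) for g in gs) / len(gs)


def regular_fraction(addresses: Sequence[int], width: int,
                     elem_size: int = 4) -> float:
    """Fraction of SIMD groups with a uniform or consecutive access pattern."""
    labels = [classify_access(g, elem_size) for g in groups(addresses, width)]
    return sum(label != "irregular" for label in labels) / len(labels)


if __name__ == "__main__":
    # Toy trace for 16 threads: one thread breaks an otherwise linear access,
    # and the first six threads take a branch that the others skip.
    addresses = [4 * t for t in range(16)]
    addresses[13] = 0
    branch_taken = [int(t < 6) for t in range(16)]
    for width in (4, 8, 16):
        print(width,
              divergent_fraction(branch_taken, width),
              regular_fraction(addresses, width))
```

On this toy trace, a single irregular thread spoils a larger share of the groups as the width grows, and a single branch boundary falls inside an ever larger group. This matches the direction of the trends the abstract reports, although the numbers themselves carry no meaning beyond the example.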

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data

Published In

ACM Transactions on Architecture and Code Optimization Volume 11, Issue 4

January 2015

797 pages

ISSN: 1544-3566

EISSN: 1544-3973

DOI: 10.1145/2695583

Editor:
Koen De Bosschere
Ghent University

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 January 2015

Accepted: 01 October 2014

Revised: 01 August 2014

Received: 01 May 2014

Published in TACO Volume 11, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

Intel Visual Computing Institute Saarbrücken
ECOUSS project
German Federal Ministry of Education and Research (BMBF)
