research-article

SIMD re-convergence at thread frontiers

Authors:

Gregory Diamos,

Benjamin Ashbaugh,

Subramaniam Maiyuran,

Sudhakar YalamanchiliAuthors Info & Claims

MICRO-44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture

Pages 477 - 488

https://doi.org/10.1145/2155620.2155676

Published: 03 December 2011 Publication History

Abstract

Hardware and compiler techniques for mapping data-parallel programs with divergent control flow to SIMD architectures have recently enabled the emergence of new GPGPU programming models such as CUDA, OpenCL, and DirectX Compute. The impact of branch divergence can be quite different depending upon whether the program's control flow is structured or unstructured. In this paper, we show that unstructured control flow occurs frequently in applications and can lead to significant code expansion when executed using existing approaches for handling branch divergence.

This paper proposes a new technique for automatically mapping arbitrary control flow onto SIMD processors that relies on a concept of a Thread Frontier, which is a bounded region of the program containing all threads that have branched away from the current warp. This technique is evaluated on a GPU emulator configured to model i) a commodity GPU (Intel Sandybridge), and ii) custom hardware support not realized in current GPU architectures. It is shown that this new technique performs identically to the best existing method for structured control flow, and re-converges at the earliest possible point when executing unstructured control flow. This leads to i) between 1.5 - 633.2% reductions in dynamic instruction counts for several real applications, ii) simplification of the compilation process, and iii) ability to efficiently add high level unstructured programming constructs (e.g., exceptions) to existing data-parallel languages.

References

[1]

T. Hatazaki, "Tsubame-2 - a 2.4 pflops peak performance system," in Optical Fiber Communication Conference. Optical Society of America, 2011.

[2]

K. Spafford, J. S. Meredith, and J. S. Vetter, "Quantifying numa and contention effects in multi-gpu systems," in Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, ser. GPGPU-4. New York, NY, USA: ACM, 2011, pp. 11:1--11:7.

Digital Library

[3]

L. G. Valiant, "A bridging model for parallel computation," Commun. ACM, 1990.

Digital Library

[4]

H. Wu, G. Diamos, S. Li, and S. Yalamanchili, "Characterization and transformation of unstructured control flow in gpu applications," in The First International Workshop on Characterizing Applications for Heterogeneous Exascale Systems. ACM, June 2011.

[5]

A. Levinthal and T. Porter, "Chap - a simd graphics processor," SIGGRAPH Comput. Graph., vol. 18, no. 3, pp. 77--82, 1984.

Digital Library

[6]

W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt, "Dynamic warp formation and scheduling for efficient gpu control flow," in Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, 2007.

Digital Library

[7]

Intel, Intel HD Graphics OpenSource Programmer Reference Manual, June 2010.

[8]

H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos, "Demystifying gpu microarchitecture through microbenchmarking," in Performance Analysis of Systems Software (ISPASS), 2010 IEEE International Symposium on, 2010, pp. 235--246.

[9]

Wen-mei W. Hwu, "Real-time adaptive gpu multi-agent path planning," in GPU Computing Gems, Wen-mei W. Hwu, Ed. Morgan Kaufmann, Sep. 2010, vol. 2.

[10]

C. Trapnell and M. C. Schatz, "Optimizing data intensive gpgpu computations for dna sequence alignment," Parallel Computing, vol. 35, pp. 429--440, August 2009.

Digital Library

[11]

A. Badal and A. Badano, "Accelerating monte carlo simulations of photon transport in a voxelized geometry using a massively parallel gpu," Medical Physics 36, 2009.

[12]

V. Pham, P. Vo, H. V. Thanh, and B. L. Hoai, "Gpu implementation of extended gaussian mixture model for background subtraction," International Conference on Computing and Telecommunication Technologies, 2010.

[13]

Q. Fang and D. A. Boas, "Monte carlo simulation of photon migration in 3d turbid media accelerated by graphics processing units," Optical Express 17, vol. 17, pp. 20 178--20 190.

[14]

T. Tsiodras, "A real-time raytracer of triangle meshes in cuda," Tech. Rep., February 2011. {Online}. Available: http://users.softlab.ntua.gr/~ttsiod/cudarenderer-BVH.html

[15]

S. G. Parker, J. Bigler, A. Dietrich, H. Friedrich, J. Hoberock, D. Luebke, D. McAllister, M. McGuire, K. Morley, A. Robison, and M. Stich, "Optix: a general purpose ray tracing engine," ACM Transactions on Graphics, vol. 29, pp. 66:1--66:13, July 2010.

Digital Library

[16]

G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, "Ocelot: A dynamic compiler for bulk-synchronous applications in heterogeneous systems," in Proceedings of PACT '10, 2010.

Digital Library

[17]

A. Kerr, G. Diamos, and S. Yalamanchili, "A characterization and analysis of ptx kernels," in IISWC09: IEEE International Symposium on Workload Characterization, Austin, TX, USA, October 2009.

Digital Library

[18]

W. Bouknight, S. Denenberg, D. McIntyre, J. Randall, A. Sameh, and D. Slotnick, "The illiac iv system," Proceedings of the IEEE, vol. 60, no. 4, pp. 369--388, apr. 1972.

[19]

AMD, Evergreen Family Instruction Set Architecture Instructions and Microcode, 2010.

[20]

F. Zhang and E. H. D'Hollander, "Using hammock graphs to structure programs," IEEE Trans. Softw. Eng., pp. 231--245, 2004.

Digital Library

[21]

J. Meng, D. Tarjan, and K. Skadron, "Dynamic warp subdivision for integrated branch and memory divergence tolerance," in Proceedings of the 37th annual international symposium on Computer architecture, ser. ISCA '10. New York, NY, USA: ACM, 2010, pp. 235--246.

Digital Library

[22]

W. Fung and T. Aamodt, "Thread block compaction for efficient simt control flow," in 17th International Symposium on High Performance Computer Architecture, feb. 2011, pp. 25--36.

Digital Library

[23]

B. W. Coon and E. J. Lindholm, "System and method for managing divergent threads in a simd architecture," Patent US 7 353 369, April, 2008.

Cited By

Xekalaki MFumero JStratikopoulos ADoka KKatsakioris CBitsakos CKoziris NKotselidis C(2023)Enabling Transparent Acceleration of Big Data Frameworks Using Heterogeneous HardwareProceedings of the VLDB Endowment10.14778/3565838.356584215:13(3869-3882)Online publication date: 20-Jan-2023
https://dl.acm.org/doi/10.14778/3565838.3565842
Banerjee ANagasamudram RNaumann DNikouei M(2023)A Relational Program Logic with Data Abstraction and Dynamic FramingACM Transactions on Programming Languages and Systems10.1145/355149744:4(1-136)Online publication date: 10-Jan-2023
https://dl.acm.org/doi/10.1145/3551497
Park JKim SLee JKo J(2023)Scalable Color Quantization for Task-centric Image CompressionACM Transactions on Multimedia Computing, Communications, and Applications10.1145/355138919:2s(1-18)Online publication date: 17-Feb-2023
https://dl.acm.org/doi/10.1145/3551389
Show More Cited By

Index Terms

SIMD re-convergence at thread frontiers
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data
2. Software and its engineering
  1. Software notations and tools
    1. Compilers

Recommendations

Thread shuffling: combining DVFS and thread migration toreduce energy consumptions for multi-core systems
ISLPED '11: Proceedings of the 17th IEEE/ACM international symposium on Low-power electronics and design

In recent years, multi-core systems have become mainstream in computer industry. The design of multi-cores takes advantage of thread-level parallelism in emerging applications that are computationally intensive and highly parallel. Energy efficiency is ...
Practical SIMD Vectorization Techniques for Intel® Xeon Phi Coprocessors
IPDPSW '13: Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum

Intel® Xeon Phi™ coprocessor is based on the Intel® Many Integrated Core (Intel® MIC) architecture, which is an innovative new processor architecture that combines abundant thread parallelism with long SIMD vector units. Efficiently exploiting SIMD ...
Effective SIMD vectorization for intel Xeon Phi coprocessors
Special issue on Programming Models, Languages, and Compilers for Manycore and Heterogeneous Architectures

Efficiently exploiting SIMD vector units is one of the most important aspects in achieving high performance of the application code running on Intel Xeon Phi coprocessors. In this paper, we present several effective SIMD vectorization techniques such as ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

MICRO-44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture

December 2011

519 pages

ISBN:9781450310536

DOI:10.1145/2155620

Conference Chair:
Carlo Galuzzi
Technische Universiteit Delft, The Netherlands
,
General Chair:
Luigi Carro
Universidade Federal do Rio Grande do Sul, Brasil
,
Program Chairs:
Andreas Moshovos
University of Toronto, Canada
,
Milos Prvulovic
Georgia Institute of Technology, United States

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

IEEE
ACM: Association for Computing Machinery
UFRGS: Universidade Federal do Rio Grande do Sul
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 December 2011

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Qualifiers

Research-article

Funding Sources

Conference

MICRO-44

Sponsor:

ACM
UFRGS
SIGMICRO
IEEE-CS

MICRO-44: The 44th Annual IEEE/ACM International Symposium on Microarchitecture

December 3 - 7, 2011

Porto Alegre, Brazil

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

79
Total Citations
View Citations
618
Total Downloads

Downloads (Last 12 months)20
Downloads (Last 6 weeks)0

Reflects downloads up to 02 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Xekalaki MFumero JStratikopoulos ADoka KKatsakioris CBitsakos CKoziris NKotselidis C(2023)Enabling Transparent Acceleration of Big Data Frameworks Using Heterogeneous HardwareProceedings of the VLDB Endowment10.14778/3565838.356584215:13(3869-3882)Online publication date: 20-Jan-2023
https://dl.acm.org/doi/10.14778/3565838.3565842
Banerjee ANagasamudram RNaumann DNikouei M(2023)A Relational Program Logic with Data Abstraction and Dynamic FramingACM Transactions on Programming Languages and Systems10.1145/355149744:4(1-136)Online publication date: 10-Jan-2023
https://dl.acm.org/doi/10.1145/3551497
Park JKim SLee JKo J(2023)Scalable Color Quantization for Task-centric Image CompressionACM Transactions on Multimedia Computing, Communications, and Applications10.1145/355138919:2s(1-18)Online publication date: 17-Feb-2023
https://dl.acm.org/doi/10.1145/3551389
Shreeve BGralha CRashid AAraújo JGoulão M(2023)Making Sense of the Unknown: How Managers Make Cyber Security DecisionsACM Transactions on Software Engineering and Methodology10.1145/354868232:4(1-33)Online publication date: 27-May-2023
https://dl.acm.org/doi/10.1145/3548682
Brown NWeill-Tessier PSekula MCostache AKölling M(2022)Novice Use of the Java Programming LanguageACM Transactions on Computing Education10.1145/355139323:1(1-24)Online publication date: 29-Dec-2022
https://dl.acm.org/doi/10.1145/3551393
Nielsen SSkov MHansen KKaszowska A(2022)Using User-Generated YouTube Videos to Understand Unguided Interactions with Robots in Public PlacesACM Transactions on Human-Robot Interaction10.1145/355028012:1(1-40)Online publication date: 1-Aug-2022
https://dl.acm.org/doi/10.1145/3550280
Iliakis KXydis SSoudris D(2022)Repurposing GPU Microarchitectures with Light-Weight Out-Of-Order ExecutionIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.309323133:2(388-402)Online publication date: 1-Feb-2022
https://doi.org/10.1109/TPDS.2021.3093231
Damani SJohnson DStephenson MKeckler SYan EMcKeown MGiroux OMars JTang LXue JWu P(2020)Speculative reconvergence for improved SIMT efficiencyProceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization10.1145/3368826.3377911(121-132)Online publication date: 22-Feb-2020
https://dl.acm.org/doi/10.1145/3368826.3377911
Fioretto FDovier APontelli E(2019)Distributed multi-agent optimization for smart grids and home automationIntelligenza Artificiale10.3233/IA-18003712:2(67-87)Online publication date: 29-Jan-2019
https://doi.org/10.3233/IA-180037
Ríos JPaton NFernandes AAbel EKeane J(2019)Crowdsourced Targeted Feedback Collection for Multicriteria Data Source SelectionJournal of Data and Information Quality10.1145/328493411:1(1-27)Online publication date: 4-Jan-2019
https://dl.acm.org/doi/10.1145/3284934
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten