research-article

Dynamic thread block launch: a lightweight execution mechanism to support irregular applications on GPUs

Authors:

Albert Sidelnik,

Sudhakar YalamanchiliAuthors Info & Claims

ACM SIGARCH Computer Architecture News, Volume 43, Issue 3S

Pages 528 - 540

https://doi.org/10.1145/2872887.2750393

Published: 13 June 2015 Publication History

Abstract

GPUs have been proven effective for structured applications that map well to the rigid 1D-3D grid of threads in modern bulk synchronous parallel (BSP) programming languages. However, less success has been encountered in mapping data intensive irregular applications such as graph analytics, relational databases, and machine learning. Recently introduced nested device-side kernel launching functionality in the GPU is a step in the right direction, but still falls short of being able to effectively harness the GPUs performance potential.

We propose a new mechanism called Dynamic Thread Block Launch (DTBL) to extend the current bulk synchronous parallel model underlying the current GPU execution model by supporting dynamic spawning of lightweight thread blocks. This mechanism supports the nested launching of thread blocks rather than kernels to execute dynamically occurring parallel work elements. This paper describes the execution model of DTBL, device-runtime support, and microarchitecture extensions to track and execute dynamically spawned thread blocks. Experiments with a set of irregular data intensive CUDA applications executing on a cycle-level simulator show that DTBL achieves average 1.21x speedup over the original flat implementation and average 1.40x over the implementation with device-side kernel launches using CUDA Dynamic Parallelism.

References

[1]

"Global flight network." {Online}. Available: http://www.visualizing.org/datasets/global-flights-network

[2]

J. Adriaens, K. Compton, N. S. Kim, and M. Schulte, "The case for gpgpu spatial multitasking," in High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on, 2012.

Digital Library

[3]

J. A. Anderson, C. D. Lorenz, and A. Travesset, "General purpose molecular dynamics simulations fully implemented on graphics processing units," Journal of Computational Physics, vol. 227, no. 10, pp. 5342--5359, 2008.

Digital Library

[4]

D. A. Bader, H. Meyerhenke, P. Sanders, and D. Wagner, "10th dimacs implementation challenge: Graph partitioning and graph clustering, 2011."

[5]

A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt, "Analyzing cuda workloads using a detailed gpu simulator," in 2009 IEEE International Symposium on Performance Analysis of Systems and Software(ISPASS 2009), April 2009, pp. 163--174.

[6]

L. Bergstrom and J. Reppy, "Nested data-parallelism on the gpu," in ACM SIGPLAN Notices, vol. 47, no. 9. ACM, 2012, pp. 247--258.

Digital Library

[7]

M. Burtscher, R. Nasre, and K. Pingali, "A quantitative study of irregular programs on gpus," in Workload Characterization (IISWC), 2012 IEEE International Symposium on. IEEE, 2012, pp. 141--151.

Digital Library

[8]

M. Burtscher and K. Pingali, "An efficient cu da implementation of the tree-based barnes hut n-body algorithm," GPU computing Gems Emerald edition, p. 75, 2011.

[9]

S. Che, B. M. Beckmann, S. K. Reinhardt, and K. Skadron, "Pannotia: Understanding irregular gpgpu graph applications," in Workload Characterization (IISWC), 2013 IEEE International Symposium on. IEEE, 2013, pp. 185--195.

[10]

J. Cohen and P. Castonguay, "Efficient graph matching and coloring on the gpu," in GPU Technology Conference, 2012.

[11]

B. W. Coon, J. R. Nickolls, J. E. Lindholm, R. J. Stoll, N. Wang, and J. H. Choquette, "Thread group scheduler for computing on a parallel thread processor," US Patent 8,732,713, 2014.

[12]

G. Diamos, H. Wu, J. Wang, A. Lele, and S. Yalamanchili, "Relational algorithms for multi-bulk-synchronous processors," in 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP'13), February 2013.

Digital Library

[13]

W. W. Fung, I. Sham, G. Yuan, and T. M. Aamodt, "Dynamic warp formation and scheduling for efficient gpu control flow," in Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2007, pp. 407--420.

Digital Library

[14]

B. R. Gaster and L. Howes, "Can gpgpu programming be liberated from the data-parallel bottleneck?" Computer, vol. 45, no. 8, pp. 42--52, 2012.

Digital Library

[15]

K. Gupta, J. A. Stuart, and J. D. Owens, "A study of persistent threads style gpu programming for gpgpu workloads," in Innovative Parallel Computing (InPar), 2012. IEEE, 2012, pp. 1--14.

[16]

J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl, "An algorithmic framework for performing collaborative filtering," in Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, 1999.

Digital Library

[17]

Khronos, "The opencl specification version 2.0," 2014.

[18]

A. Kuhl, "Thermodynamic states in explosion fields," in 14th International Symposium on Detonation, Coeur d'Alene Resort, ID, USA, 2010.

[19]

M. Kulkarni, M. Burtscher, C. Casçaval, and K. Pingali, "Lonestar: A suite of parallel irregular programs," in ISPASS '09: IEEE International Symposium on Performance Analysis of Systems and Software, 2009.

[20]

H. Lee, K. Brown, A. Sujeeth, T. Rompf, and K. Olukotun, "Locality-aware mapping of nested parallel patterns on gpus," in the 47th International Symposium on Microarchitecture (MICRO '14), 2014.

Digital Library

[21]

J. McHugh, "Testing intrusion detection systems: a critique of the 1998 and 1999 darpa intrusion detection system evaluations as performed by lincoln laboratory," ACM transactions on Information and system Security, vol. 3, no. 4, pp. 262--294, 2000.

Digital Library

[22]

J. Meng, D. Tarjan, and K. Skadron, "Dynamic warp subdivision for integrated branch and memory divergence tolerance," in ACM SIGARCH Computer Architecture News, vol. 38, no. 3, 2010, pp. 235--246.

Digital Library

[23]

D. Merrill, M. Garland, and A. Grimshaw, "Scalable gpu graph traversal," in In 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP'12, 2012.

Digital Library

[24]

J. Mosegaard and T. S. Sørensen, "Real-time deformation of detailed geometry based on mappings to a less detailed physical simulation on the gpu," in Proceedings of the 11th Eurographics conference on Virtual Environments. Eurographics Association, 2005, pp. 105--111.

Digital Library

[25]

C. H. Nadungodage, Y. Xia, J. J. Lee, M. Lee, and C. S. Park, "Gpu accelerated item-based collaborative filtering for big-data applications," in Big Data, 2013 IEEE International Conference on. IEEE, 2013, pp. 175--180.

[26]

NVIDIA, "Hyperq sample," 2012.

[27]

NVIDIA, "Cuda c programming guide version 6.5," 2014.

[28]

NVIDIA, "Cuda dynamic parallelism programming guide," 2014.

[29]

M. S. Orr, B. M. Beckmann, S. K. Reinhardt, and D. A. Wood, "Fine-grain task aggregation and coordination on gpus," in Proceeding of the 41st Annual International Symposium on Computer Architecuture, ser. ISCA '14, 2014.

Digital Library

[30]

S. G. Parker, J. Bigler, A. Dietrich, H. Friedrich, J. Hoberock, D. Luebke, D. McAllister, M. McGuire, K. Morley, A. Robison et al., "Optix: a general purpose ray tracing engine," in ACM Transactions on Graphics (TOG), vol. 29, no. 4. ACM, 2010, p. 66.

Digital Library

[31]

V. Podlozhnyuk, "Black-scholes option pricing," Nvidia, 2007.

[32]

T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache-conscious wavefront scheduling," in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012.

Digital Library

[33]

S. Solomon and P. Thulasiraman, "Performance study of mapping irregular computations on gpus," in Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on. IEEE, 2010, pp. 1--8.

[34]

M. Steffen and J. Zambreno, "Improving simt efficiency of global rendering algorithms with architectural support for dynamic micro-kernels," in Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO '43, 2010.

Digital Library

[35]

I. Tanasic, I. Gelado, J. Cabezas, A. Ramirez, N. Navarro, and M. Valero, "Enabling preemptive multiprogramming on gpus," in Proceeding of the 41st Annual International Symposium on Computer Architecuture, 2014.

Digital Library

[36]

J. Wang and S. Yalamanchili, "Characterization and analysis of dynamic parallelism in unstructured gpu applications," in 2014 IEEE International Symposium on Workload Characterization, October 2014.

[37]

L. Wang, S. Chen, Y. Tang, and J. Su, "Gregex: Gpu based high speed regular expression matching engine," in Innovative Mobile and Internet Services in Ubiquitous Computing (IMIS), 2011 Fifth International Conference on. IEEE, 2011, pp. 366--370.

Digital Library

[38]

H. Wu, G. Diamos, S. Cadambi, and S. Yalamanchili, "Kernel weaver: Automatically fusing database primitives for efficient gpu computation," in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012.

Digital Library

[39]

Y. Yang and H. Zhou, "Cuda-np: Realizing nested thread-level parallelism in gpgpu applications," in Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming, 2014.

Digital Library

[40]

E. Z. Zhang, Y. Jiang, Z. Guo, K. Tian, and X. Shen, "On-the-fly elimination of dynamic irregularities for gpu computing," in ACM SIGARCH Computer Architecture News, vol. 39, no. 1, 2011, pp. 369--380.

Digital Library

Cited By

Abdolrashidi AEsfeden HJahanshahi ASingh KAbu-Ghazaleh NWong DMartínez JDuato JJohn L(2021)BlockMaestroProceedings of the 48th Annual International Symposium on Computer Architecture10.1109/ISCA52012.2021.00034(333-346)Online publication date: 14-Jun-2021
https://dl.acm.org/doi/10.1109/ISCA52012.2021.00034
Duţu ASinclair MBeckmann BWood DChow MMartínez JDuato JEeckhout L(2020)Independent forward progress of work-groupsProceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture10.1109/ISCA45697.2020.00087(1022-1035)Online publication date: 30-May-2020
https://dl.acm.org/doi/10.1109/ISCA45697.2020.00087
Muhammed TMehmood RAlbeshri AKatib I(2019)SURAA: A Novel Method and Tool for Loadbalanced and Coalesced SpMV Computations on GPUsApplied Sciences10.3390/app90509479:5(947)Online publication date: 6-Mar-2019
https://doi.org/10.3390/app9050947
Show More Cited By

Index Terms

Dynamic thread block launch: a lightweight execution mechanism to support irregular applications on GPUs
1. Computing methodologies
  1. Computer graphics
    1. Graphics systems and interfaces
      1. Graphics processors
2. Hardware
  1. Electronic design automation
    1. Physical design (EDA)
  2. Hardware validation

Recommendations

Free launch: optimizing GPU dynamic kernel launches through thread reuse
MICRO-48: Proceedings of the 48th International Symposium on Microarchitecture

Supporting dynamic parallelism is important for GPU to benefit a broad range of applications. There are currently two fundamental ways for programs to exploit dynamic parallelism on GPU: a software-based approach with software-managed worklists, and a ...
Dynamic thread block launch: a lightweight execution mechanism to support irregular applications on GPUs
ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture

GPUs have been proven effective for structured applications that map well to the rigid 1D-3D grid of threads in modern bulk synchronous parallel (BSP) programming languages. However, less success has been encountered in mapping data intensive irregular ...
Thread shuffling: combining DVFS and thread migration toreduce energy consumptions for multi-core systems
ISLPED '11: Proceedings of the 17th IEEE/ACM international symposium on Low-power electronics and design

In recent years, multi-core systems have become mainstream in computer industry. The design of multi-cores takes advantage of thread-level parallelism in emerging applications that are computationally intensive and highly parallel. Energy efficiency is ...

Comments

Information & Contributors

Information

Published In

cover image ACM SIGARCH Computer Architecture News

ACM SIGARCH Computer Architecture News Volume 43, Issue 3S

ISCA'15

June 2015

745 pages

ISSN:0163-5964

DOI:10.1145/2872887

Editor:
Doug DeGroot
acm dot org

Issue’s Table of Contents

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
June 2015
768 pages
ISBN:9781450334020
DOI:10.1145/2749469
General Chair:
Debbie Marr
Intel
,
Program Chair:
David Albonesi
Cornell

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 June 2015

Published in SIGARCH Volume 43, Issue 3S

Check for updates

Qualifiers

Research-article

Funding Sources

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

49
Total Citations
View Citations
689
Total Downloads

Downloads (Last 12 months)17
Downloads (Last 6 weeks)1

Reflects downloads up to 02 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Abdolrashidi AEsfeden HJahanshahi ASingh KAbu-Ghazaleh NWong DMartínez JDuato JJohn L(2021)BlockMaestroProceedings of the 48th Annual International Symposium on Computer Architecture10.1109/ISCA52012.2021.00034(333-346)Online publication date: 14-Jun-2021
https://dl.acm.org/doi/10.1109/ISCA52012.2021.00034
Duţu ASinclair MBeckmann BWood DChow MMartínez JDuato JEeckhout L(2020)Independent forward progress of work-groupsProceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture10.1109/ISCA45697.2020.00087(1022-1035)Online publication date: 30-May-2020
https://dl.acm.org/doi/10.1109/ISCA45697.2020.00087
Muhammed TMehmood RAlbeshri AKatib I(2019)SURAA: A Novel Method and Tool for Loadbalanced and Coalesced SpMV Computations on GPUsApplied Sciences10.3390/app90509479:5(947)Online publication date: 6-Mar-2019
https://doi.org/10.3390/app9050947
Daiß GAmini PBiddiscombe JDiehl PFrank JHuck KKaiser HMarcello DPfander DPfüger DTaufer MBalaji PPeña A(2019)From piz daint to the starsProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3295500.3356221(1-37)Online publication date: 17-Nov-2019
https://dl.acm.org/doi/10.1145/3295500.3356221
Hetherington TLubeznov MShah DAamodt T(2019)EDGE: Event-Driven GPU ExecutionProceedings of the International Conference on Parallel Architectures and Compilation Techniques10.1109/PACT.2019.00034(336-352)Online publication date: 23-Sep-2019
https://dl.acm.org/doi/10.1109/PACT.2019.00034
Abdolrashidi ATripathy DBelviranli MBhuyan LWong DHunter HMoreno JEmer JSanchez D(2017)WireframeProceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture10.1145/3123939.3123976(600-611)Online publication date: 14-Oct-2017
https://dl.acm.org/doi/10.1145/3123939.3123976
Gomez-Luna JHajj IChang LGarcia-Flores Vde Gonzalo SJablin TPena AHwu W(2017)Chai: Collaborative heterogeneous applications for integrated-architectures2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)10.1109/ISPASS.2017.7975269(43-54)Online publication date: Apr-2017
https://doi.org/10.1109/ISPASS.2017.7975269
Xu QJeon HKim KRo WAnnavaram MMin SLoh G(2016)Warped-slicerProceedings of the 43rd International Symposium on Computer Architecture10.1109/ISCA.2016.29(230-242)Online publication date: 18-Jun-2016
https://dl.acm.org/doi/10.1109/ISCA.2016.29
Wu HLi DBecchi M(2016)Compiler-Assisted Workload Consolidation for Efficient Dynamic Parallelism on GPU2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS.2016.98(534-543)Online publication date: May-2016
https://doi.org/10.1109/IPDPS.2016.98
Ahmad MKhan O(2016)GPU concurrency choices in graph analytics2016 IEEE International Symposium on Workload Characterization (IISWC)10.1109/IISWC.2016.7581278(1-10)Online publication date: Sep-2016
https://doi.org/10.1109/IISWC.2016.7581278
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Issue’s Table of Contents