skip to main content
research-article

Dynamic thread block launch: a lightweight execution mechanism to support irregular applications on GPUs

Published: 13 June 2015 Publication History

Abstract

GPUs have been proven effective for structured applications that map well to the rigid 1D-3D grid of threads in modern bulk synchronous parallel (BSP) programming languages. However, less success has been encountered in mapping data intensive irregular applications such as graph analytics, relational databases, and machine learning. Recently introduced nested device-side kernel launching functionality in the GPU is a step in the right direction, but still falls short of being able to effectively harness the GPUs performance potential.
We propose a new mechanism called Dynamic Thread Block Launch (DTBL) to extend the current bulk synchronous parallel model underlying the current GPU execution model by supporting dynamic spawning of lightweight thread blocks. This mechanism supports the nested launching of thread blocks rather than kernels to execute dynamically occurring parallel work elements. This paper describes the execution model of DTBL, device-runtime support, and microarchitecture extensions to track and execute dynamically spawned thread blocks. Experiments with a set of irregular data intensive CUDA applications executing on a cycle-level simulator show that DTBL achieves average 1.21x speedup over the original flat implementation and average 1.40x over the implementation with device-side kernel launches using CUDA Dynamic Parallelism.

References

[1]
"Global flight network." {Online}. Available: http://www.visualizing.org/datasets/global-flights-network
[2]
J. Adriaens, K. Compton, N. S. Kim, and M. Schulte, "The case for gpgpu spatial multitasking," in High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on, 2012.
[3]
J. A. Anderson, C. D. Lorenz, and A. Travesset, "General purpose molecular dynamics simulations fully implemented on graphics processing units," Journal of Computational Physics, vol. 227, no. 10, pp. 5342--5359, 2008.
[4]
D. A. Bader, H. Meyerhenke, P. Sanders, and D. Wagner, "10th dimacs implementation challenge: Graph partitioning and graph clustering, 2011."
[5]
A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt, "Analyzing cuda workloads using a detailed gpu simulator," in 2009 IEEE International Symposium on Performance Analysis of Systems and Software(ISPASS 2009), April 2009, pp. 163--174.
[6]
L. Bergstrom and J. Reppy, "Nested data-parallelism on the gpu," in ACM SIGPLAN Notices, vol. 47, no. 9. ACM, 2012, pp. 247--258.
[7]
M. Burtscher, R. Nasre, and K. Pingali, "A quantitative study of irregular programs on gpus," in Workload Characterization (IISWC), 2012 IEEE International Symposium on. IEEE, 2012, pp. 141--151.
[8]
M. Burtscher and K. Pingali, "An efficient cu da implementation of the tree-based barnes hut n-body algorithm," GPU computing Gems Emerald edition, p. 75, 2011.
[9]
S. Che, B. M. Beckmann, S. K. Reinhardt, and K. Skadron, "Pannotia: Understanding irregular gpgpu graph applications," in Workload Characterization (IISWC), 2013 IEEE International Symposium on. IEEE, 2013, pp. 185--195.
[10]
J. Cohen and P. Castonguay, "Efficient graph matching and coloring on the gpu," in GPU Technology Conference, 2012.
[11]
B. W. Coon, J. R. Nickolls, J. E. Lindholm, R. J. Stoll, N. Wang, and J. H. Choquette, "Thread group scheduler for computing on a parallel thread processor," US Patent 8,732,713, 2014.
[12]
G. Diamos, H. Wu, J. Wang, A. Lele, and S. Yalamanchili, "Relational algorithms for multi-bulk-synchronous processors," in 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP'13), February 2013.
[13]
W. W. Fung, I. Sham, G. Yuan, and T. M. Aamodt, "Dynamic warp formation and scheduling for efficient gpu control flow," in Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2007, pp. 407--420.
[14]
B. R. Gaster and L. Howes, "Can gpgpu programming be liberated from the data-parallel bottleneck?" Computer, vol. 45, no. 8, pp. 42--52, 2012.
[15]
K. Gupta, J. A. Stuart, and J. D. Owens, "A study of persistent threads style gpu programming for gpgpu workloads," in Innovative Parallel Computing (InPar), 2012. IEEE, 2012, pp. 1--14.
[16]
J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl, "An algorithmic framework for performing collaborative filtering," in Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, 1999.
[17]
Khronos, "The opencl specification version 2.0," 2014.
[18]
A. Kuhl, "Thermodynamic states in explosion fields," in 14th International Symposium on Detonation, Coeur d'Alene Resort, ID, USA, 2010.
[19]
M. Kulkarni, M. Burtscher, C. Casçaval, and K. Pingali, "Lonestar: A suite of parallel irregular programs," in ISPASS '09: IEEE International Symposium on Performance Analysis of Systems and Software, 2009.
[20]
H. Lee, K. Brown, A. Sujeeth, T. Rompf, and K. Olukotun, "Locality-aware mapping of nested parallel patterns on gpus," in the 47th International Symposium on Microarchitecture (MICRO '14), 2014.
[21]
J. McHugh, "Testing intrusion detection systems: a critique of the 1998 and 1999 darpa intrusion detection system evaluations as performed by lincoln laboratory," ACM transactions on Information and system Security, vol. 3, no. 4, pp. 262--294, 2000.
[22]
J. Meng, D. Tarjan, and K. Skadron, "Dynamic warp subdivision for integrated branch and memory divergence tolerance," in ACM SIGARCH Computer Architecture News, vol. 38, no. 3, 2010, pp. 235--246.
[23]
D. Merrill, M. Garland, and A. Grimshaw, "Scalable gpu graph traversal," in In 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP'12, 2012.
[24]
J. Mosegaard and T. S. Sørensen, "Real-time deformation of detailed geometry based on mappings to a less detailed physical simulation on the gpu," in Proceedings of the 11th Eurographics conference on Virtual Environments. Eurographics Association, 2005, pp. 105--111.
[25]
C. H. Nadungodage, Y. Xia, J. J. Lee, M. Lee, and C. S. Park, "Gpu accelerated item-based collaborative filtering for big-data applications," in Big Data, 2013 IEEE International Conference on. IEEE, 2013, pp. 175--180.
[26]
NVIDIA, "Hyperq sample," 2012.
[27]
NVIDIA, "Cuda c programming guide version 6.5," 2014.
[28]
NVIDIA, "Cuda dynamic parallelism programming guide," 2014.
[29]
M. S. Orr, B. M. Beckmann, S. K. Reinhardt, and D. A. Wood, "Fine-grain task aggregation and coordination on gpus," in Proceeding of the 41st Annual International Symposium on Computer Architecuture, ser. ISCA '14, 2014.
[30]
S. G. Parker, J. Bigler, A. Dietrich, H. Friedrich, J. Hoberock, D. Luebke, D. McAllister, M. McGuire, K. Morley, A. Robison et al., "Optix: a general purpose ray tracing engine," in ACM Transactions on Graphics (TOG), vol. 29, no. 4. ACM, 2010, p. 66.
[31]
V. Podlozhnyuk, "Black-scholes option pricing," Nvidia, 2007.
[32]
T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache-conscious wavefront scheduling," in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012.
[33]
S. Solomon and P. Thulasiraman, "Performance study of mapping irregular computations on gpus," in Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on. IEEE, 2010, pp. 1--8.
[34]
M. Steffen and J. Zambreno, "Improving simt efficiency of global rendering algorithms with architectural support for dynamic micro-kernels," in Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO '43, 2010.
[35]
I. Tanasic, I. Gelado, J. Cabezas, A. Ramirez, N. Navarro, and M. Valero, "Enabling preemptive multiprogramming on gpus," in Proceeding of the 41st Annual International Symposium on Computer Architecuture, 2014.
[36]
J. Wang and S. Yalamanchili, "Characterization and analysis of dynamic parallelism in unstructured gpu applications," in 2014 IEEE International Symposium on Workload Characterization, October 2014.
[37]
L. Wang, S. Chen, Y. Tang, and J. Su, "Gregex: Gpu based high speed regular expression matching engine," in Innovative Mobile and Internet Services in Ubiquitous Computing (IMIS), 2011 Fifth International Conference on. IEEE, 2011, pp. 366--370.
[38]
H. Wu, G. Diamos, S. Cadambi, and S. Yalamanchili, "Kernel weaver: Automatically fusing database primitives for efficient gpu computation," in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012.
[39]
Y. Yang and H. Zhou, "Cuda-np: Realizing nested thread-level parallelism in gpgpu applications," in Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming, 2014.
[40]
E. Z. Zhang, Y. Jiang, Z. Guo, K. Tian, and X. Shen, "On-the-fly elimination of dynamic irregularities for gpu computing," in ACM SIGARCH Computer Architecture News, vol. 39, no. 1, 2011, pp. 369--380.

Cited By

View all
  • (2021)BlockMaestroProceedings of the 48th Annual International Symposium on Computer Architecture10.1109/ISCA52012.2021.00034(333-346)Online publication date: 14-Jun-2021
  • (2020)Independent forward progress of work-groupsProceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture10.1109/ISCA45697.2020.00087(1022-1035)Online publication date: 30-May-2020
  • (2019)SURAA: A Novel Method and Tool for Loadbalanced and Coalesced SpMV Computations on GPUsApplied Sciences10.3390/app90509479:5(947)Online publication date: 6-Mar-2019
  • Show More Cited By

Index Terms

  1. Dynamic thread block launch: a lightweight execution mechanism to support irregular applications on GPUs

        Recommendations

        Comments

        Information & Contributors

        Information

        Published In

        cover image ACM SIGARCH Computer Architecture News
        ACM SIGARCH Computer Architecture News  Volume 43, Issue 3S
        ISCA'15
        June 2015
        745 pages
        ISSN:0163-5964
        DOI:10.1145/2872887
        Issue’s Table of Contents
        • cover image ACM Conferences
          ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
          June 2015
          768 pages
          ISBN:9781450334020
          DOI:10.1145/2749469
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 13 June 2015
        Published in SIGARCH Volume 43, Issue 3S

        Check for updates

        Qualifiers

        • Research-article

        Funding Sources

        Contributors

        Other Metrics

        Bibliometrics & Citations

        Bibliometrics

        Article Metrics

        • Downloads (Last 12 months)17
        • Downloads (Last 6 weeks)1
        Reflects downloads up to 02 Mar 2025

        Other Metrics

        Citations

        Cited By

        View all
        • (2021)BlockMaestroProceedings of the 48th Annual International Symposium on Computer Architecture10.1109/ISCA52012.2021.00034(333-346)Online publication date: 14-Jun-2021
        • (2020)Independent forward progress of work-groupsProceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture10.1109/ISCA45697.2020.00087(1022-1035)Online publication date: 30-May-2020
        • (2019)SURAA: A Novel Method and Tool for Loadbalanced and Coalesced SpMV Computations on GPUsApplied Sciences10.3390/app90509479:5(947)Online publication date: 6-Mar-2019
        • (2019)From piz daint to the starsProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3295500.3356221(1-37)Online publication date: 17-Nov-2019
        • (2019)EDGE: Event-Driven GPU ExecutionProceedings of the International Conference on Parallel Architectures and Compilation Techniques10.1109/PACT.2019.00034(336-352)Online publication date: 23-Sep-2019
        • (2017)WireframeProceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture10.1145/3123939.3123976(600-611)Online publication date: 14-Oct-2017
        • (2017)Chai: Collaborative heterogeneous applications for integrated-architectures2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)10.1109/ISPASS.2017.7975269(43-54)Online publication date: Apr-2017
        • (2016)Warped-slicerProceedings of the 43rd International Symposium on Computer Architecture10.1109/ISCA.2016.29(230-242)Online publication date: 18-Jun-2016
        • (2016)Compiler-Assisted Workload Consolidation for Efficient Dynamic Parallelism on GPU2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS.2016.98(534-543)Online publication date: May-2016
        • (2016)GPU concurrency choices in graph analytics2016 IEEE International Symposium on Workload Characterization (IISWC)10.1109/IISWC.2016.7581278(1-10)Online publication date: Sep-2016
        • Show More Cited By

        View Options

        Login options

        View options

        PDF

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader

        Figures

        Tables

        Media

        Share

        Share

        Share this Publication link

        Share on social media