research-article

Public Access

Hybrid CPU-GPU scheduling and execution of tree traversals

Authors:

Milind KulkarniAuthors Info & Claims

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

Article No.: 2, Pages 1 - 12

https://doi.org/10.1145/2925426.2926261

Published: 01 June 2016 Publication History

Abstract

GPUs offer the promise of massive, power-efficient parallelism. However, exploiting this parallelism requires code to be carefully structured to deal with the limitations of the SIMT execution model. In recent years, there has been much interest in mapping irregular applications to GPUs: applications with unpredictable, data-dependent behaviors. While most of the work in this space has focused on ad hoc implementations of specific algorithms, recent work has looked at generic techniques for mapping a large class of tree traversal algorithms to GPUs, through careful restructuring of the tree traversal algorithms to make them behave more regularly. Unfortunately, even this general approach for GPU execution of tree traversal algorithms is reliant on ad hoc, hand-written, algorithm-specific scheduling (i.e., assignment of threads to warps) to achieve high performance.

The key challenge of scheduling is that it is a highly irregular process, that requires the inspection of thread behavior and then careful sorting of those threads into warps. In this paper, we present a novel scheduling and execution technique for tree traversal algorithms that is both general and automatic. The key novelty is a hybrid, inspector-executor approach: the GPU partially executes tasks to inspect thread behavior and transmits information back to the CPU, which uses that information to perform the scheduling itself, before executing the remaining, carefully scheduled, portion of the traversals on the GPU. We applied this framework to six tree traversal algorithms, achieving significant speedups over optimized GPU code that does not perform application-specific scheduling. Further, we show that in many cases, our hybrid approach is able to deliver better performance even than GPU code that uses handtuned, application-specific scheduling.

References

[1]

Timo Aila and Tero Karras. Architecture considerations for tracing incoherent rays. In Proceedings of the Conference on High Performance Graphics, HPG '10, pages 113--122, Aire-la-Ville, Switzerland, Switzerland, 2010. Eurographics Association.

Digital Library

[2]

Margarita Amor, Francisco Argüello, Juan López, Oscar G. Plata, and Emilio L. Zapata. A data parallel formulation of the barnes-hut method for n -body simulations. In Proceedings of the 5th International Workshop on Applied Parallel Computing, New Paradigms for HPC in Industry and Academia, pages 342--349, 2001.

Digital Library

[3]

J. Barnes and P. Hut. A hierarchical o(n log n) force-calculation algorithm. nature, 324:4, 1986.

[4]

Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM, 18:509--517, September 1975.

Digital Library

[5]

Martin Burtscher and Keshav Pingali. An efficient CUDA implementation of the tree-based barnes hut n-body algorithm. In GPU Computing Gems Emerald Edition, pages 75--92. Elsevier Inc., 2011.

[6]

Tim Foley and Jeremy Sugerman. Kd-tree acceleration structures for a gpu raytracer. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, HWWS '05, pages 15--22, 2005.

Digital Library

[7]

Michael Goldfarb, Youngjoon Jo, and Milind Kulkarni. General transformations for gpu execution of tree traversals. In Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, pages 10:1--10:12, New York, NY, USA, 2013. ACM.

Digital Library

[8]

Johannes Gunther, Stefan Popov, Hans-Peter Seidel, and Philipp Slusallek. Realtime ray tracing on gpu with bvh-based packet traversal. In Proceedings of the 2007 IEEE Symposium on Interactive Ray Tracing, RT '07, pages 113--118, 2007.

Digital Library

[9]

Hwansoo Han and Chau-Wen Tseng. Exploiting locality for irregular scientific codes. IEEE Trans. Parallel Distrib. Syst., 17:606--618, July 2006.

Digital Library

[10]

Michael Hapala, Tomas Davidovic, Ingo Wald, Vlastimil Havran, and Philipp Slusallek. Efficient Stack-less BVH Traversal for Ray Tracing. In Proceedings 27th Spring Conference of Computer Graphics (SCCG) 2011, pages 29--34, 2011.

Digital Library

[11]

Daniel Reiter Horn, Jeremy Sugerman, Mike Houston, and Pat Hanrahan. Interactive k-d tree gpu raytracing. In Proceedings of the 2007 symposium on Interactive 3D graphics and games, I3D '07, pages 167--174, 2007.

Digital Library

[12]

Youngjoon Jo, Michael Goldfarb, and Milind Kulkarni. Automatic vectorization of tree traversals. In Proceedings of the 22nd international conference on Parallel architectures and compilation techniques, PACT'13, pages 363--374. IEEE, 2013.

Digital Library

[13]

Youngjoon Jo and Milind Kulkarni. Automatically enhancing locality for tree traversals with traversal splicing. In Proceedings of the ACM international conference on Object oriented programming systems languages and applications, OOPSLA '12, pages 355--374, New York, NY, USA, 2012. ACM.

Digital Library

[14]

Rashid Kaleem, Rajkishore Barik, Tatiana Shpeisman, Brian T. Lewis, Chunling Hu, and Keshav Pingali. Adaptive heterogeneous scheduling for integrated gpus. In Proceedings of the 23rd international conference on Parallel architectures and compilation, pages 151--162. ACM New York, NY, USA, August 2014.

Digital Library

[15]

Chi-Keung Luk, Sunpyo Hong, and Hyesoon Kim. Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 45--55. ACM, December 2009.

Digital Library

[16]

M. Méndez-Lojo, M. Burtscher, and K. Pingali. A gpu implementation of inclusion-based points-to analysis. In Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, pages 107--116. ACM, 2012.

Digital Library

[17]

D. Merrill, M. Garland, and A. Grimshaw. Scalable gpu graph traversal. In Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, pages 117--128, 2012.

Digital Library

[18]

Bochang Moon, Yongyoung Byun, Tae-Joon Kim, Pio Claudio, Hye-Sun Kim, Yun-Ji Ban, Seung Woo Nam, and Sung-Eui Yoon. Cache-oblivious ray reordering. ACM Trans. Graph., 29(3):28:1--28:10, July 2010.

Digital Library

[19]

Paul Arthur Navratil. Memory-efficient, scalable ray tracing. PhD thesis, 2010.

[20]

Paul Arthur Navratil, Donald S. Fussell, Calvin Lin, and William R. Mark. Dynamic ray scheduling to improve ray coherence and bandwidth utilization. In Proceedings of the 2007 IEEE Symposium on Interactive Ray Tracing, RT '07, pages 95--104, Washington, DC, USA, 2007. IEEE Computer Society.

Digital Library

[21]

Matt Pharr, Craig Kolb, Reid Gershbein, and Pat Hanrahan. Rendering complex scenes with memory-coherent ray tracing. In Proceedings of the 24th annual conference on Computer graphics and interactive techniques, pages 101--108, 1997.

Digital Library

[22]

Venkata K. Pingali, Sally A. McKee, Wilson C. Hseih, and John B. Carter. Computation regrouping: Restructuring programs for temporal data cache locality. In Proceedings of the 16th International Conference on Supercomputing, ICS '02, pages 252--261, New York, NY, USA, 2002. ACM.

Digital Library

[23]

Stefan Popov, Johannes Günther, Hans-Peter Seidel, and Philipp Slusallek. Stackless kd-tree traversal for high performance GPU ray tracing. Computer Graphics Forum, 26(3):415--424, September 2007. (Proceedings of Eurographics).

[24]

Vignesh T. Ravi and Gagan Agrawal. A dynamic scheduling framework for emerging heterogeneous systems. In Proceedings of the 2011 18th International Conference on High Performance Computing, pages 1--10. IEEE Computer Society Washington, DC, USA, 2011.

Digital Library

[25]

Vignesh T. Ravi, Wenjing Ma, and Gagan Agrawal. Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations. In Proceeding of the 24th ACM International Conference on Supercomputing, pages 137--146. ACM, June 2010.

Digital Library

[26]

John K. Salmon. Parallel hierarchical N-body methods. PhD thesis, California Institute of Technology, 1991.

Digital Library

[27]

Joel H. Saltz and Ravi Mirchandaney. Run-time parallelization and scheduling of loops. In IEEE Transactions on Computers, volume 40, pages 603--612. IEEE, May 1991.

Digital Library

[28]

Jaswinder Pal Singh, Chris Holt, Takashi Totsuka, Anoop Gupta, and John Hennessy. Load balancing and data locality in adaptive hierarchical n-body methods: Barnes-hut, fast multipole, and radiosity. J. Parallel Distrib. Comput., 27(2):118--141, 1995.

Digital Library

[29]

Ren Wu, Bin Zhang, and Meichun Hsu. Clustering billions of data points using gpus. In Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop, pages 1--6. ACM, May 2009.

Digital Library

[30]

Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, Kai Tian, and Xipeng Shen. On-the-fly elimination of dynamic irregularities for gpu computing. In Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems, pages 369--380. ACM New York, NY, USA ©2011, March 2011.

Digital Library

[31]

Xingbin Zhang and Andrew A. Chien. Dynamic pointer alignment: Tiling and communication optimizations for parallel pointer-based computations. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP '97, pages 37--47, New York, NY, USA, 1997. ACM.

Digital Library

Cited By

Prasad ARajendra SRajan KGovindarajan RBondhugula UWitchel EArpaci-Dusseau ARossbach CKeeton K(2024)SilvanForge: A Schedule-Guided Retargetable Compiler for Decision Tree InferenceProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles10.1145/3694715.3695958(488-504)Online publication date: 4-Nov-2024
https://dl.acm.org/doi/10.1145/3694715.3695958
Shah DYang NAamodt TSolihin YHeinrich M(2023)Energy-Efficient Realtime Motion PlanningProceedings of the 50th Annual International Symposium on Computer Architecture10.1145/3579371.3589092(1-17)Online publication date: 17-Jun-2023
https://dl.acm.org/doi/10.1145/3579371.3589092
Prasad ARajendra SRajan KGovindarajan RBondhugula UHardavellas NCampanoni SGrot BKarpuzcu U(2022)Treebeard: An Optimizing Compiler for Decision Tree Based ML InferenceProceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture10.1109/MICRO56248.2022.00043(494-511)Online publication date: 1-Oct-2022
https://dl.acm.org/doi/10.1109/MICRO56248.2022.00043
Show More Cited By

Recommendations

General transformations for GPU execution of tree traversals
SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

With the advent of programmer-friendly GPU computing environments, there has been much interest in offloading workloads that can exploit the high degree of parallelism available on modern GPUs. Exploiting this parallelism and optimizing for the GPU ...
Hybrid CPU-GPU scheduling and execution of tree traversals
PPoPP '16: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

GPUs offer the promise of massive, power-efficient parallelism. However, exploiting this parallelism requires code to be carefully structured to deal with the limitations of the SIMT execution model. In recent years, there has been much interest in ...
Hybrid CPU-GPU scheduling and execution of tree traversals
PPoPP '16

GPUs offer the promise of massive, power-efficient parallelism. However, exploiting this parallelism requires code to be carefully structured to deal with the limitations of the SIMT execution model. In recent years, there has been much interest in ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

June 2016

547 pages

ISBN:9781450343619

DOI:10.1145/2925426

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 2016

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Conference

ICS '16

Sponsor:

SIGARCH

ICS '16: 2016 International Conference on Supercomputing

June 1 - 3, 2016

Istanbul, Turkey

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

8
Total Citations
View Citations
1,998
Total Downloads

Downloads (Last 12 months)1,330
Downloads (Last 6 weeks)274

Reflects downloads up to 09 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Prasad ARajendra SRajan KGovindarajan RBondhugula UWitchel EArpaci-Dusseau ARossbach CKeeton K(2024)SilvanForge: A Schedule-Guided Retargetable Compiler for Decision Tree InferenceProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles10.1145/3694715.3695958(488-504)Online publication date: 4-Nov-2024
https://dl.acm.org/doi/10.1145/3694715.3695958
Shah DYang NAamodt TSolihin YHeinrich M(2023)Energy-Efficient Realtime Motion PlanningProceedings of the 50th Annual International Symposium on Computer Architecture10.1145/3579371.3589092(1-17)Online publication date: 17-Jun-2023
https://dl.acm.org/doi/10.1145/3579371.3589092
Prasad ARajendra SRajan KGovindarajan RBondhugula UHardavellas NCampanoni SGrot BKarpuzcu U(2022)Treebeard: An Optimizing Compiler for Decision Tree Based ML InferenceProceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture10.1109/MICRO56248.2022.00043(494-511)Online publication date: 1-Oct-2022
https://dl.acm.org/doi/10.1109/MICRO56248.2022.00043
Monil MBelviranli MLee SVetter JMalony ASarkar VKim H(2020)MEPHESTOProceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques10.1145/3410463.3414671(413-425)Online publication date: 30-Sep-2020
https://dl.acm.org/doi/10.1145/3410463.3414671
Liu JRobson MQuinn TKulkarni MEigenmann RDing CMcKee S(2019)Efficient GPU tree walks for effective distributed n-body simulationsProceedings of the ACM International Conference on Supercomputing10.1145/3330345.3330348(24-34)Online publication date: 26-Jun-2019
https://dl.acm.org/doi/10.1145/3330345.3330348
Sakdhnagool PSabne AEigenmann RHollingsworth JKeidar I(2019)Optimizing GPU programs by register demotionProceedings of the 24th Symposium on Principles and Practice of Parallel Programming10.1145/3293883.3297859(405-406)Online publication date: 16-Feb-2019
https://dl.acm.org/doi/10.1145/3293883.3297859
Zhang FXue J(2018)PokerACM Transactions on Architecture and Code Optimization10.1145/328085015:4(1-28)Online publication date: 16-Nov-2018
https://dl.acm.org/doi/10.1145/3280850
Zhang FXue JKnoop JSchordan MJohnson TO'Boyle M(2018)Poker: permutation-based SIMD execution of intensive tree search by path encodingProceedings of the 2018 International Symposium on Code Generation and Optimization10.1145/3168808(87-99)Online publication date: 24-Feb-2018
https://dl.acm.org/doi/10.1145/3168808

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Figures

Tables

Media

View Table of Conten