skip to main content
10.1145/2925426.2926261acmconferencesArticle/Chapter ViewAbstractPublication PagesicsConference Proceedingsconference-collections
research-article
Public Access

Hybrid CPU-GPU scheduling and execution of tree traversals

Published: 01 June 2016 Publication History

Abstract

GPUs offer the promise of massive, power-efficient parallelism. However, exploiting this parallelism requires code to be carefully structured to deal with the limitations of the SIMT execution model. In recent years, there has been much interest in mapping irregular applications to GPUs: applications with unpredictable, data-dependent behaviors. While most of the work in this space has focused on ad hoc implementations of specific algorithms, recent work has looked at generic techniques for mapping a large class of tree traversal algorithms to GPUs, through careful restructuring of the tree traversal algorithms to make them behave more regularly. Unfortunately, even this general approach for GPU execution of tree traversal algorithms is reliant on ad hoc, hand-written, algorithm-specific scheduling (i.e., assignment of threads to warps) to achieve high performance.
The key challenge of scheduling is that it is a highly irregular process, that requires the inspection of thread behavior and then careful sorting of those threads into warps. In this paper, we present a novel scheduling and execution technique for tree traversal algorithms that is both general and automatic. The key novelty is a hybrid, inspector-executor approach: the GPU partially executes tasks to inspect thread behavior and transmits information back to the CPU, which uses that information to perform the scheduling itself, before executing the remaining, carefully scheduled, portion of the traversals on the GPU. We applied this framework to six tree traversal algorithms, achieving significant speedups over optimized GPU code that does not perform application-specific scheduling. Further, we show that in many cases, our hybrid approach is able to deliver better performance even than GPU code that uses handtuned, application-specific scheduling.

References

[1]
Timo Aila and Tero Karras. Architecture considerations for tracing incoherent rays. In Proceedings of the Conference on High Performance Graphics, HPG '10, pages 113--122, Aire-la-Ville, Switzerland, Switzerland, 2010. Eurographics Association.
[2]
Margarita Amor, Francisco Argüello, Juan López, Oscar G. Plata, and Emilio L. Zapata. A data parallel formulation of the barnes-hut method for n -body simulations. In Proceedings of the 5th International Workshop on Applied Parallel Computing, New Paradigms for HPC in Industry and Academia, pages 342--349, 2001.
[3]
J. Barnes and P. Hut. A hierarchical o(n log n) force-calculation algorithm. nature, 324:4, 1986.
[4]
Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM, 18:509--517, September 1975.
[5]
Martin Burtscher and Keshav Pingali. An efficient CUDA implementation of the tree-based barnes hut n-body algorithm. In GPU Computing Gems Emerald Edition, pages 75--92. Elsevier Inc., 2011.
[6]
Tim Foley and Jeremy Sugerman. Kd-tree acceleration structures for a gpu raytracer. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, HWWS '05, pages 15--22, 2005.
[7]
Michael Goldfarb, Youngjoon Jo, and Milind Kulkarni. General transformations for gpu execution of tree traversals. In Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, pages 10:1--10:12, New York, NY, USA, 2013. ACM.
[8]
Johannes Gunther, Stefan Popov, Hans-Peter Seidel, and Philipp Slusallek. Realtime ray tracing on gpu with bvh-based packet traversal. In Proceedings of the 2007 IEEE Symposium on Interactive Ray Tracing, RT '07, pages 113--118, 2007.
[9]
Hwansoo Han and Chau-Wen Tseng. Exploiting locality for irregular scientific codes. IEEE Trans. Parallel Distrib. Syst., 17:606--618, July 2006.
[10]
Michael Hapala, Tomas Davidovic, Ingo Wald, Vlastimil Havran, and Philipp Slusallek. Efficient Stack-less BVH Traversal for Ray Tracing. In Proceedings 27th Spring Conference of Computer Graphics (SCCG) 2011, pages 29--34, 2011.
[11]
Daniel Reiter Horn, Jeremy Sugerman, Mike Houston, and Pat Hanrahan. Interactive k-d tree gpu raytracing. In Proceedings of the 2007 symposium on Interactive 3D graphics and games, I3D '07, pages 167--174, 2007.
[12]
Youngjoon Jo, Michael Goldfarb, and Milind Kulkarni. Automatic vectorization of tree traversals. In Proceedings of the 22nd international conference on Parallel architectures and compilation techniques, PACT'13, pages 363--374. IEEE, 2013.
[13]
Youngjoon Jo and Milind Kulkarni. Automatically enhancing locality for tree traversals with traversal splicing. In Proceedings of the ACM international conference on Object oriented programming systems languages and applications, OOPSLA '12, pages 355--374, New York, NY, USA, 2012. ACM.
[14]
Rashid Kaleem, Rajkishore Barik, Tatiana Shpeisman, Brian T. Lewis, Chunling Hu, and Keshav Pingali. Adaptive heterogeneous scheduling for integrated gpus. In Proceedings of the 23rd international conference on Parallel architectures and compilation, pages 151--162. ACM New York, NY, USA, August 2014.
[15]
Chi-Keung Luk, Sunpyo Hong, and Hyesoon Kim. Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 45--55. ACM, December 2009.
[16]
M. Méndez-Lojo, M. Burtscher, and K. Pingali. A gpu implementation of inclusion-based points-to analysis. In Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, pages 107--116. ACM, 2012.
[17]
D. Merrill, M. Garland, and A. Grimshaw. Scalable gpu graph traversal. In Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, pages 117--128, 2012.
[18]
Bochang Moon, Yongyoung Byun, Tae-Joon Kim, Pio Claudio, Hye-Sun Kim, Yun-Ji Ban, Seung Woo Nam, and Sung-Eui Yoon. Cache-oblivious ray reordering. ACM Trans. Graph., 29(3):28:1--28:10, July 2010.
[19]
Paul Arthur Navratil. Memory-efficient, scalable ray tracing. PhD thesis, 2010.
[20]
Paul Arthur Navratil, Donald S. Fussell, Calvin Lin, and William R. Mark. Dynamic ray scheduling to improve ray coherence and bandwidth utilization. In Proceedings of the 2007 IEEE Symposium on Interactive Ray Tracing, RT '07, pages 95--104, Washington, DC, USA, 2007. IEEE Computer Society.
[21]
Matt Pharr, Craig Kolb, Reid Gershbein, and Pat Hanrahan. Rendering complex scenes with memory-coherent ray tracing. In Proceedings of the 24th annual conference on Computer graphics and interactive techniques, pages 101--108, 1997.
[22]
Venkata K. Pingali, Sally A. McKee, Wilson C. Hseih, and John B. Carter. Computation regrouping: Restructuring programs for temporal data cache locality. In Proceedings of the 16th International Conference on Supercomputing, ICS '02, pages 252--261, New York, NY, USA, 2002. ACM.
[23]
Stefan Popov, Johannes Günther, Hans-Peter Seidel, and Philipp Slusallek. Stackless kd-tree traversal for high performance GPU ray tracing. Computer Graphics Forum, 26(3):415--424, September 2007. (Proceedings of Eurographics).
[24]
Vignesh T. Ravi and Gagan Agrawal. A dynamic scheduling framework for emerging heterogeneous systems. In Proceedings of the 2011 18th International Conference on High Performance Computing, pages 1--10. IEEE Computer Society Washington, DC, USA, 2011.
[25]
Vignesh T. Ravi, Wenjing Ma, and Gagan Agrawal. Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations. In Proceeding of the 24th ACM International Conference on Supercomputing, pages 137--146. ACM, June 2010.
[26]
John K. Salmon. Parallel hierarchical N-body methods. PhD thesis, California Institute of Technology, 1991.
[27]
Joel H. Saltz and Ravi Mirchandaney. Run-time parallelization and scheduling of loops. In IEEE Transactions on Computers, volume 40, pages 603--612. IEEE, May 1991.
[28]
Jaswinder Pal Singh, Chris Holt, Takashi Totsuka, Anoop Gupta, and John Hennessy. Load balancing and data locality in adaptive hierarchical n-body methods: Barnes-hut, fast multipole, and radiosity. J. Parallel Distrib. Comput., 27(2):118--141, 1995.
[29]
Ren Wu, Bin Zhang, and Meichun Hsu. Clustering billions of data points using gpus. In Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop, pages 1--6. ACM, May 2009.
[30]
Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, Kai Tian, and Xipeng Shen. On-the-fly elimination of dynamic irregularities for gpu computing. In Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems, pages 369--380. ACM New York, NY, USA ©2011, March 2011.
[31]
Xingbin Zhang and Andrew A. Chien. Dynamic pointer alignment: Tiling and communication optimizations for parallel pointer-based computations. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP '97, pages 37--47, New York, NY, USA, 1997. ACM.

Cited By

View all
  • (2024)SilvanForge: A Schedule-Guided Retargetable Compiler for Decision Tree InferenceProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles10.1145/3694715.3695958(488-504)Online publication date: 4-Nov-2024
  • (2023)Energy-Efficient Realtime Motion PlanningProceedings of the 50th Annual International Symposium on Computer Architecture10.1145/3579371.3589092(1-17)Online publication date: 17-Jun-2023
  • (2022)Treebeard: An Optimizing Compiler for Decision Tree Based ML InferenceProceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture10.1109/MICRO56248.2022.00043(494-511)Online publication date: 1-Oct-2022
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
ICS '16: Proceedings of the 2016 International Conference on Supercomputing
June 2016
547 pages
ISBN:9781450343619
DOI:10.1145/2925426
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 2016

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. GPU
  2. Heterogeneous architectures
  3. Irregular applications
  4. Scheduling
  5. Tree traversal

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

Conference

ICS '16
Sponsor:

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)1,330
  • Downloads (Last 6 weeks)274
Reflects downloads up to 09 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2024)SilvanForge: A Schedule-Guided Retargetable Compiler for Decision Tree InferenceProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles10.1145/3694715.3695958(488-504)Online publication date: 4-Nov-2024
  • (2023)Energy-Efficient Realtime Motion PlanningProceedings of the 50th Annual International Symposium on Computer Architecture10.1145/3579371.3589092(1-17)Online publication date: 17-Jun-2023
  • (2022)Treebeard: An Optimizing Compiler for Decision Tree Based ML InferenceProceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture10.1109/MICRO56248.2022.00043(494-511)Online publication date: 1-Oct-2022
  • (2020)MEPHESTOProceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques10.1145/3410463.3414671(413-425)Online publication date: 30-Sep-2020
  • (2019)Efficient GPU tree walks for effective distributed n-body simulationsProceedings of the ACM International Conference on Supercomputing10.1145/3330345.3330348(24-34)Online publication date: 26-Jun-2019
  • (2019)Optimizing GPU programs by register demotionProceedings of the 24th Symposium on Principles and Practice of Parallel Programming10.1145/3293883.3297859(405-406)Online publication date: 16-Feb-2019
  • (2018)PokerACM Transactions on Architecture and Code Optimization10.1145/328085015:4(1-28)Online publication date: 16-Nov-2018
  • (2018)Poker: permutation-based SIMD execution of intensive tree search by path encodingProceedings of the 2018 International Symposium on Code Generation and Optimization10.1145/3168808(87-99)Online publication date: 24-Feb-2018

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media