ABSTRACT
There is growing interest in using GPUs to accelerate graph algorithms such as breadth-first search, computing page-ranks, and finding shortest paths. However, these algorithms do not modify the graph structure, so their implementation is relatively easy compared to general graph algorithms like mesh generation and refinement, which morph the underlying graph in non-trivial ways by adding and removing nodes and edges. We know relatively little about how to implement morph algorithms efficiently on GPUs.
In this paper, we present and study four morph algorithms: (i) a computational geometry algorithm called Delaunay Mesh Refinement (DMR), (ii) an approximate SAT solver called Survey Propagation (SP), (iii) a compiler analysis called Points-To Analysis (PTA), and (iv) Boruvka's Minimum Spanning Tree algorithm (MST). Each of these algorithms modifies the graph data structure in different ways and thus poses interesting challenges.
We overcome these challenges using algorithmic and GPU-specific optimizations. We propose efficient techniques to perform concurrent subgraph addition, subgraph deletion, conflict detection and several optimizations to improve the scalability of morph algorithms. For an input mesh with 10 million triangles, our DMR code achieves an 80x speedup over the highly optimized serial Triangle program and a 2.3x speedup over a multicore implementation running with 48 threads. Our SP code is 3x faster than a multicore implementation with 48 threads on an input with 1 million literals. The PTA implementation is able to analyze six SPEC 2000 benchmark programs in just 74 milliseconds, achieving a geometric mean speedup of 9.3x over a 48-thread multicore version. Our MST code is slower than a multicore version with 48 threads for sparse graphs but significantly faster for denser graphs.
This work provides several insights into how other morph algorithms can be efficiently implemented on GPUs.
- L. O. Andersen. Program Analysis and Specialization for the C Programming Language. PhD thesis, DIKU, University of Copenhagen, May 1994. (DIKU report 94/19).Google Scholar
- David A. Bader and Kamesh Madduri. Designing multithreaded algorithms for breadth-first search and st-connectivity on the cray mta-2. In Proceedings of the 2006 International Conference on Parallel Processing, ICPP '06, pages 523--530, Washington, DC, USA, 2006. IEEE Computer Society. Google ScholarDigital Library
- J. Barnat, P. Bauch, L. Brim, and M.Ceska. Computing Strongly Connected Components in Parallel on CUDA. In Proceedings of the 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS'11), pages 541--552. IEEE Computer Society, 2011. Google ScholarDigital Library
- A. Braunstein, M. Mèzard, and R. Zecchina. Survey propagation: An algorithm for satisfiability. Random Structures and Algorithms, 27(2):201--226, 2005. Google ScholarDigital Library
- Martin Burtscher and Keshav Pingali. An efficient CUDA implementation of the tree-based barnes hut n-body algorithm. In GPU Computing Gems Emerald Edition, pages 75--92. Morgan Kaufmann, 2011.Google ScholarCross Ref
- Andrey N. Chernikov and Nikos P. Chrisochoides. Three-dimensional delaunay refinement for multi-core processors. In Proceedings of the 22nd annual international conference on Supercomputing, ICS '08, pages 214--224, New York, NY, USA, 2008. ACM. Google ScholarDigital Library
- L. Paul Chew. Guaranteed-quality mesh generation for curved surfaces. In Proc. Symp. on Computational Geometry (SCG), 1993. Google ScholarDigital Library
- Panagiotis A. Foteinos, Andrey N. Chernikov, and Nikos P. Chrisochoides. Fully generalized two-dimensional constrained delaunay mesh refinement. SIAM J. Sci. Comput., 32(5):2659--2686, 2010. Google ScholarDigital Library
- Abdullah Gharaibeh, Lauro Beltrão Costa, Elizeu Santos-Neto, and Matei Ripeanu. A yoke of oxen and a thousand chickens for heavy lifting graph processing. In The 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, 2012. Google ScholarDigital Library
- Pawan Harish and P. J. Narayanan. Accelerating large graph algorithms on the gpu using cuda. In HiPC'07: Proceedings of the 14th international conference on High performance computing, pages 197--208, Berlin, Heidelberg, 2007. Springer-Verlag. Google ScholarDigital Library
- Sungpack Hong, Sang Kyun Kim, Tayo Oguntebi, and Kunle Olukotun. Accelerating cuda graph algorithms at maximum warp. In Proceedings of the 16th ACM symposium on Principles and practice of parallel programming, PPoPP '11, pages 267--276, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
- Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun. Efficient parallel graph exploration on multi-core cpu and gpu. In 20th International Conference on Parallel Architectures and Compilation Techniques, PACT'11, 2011. Google ScholarDigital Library
- Mark T. Jones and Paul E. Plassmann. A parallel graph coloring heuristic. SIAM J. Sci. Comput., 14(3):654--669, May 1993. Google ScholarDigital Library
- Andriy Kot, Andrey Chernikov, and Nikos Chrisochoides. Effective out-of-core parallel delaunay mesh refinement using off-the-shelf software. In Proceedings of the 20th international conference on Parallel and distributed processing, IPDPS'06, pages 125--125, Washington, DC, USA, 2006. IEEE Computer Society. Google ScholarDigital Library
- Milind Kulkarni, Martin Burtscher, Rajasekhar Inkulu, Keshav Pingali, and Calin Casçaval. How much parallelism is there in irregular applications? In Proc. Symp. on Principles and practice of parallel programming (PPoPP), pages 3--14, New York, NY, USA, 2009. Google ScholarDigital Library
- Milind Kulkarni, Keshav Pingali, Bruce Walter, Ganesh Ramanarayanan, Kavita Bala, and L. Paul Chew. Optimistic parallelism requires abstractions. SIGPLAN Not. (Proceedings of PLDI), 42(6):211--222, 2007. Google ScholarDigital Library
- Lijuan Luo, Martin Wong, and Wen-mei Hwu. An effective gpu implementation of breadth-first search. In Proceedings of the 47th Design Automation Conference, DAC '10, pages 52--55, New York, NY, USA, 2010. ACM. Google ScholarDigital Library
- Mario Mendez-Lojo, Martin Burtscher, and Keshav Pingali. A gpu implementation of inclusion-based points-to analysis. In Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, PPoPP '12, pages 107--116, New York, NY, USA, 2012. ACM. Google ScholarDigital Library
- Mario Méndez-Lojo, Donald Nguyen, Dimitrios Prountzos, Xin Sui, M. Amber Hassaan, Milind Kulkarni, Martin Burtscher, and Keshav Pingali. Structure-driven optimizations for amorphous data-parallel programs. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practiceof Parallel Programming (PPoPP'10), pages 3--14, January 2010. Google ScholarDigital Library
- Duane G. Merrill, Michael Garland, and Andrew S. Grimshaw. Scalable gpu graph traversal. In 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP'12, 2012. Google ScholarDigital Library
- Stephan Mertens, Marc Mézard, and Riccardo Zecchina. Threshold values of random k-sat from the cavity method. Random Struct. Algorithms, 28(3):340--373, May 2006. Google ScholarDigital Library
- Cristobal A. Navarro, Nancy Hitschfeld-Kahler, and Eliana Scheihing. A parallel gpu-based algorithm for delaunay edge-flips. In The 27th European Workshop on Computational Geometry, EuroCG '11, 2011.Google Scholar
- Sadegh Nobari, Thanh-Tung Cao, Panagiotis Karras, and Stéphane Bressan. Scalable parallel minimum spanning forest computation. In Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, PPoPP '12, pages 205--214, New York, NY, USA, 2012. ACM. Google ScholarDigital Library
- Keshav Pingali, Donald Nguyen, Milind Kulkarni, Martin Burtscher, M. Amber Hassaan, Rashid Kaleem, Tsung-Hsien Lee, Andrew Lenharth, Roman Manevich, Mario Méndez-Lojo, Dimitrios Prountzos, and Xin Sui. The tao of parallelism in algorithms. In Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation, PLDI '11, pages 12--25, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
- Tarun Prabhu, Shreyas Ramalingam, Matthew Might, and Mary Hall. Eigencfa: accelerating flow analysis with gpus. In Proceedings of the 38th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, POPL '11, pages 511--522, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
- Sandeep Putta and Rupesh Nasre. Parallel replication-based points-to analysis. In Proceedings of the 21st international conference on Compiler Construction, CC'12, pages 61--80, Berlin, Heidelberg, 2012. Springer-Verlag. Google ScholarDigital Library
- Meng Qi, Thanh-Tung Cao, and Tiow-Seng Tan. Computing 2d constrained delaunay triangulation using the gpu. In Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, I3D '12, pages 39--46, New York, NY, USA, 2012. ACM. Google ScholarDigital Library
- Jonathan Richard Shewchuk. Triangle: Engineering a 2d quality mesh generator and Delaunay triangulator. In Applied Computational Geometry: Towards Geometric Engineering, volume 1148 of Lecture Notes in Computer Science, pages 203--222. Springer-Verlag, 1996. Google ScholarDigital Library
- Jeff A. Stuart and John D. Owens. Efficient synchronization primitives for gpus. CoRR, abs/1110.4623, 2011.Google Scholar
- Vibhav Vineet, Pawan Harish, Suryakant Patidar, and P. J. Narayanan. Fast minimum spanning tree for large graphs on the gpu. In Proceedings of the Conference on High Performance Graphics 2009, HPG '09, pages 167--171, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
- Shucai Xiao and Wu chun Feng. Inter-block gpu communication via fast barrier synchronization. In IPDPS, pages 1--12. IEEE, 2010.Google Scholar
- Andy Yoo, Edmond Chow, Keith Henderson, William McLendon, Bruce Hendrickson, and Umit Catalyurek. A scalable distributed parallel breadth-first search algorithm on bluegene/l. In Proceedings of the 2005 ACM/IEEE conference on Supercomputing, SC '05, pages 25--, Washington, DC, USA, 2005. IEEE Computer Society. Google ScholarDigital Library
- Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, Kai Tian, and Xipeng Shen. On-the-fly elimination of dynamic irregularities for gpu computing. In Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems, ASPLOS '11, pages 369--380, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
Index Terms
- Morph algorithms on GPUs
Recommendations
Morph algorithms on GPUs
PPoPP '13There is growing interest in using GPUs to accelerate graph algorithms such as breadth-first search, computing page-ranks, and finding shortest paths. However, these algorithms do not modify the graph structure, so their implementation is relatively ...
A GPU implementation of inclusion-based points-to analysis
PPOPP '12Graphics Processing Units (GPUs) have emerged as powerful accelerators for many regular algorithms that operate on dense arrays and matrices. In contrast, we know relatively little about using GPUs to accelerate highly irregular algorithms that operate ...
A GPU implementation of inclusion-based points-to analysis
PPoPP '12: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel ProgrammingGraphics Processing Units (GPUs) have emerged as powerful accelerators for many regular algorithms that operate on dense arrays and matrices. In contrast, we know relatively little about using GPUs to accelerate highly irregular algorithms that operate ...
Comments