research-article

Morph algorithms on GPUs

Authors:
Rupesh Nasre

The University of Texas, Austin, USA

The University of Texas, Austin, USA
View Profile

,
Martin Burtscher

Texas State University, San Marcos, USA

Texas State University, San Marcos, USA
View Profile

,
Keshav Pingali

The University of Texas, Austin, USA

The University of Texas, Austin, USA
View Profile

PPoPP '13: Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programmingFebruary 2013Pages 147–156https://doi.org/10.1145/2442516.2442531

Published:23 February 2013Publication History

PPoPP '13: Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming

Pages 147–156

ABSTRACT

There is growing interest in using GPUs to accelerate graph algorithms such as breadth-first search, computing page-ranks, and finding shortest paths. However, these algorithms do not modify the graph structure, so their implementation is relatively easy compared to general graph algorithms like mesh generation and refinement, which morph the underlying graph in non-trivial ways by adding and removing nodes and edges. We know relatively little about how to implement morph algorithms efficiently on GPUs.

In this paper, we present and study four morph algorithms: (i) a computational geometry algorithm called Delaunay Mesh Refinement (DMR), (ii) an approximate SAT solver called Survey Propagation (SP), (iii) a compiler analysis called Points-To Analysis (PTA), and (iv) Boruvka's Minimum Spanning Tree algorithm (MST). Each of these algorithms modifies the graph data structure in different ways and thus poses interesting challenges.

We overcome these challenges using algorithmic and GPU-specific optimizations. We propose efficient techniques to perform concurrent subgraph addition, subgraph deletion, conflict detection and several optimizations to improve the scalability of morph algorithms. For an input mesh with 10 million triangles, our DMR code achieves an 80x speedup over the highly optimized serial Triangle program and a 2.3x speedup over a multicore implementation running with 48 threads. Our SP code is 3x faster than a multicore implementation with 48 threads on an input with 1 million literals. The PTA implementation is able to analyze six SPEC 2000 benchmark programs in just 74 milliseconds, achieving a geometric mean speedup of 9.3x over a 48-thread multicore version. Our MST code is slower than a multicore version with 48 threads for sparse graphs but significantly faster for denser graphs.

This work provides several insights into how other morph algorithms can be efficiently implemented on GPUs.

References

L. O. Andersen. Program Analysis and Specialization for the C Programming Language. PhD thesis, DIKU, University of Copenhagen, May 1994. (DIKU report 94/19).Google Scholar
David A. Bader and Kamesh Madduri. Designing multithreaded algorithms for breadth-first search and st-connectivity on the cray mta-2. In Proceedings of the 2006 International Conference on Parallel Processing, ICPP '06, pages 523--530, Washington, DC, USA, 2006. IEEE Computer Society. Google ScholarDigital Library
J. Barnat, P. Bauch, L. Brim, and M.Ceska. Computing Strongly Connected Components in Parallel on CUDA. In Proceedings of the 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS'11), pages 541--552. IEEE Computer Society, 2011. Google ScholarDigital Library
A. Braunstein, M. Mèzard, and R. Zecchina. Survey propagation: An algorithm for satisfiability. Random Structures and Algorithms, 27(2):201--226, 2005. Google ScholarDigital Library
Martin Burtscher and Keshav Pingali. An efficient CUDA implementation of the tree-based barnes hut n-body algorithm. In GPU Computing Gems Emerald Edition, pages 75--92. Morgan Kaufmann, 2011.Google ScholarCross Ref
Andrey N. Chernikov and Nikos P. Chrisochoides. Three-dimensional delaunay refinement for multi-core processors. In Proceedings of the 22nd annual international conference on Supercomputing, ICS '08, pages 214--224, New York, NY, USA, 2008. ACM. Google ScholarDigital Library
L. Paul Chew. Guaranteed-quality mesh generation for curved surfaces. In Proc. Symp. on Computational Geometry (SCG), 1993. Google ScholarDigital Library
Panagiotis A. Foteinos, Andrey N. Chernikov, and Nikos P. Chrisochoides. Fully generalized two-dimensional constrained delaunay mesh refinement. SIAM J. Sci. Comput., 32(5):2659--2686, 2010. Google ScholarDigital Library
Abdullah Gharaibeh, Lauro Beltrão Costa, Elizeu Santos-Neto, and Matei Ripeanu. A yoke of oxen and a thousand chickens for heavy lifting graph processing. In The 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, 2012. Google ScholarDigital Library
Pawan Harish and P. J. Narayanan. Accelerating large graph algorithms on the gpu using cuda. In HiPC'07: Proceedings of the 14th international conference on High performance computing, pages 197--208, Berlin, Heidelberg, 2007. Springer-Verlag. Google ScholarDigital Library
Sungpack Hong, Sang Kyun Kim, Tayo Oguntebi, and Kunle Olukotun. Accelerating cuda graph algorithms at maximum warp. In Proceedings of the 16th ACM symposium on Principles and practice of parallel programming, PPoPP '11, pages 267--276, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun. Efficient parallel graph exploration on multi-core cpu and gpu. In 20th International Conference on Parallel Architectures and Compilation Techniques, PACT'11, 2011. Google ScholarDigital Library
Mark T. Jones and Paul E. Plassmann. A parallel graph coloring heuristic. SIAM J. Sci. Comput., 14(3):654--669, May 1993. Google ScholarDigital Library
Andriy Kot, Andrey Chernikov, and Nikos Chrisochoides. Effective out-of-core parallel delaunay mesh refinement using off-the-shelf software. In Proceedings of the 20th international conference on Parallel and distributed processing, IPDPS'06, pages 125--125, Washington, DC, USA, 2006. IEEE Computer Society. Google ScholarDigital Library
Milind Kulkarni, Martin Burtscher, Rajasekhar Inkulu, Keshav Pingali, and Calin Casçaval. How much parallelism is there in irregular applications? In Proc. Symp. on Principles and practice of parallel programming (PPoPP), pages 3--14, New York, NY, USA, 2009. Google ScholarDigital Library
Milind Kulkarni, Keshav Pingali, Bruce Walter, Ganesh Ramanarayanan, Kavita Bala, and L. Paul Chew. Optimistic parallelism requires abstractions. SIGPLAN Not. (Proceedings of PLDI), 42(6):211--222, 2007. Google ScholarDigital Library
Lijuan Luo, Martin Wong, and Wen-mei Hwu. An effective gpu implementation of breadth-first search. In Proceedings of the 47th Design Automation Conference, DAC '10, pages 52--55, New York, NY, USA, 2010. ACM. Google ScholarDigital Library
Mario Mendez-Lojo, Martin Burtscher, and Keshav Pingali. A gpu implementation of inclusion-based points-to analysis. In Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, PPoPP '12, pages 107--116, New York, NY, USA, 2012. ACM. Google ScholarDigital Library
Mario Méndez-Lojo, Donald Nguyen, Dimitrios Prountzos, Xin Sui, M. Amber Hassaan, Milind Kulkarni, Martin Burtscher, and Keshav Pingali. Structure-driven optimizations for amorphous data-parallel programs. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practiceof Parallel Programming (PPoPP'10), pages 3--14, January 2010. Google ScholarDigital Library
Duane G. Merrill, Michael Garland, and Andrew S. Grimshaw. Scalable gpu graph traversal. In 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP'12, 2012. Google ScholarDigital Library
Stephan Mertens, Marc Mézard, and Riccardo Zecchina. Threshold values of random k-sat from the cavity method. Random Struct. Algorithms, 28(3):340--373, May 2006. Google ScholarDigital Library
Cristobal A. Navarro, Nancy Hitschfeld-Kahler, and Eliana Scheihing. A parallel gpu-based algorithm for delaunay edge-flips. In The 27th European Workshop on Computational Geometry, EuroCG '11, 2011.Google Scholar
Sadegh Nobari, Thanh-Tung Cao, Panagiotis Karras, and Stéphane Bressan. Scalable parallel minimum spanning forest computation. In Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, PPoPP '12, pages 205--214, New York, NY, USA, 2012. ACM. Google ScholarDigital Library
Keshav Pingali, Donald Nguyen, Milind Kulkarni, Martin Burtscher, M. Amber Hassaan, Rashid Kaleem, Tsung-Hsien Lee, Andrew Lenharth, Roman Manevich, Mario Méndez-Lojo, Dimitrios Prountzos, and Xin Sui. The tao of parallelism in algorithms. In Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation, PLDI '11, pages 12--25, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
Tarun Prabhu, Shreyas Ramalingam, Matthew Might, and Mary Hall. Eigencfa: accelerating flow analysis with gpus. In Proceedings of the 38th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, POPL '11, pages 511--522, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
Sandeep Putta and Rupesh Nasre. Parallel replication-based points-to analysis. In Proceedings of the 21st international conference on Compiler Construction, CC'12, pages 61--80, Berlin, Heidelberg, 2012. Springer-Verlag. Google ScholarDigital Library
Meng Qi, Thanh-Tung Cao, and Tiow-Seng Tan. Computing 2d constrained delaunay triangulation using the gpu. In Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, I3D '12, pages 39--46, New York, NY, USA, 2012. ACM. Google ScholarDigital Library
Jonathan Richard Shewchuk. Triangle: Engineering a 2d quality mesh generator and Delaunay triangulator. In Applied Computational Geometry: Towards Geometric Engineering, volume 1148 of Lecture Notes in Computer Science, pages 203--222. Springer-Verlag, 1996. Google ScholarDigital Library
Jeff A. Stuart and John D. Owens. Efficient synchronization primitives for gpus. CoRR, abs/1110.4623, 2011.Google Scholar
Vibhav Vineet, Pawan Harish, Suryakant Patidar, and P. J. Narayanan. Fast minimum spanning tree for large graphs on the gpu. In Proceedings of the Conference on High Performance Graphics 2009, HPG '09, pages 167--171, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
Shucai Xiao and Wu chun Feng. Inter-block gpu communication via fast barrier synchronization. In IPDPS, pages 1--12. IEEE, 2010.Google Scholar
Andy Yoo, Edmond Chow, Keith Henderson, William McLendon, Bruce Hendrickson, and Umit Catalyurek. A scalable distributed parallel breadth-first search algorithm on bluegene/l. In Proceedings of the 2005 ACM/IEEE conference on Supercomputing, SC '05, pages 25--, Washington, DC, USA, 2005. IEEE Computer Society. Google ScholarDigital Library
Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, Kai Tian, and Xipeng Shen. On-the-fly elimination of dynamic irregularities for gpu computing. In Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems, ASPLOS '11, pages 369--380, New York, NY, USA, 2011. ACM. Google ScholarDigital Library

Index Terms

Morph algorithms on GPUs
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Parallel programming languages

Recommendations

Morph algorithms on GPUs
PPoPP '13

There is growing interest in using GPUs to accelerate graph algorithms such as breadth-first search, computing page-ranks, and finding shortest paths. However, these algorithms do not modify the graph structure, so their implementation is relatively ...
Read More
A GPU implementation of inclusion-based points-to analysis
PPOPP '12

Graphics Processing Units (GPUs) have emerged as powerful accelerators for many regular algorithms that operate on dense arrays and matrices. In contrast, we know relatively little about using GPUs to accelerate highly irregular algorithms that operate ...
Read More
A GPU implementation of inclusion-based points-to analysis
PPoPP '12: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming

Graphics Processing Units (GPUs) have emerged as powerful accelerators for many regular algorithms that operate on dense arrays and matrices. In contrast, we know relatively little about using GPUs to accelerate highly irregular algorithms that operate ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
PPoPP '13: Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
February 2013
332 pages
ISBN:9781450319225
DOI:10.1145/2442516
General Chairs:
Alex Nicolau
University of California, Irvine, USA
,
Xiaowei Shen
IBM Research, China
,
Program Chairs:
Saman Amarasinghe
Massachusetts Institute of Technology, USA
,
Richard Vuduc
Georgia Institute of Technology, USA
ACM SIGPLAN Notices Volume 48, Issue 8
PPoPP '13
August 2013
309 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2517327
Issue’s Table of Contents
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 23 February 2013
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
boruvka
cuda
delaunay mesh refinement
gpu
graph algorithms
irregular programs
minimum spanning tree
morph algorithms
points-to analysis
survey propagation
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate230of1,014submissions,23%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 72
  Total Citations
  View Citations
- 904
  Total Downloads
- Downloads (Last 12 months)39
- Downloads (Last 6 weeks)5
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Morph algorithms on GPUs

PPoPP '13: Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming

ABSTRACT

References

Cited By

Index Terms

Recommendations

Morph algorithms on GPUs

A GPU implementation of inclusion-based points-to analysis

A GPU implementation of inclusion-based points-to analysis