OpenMP to GPGPU: a compiler framework for automatic translation and optimization

Authors:

Rudolf Eigenmann

PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming

Pages 101 - 110

https://doi.org/10.1145/1504176.1504194

Published: 14 February 2009

Abstract

GPGPUs have recently emerged as powerful vehicles for general-purpose high-performance computing. Although a new Compute Unified Device Architecture (CUDA) programming model from NVIDIA offers improved programmability for general computing, programming GPGPUs remains complex and error-prone. This paper presents a compiler framework for automatic source-to-source translation of standard OpenMP applications into CUDA-based GPGPU applications. The goal of this translation is to further improve programmability and to make existing OpenMP applications amenable to execution on GPGPUs. We identify several key transformation techniques that enable efficient GPU global memory access and thereby achieve high performance. Experimental results from two important kernels (JACOBI and SPMUL) and two NAS OpenMP Parallel Benchmarks (EP and CG) show that the described translator and compile-time optimizations work well on both regular and irregular applications, leading to performance improvements of up to 50X over the unoptimized translation (up to 328X over serial execution).
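To make the translation target more concrete, the sketch below emulates in plain Python how the iterations of a parallel loop can be spread over a one-dimensional grid of GPU thread blocks, with each emulated thread handling one iteration. It is not the output of the translator described in the abstract, and it does not model the memory-access optimizations the paper reports. The function names, the block size of 128, and the doubling loop body are all hypothetical choices made for illustration.

```python
# Sketch only: emulates the loop-iteration-to-thread mapping used by
# CUDA-style kernels. All names and values here are hypothetical.

BLOCK_SIZE = 128  # threads per block (assumed value)


def scale_loop(data):
    """Reference version: the loop an OpenMP 'parallel for' would annotate."""
    out = [0.0] * len(data)
    for i in range(len(data)):  # in C this loop would carry: #pragma omp parallel for
        out[i] = 2.0 * data[i]
    return out


def scale_kernel(data, out, block_idx, thread_idx):
    """Work done by one emulated GPU thread: a single loop iteration."""
    i = block_idx * BLOCK_SIZE + thread_idx  # global thread index
    if i < len(data):  # guard: the last block may be only partly used
        out[i] = 2.0 * data[i]


def scale_grid(data):
    """Emulate a kernel launch by visiting every (block, thread) pair in turn."""
    out = [0.0] * len(data)
    num_blocks = (len(data) + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division
    for block_idx in range(num_blocks):
        for thread_idx in range(BLOCK_SIZE):
            scale_kernel(data, out, block_idx, thread_idx)
    return out


if __name__ == "__main__":
    values = [float(v) for v in range(1000)]
    # Compare the grid-style result with the plain loop.
    print(scale_grid(values) == scale_loop(values))
```

On a real GPU the (block, thread) pairs run concurrently rather than in nested loops. The bounds guard matters because the grid is rounded up to a whole number of blocks, so some threads in the last block may have no iteration to execute.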

Index Terms

OpenMP to GPGPU: a compiler framework for automatic translation and optimization
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Source code generation

Published In

PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming

February 2009

322 pages

ISBN:9781605583976

DOI:10.1145/1504176

General Chair: Daniel Reed (Microsoft Research, USA)
Program Chair: Vivek Sarkar (Rice University, USA)

ACM SIGPLAN Notices Volume 44, Issue 4
PPoPP '09
April 2009
294 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/1594835

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PPoPP09: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 14 - 18, 2009

Raleigh, NC, USA

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Bibliometrics

Total Citations: 349
Total Downloads: 4,314
Downloads (Last 12 months): 38
Downloads (Last 6 weeks): 1

Reflects downloads up to 08 Mar 2025

