ABSTRACT
General-purpose computing on graphics processing units (GPGPU) offers an opportunity to improve the performance of many applications. However, exploiting this parallelism with current programming frameworks such as CUDA and OpenCL yields low productivity: programmers must reason about many GPGPU architecture details, so it is a challenge to trade off programmability against the efficiency of performance tuning.
Parallel Repacking (PR) is a popular performance-tuning approach for GPGPU applications that improves performance by changing the parallel granularity. Existing PR-based code transformation algorithms increase productivity, but they cover only a limited set of code patterns and provide no effective code-error detection. In this paper, we propose a novel parallel repacking algorithm (APR) that covers a wide range of code patterns and improves efficiency. We develop a code model that expresses a GPGPU program as a recursive statement sequence and introduces the concept of a singular statement. Building on this model, APR applies distinct transformation rules to singular and non-singular statements to generate the repacked code, performing a recursive transformation whenever it encounters a branching or loop singular statement. Singular statements also unify the transformation of barriers and data sharing, and they enable APR to detect barrier errors. Experimental results on a prototype show that our proposed APR covers more code patterns than existing solutions such as the automatic thread coarsening in Crest, and that code repacked with APR achieves effective performance gains of up to 3.28X speedup, in some cases even higher than manually tuned repacked code.