ABSTRACT
General-purpose computing on graphics processing units (GPGPU) offers an opportunity to improve the performance of many applications. However, exploiting this parallelism with current programming frameworks such as CUDA and OpenCL yields low productivity: programmers must reason about many GPGPU architecture details, so it is a challenge to trade off programmability against the efficiency of performance tuning.
Parallel Repacking (PR) is a popular performance-tuning approach for GPGPU applications that improves performance by changing the parallel granularity. Existing PR-based code transformation algorithms increase productivity, but they cover only a limited set of code patterns and provide no effective code-error detection. In this paper, we propose a novel parallel repacking algorithm (APR) that covers a wide range of code patterns and improves efficiency. We develop a code model that expresses a GPGPU program as a recursive statement sequence and introduces the concept of a singular statement. Building on this model, APR applies distinct transformation rules to singular and non-singular statements to generate the repacked code, performing a recursive transformation whenever it encounters a branching or loop singular statement. Singular statements also unify the transformation of barriers and data sharing, and they enable APR to detect barrier errors. Experimental results on a prototype show that our proposed APR covers more code patterns than existing solutions such as the automatic thread coarsening in Crest, and that code repacked with APR achieves effective performance gains of up to 3.28X speedup, in some cases even higher than manually tuned repacked code.