DOI: 10.1145/2588768.2576789
Research article, GPGPU conference proceedings

APR: A Novel Parallel Repacking Algorithm for Efficient GPGPU Parallel Code Transformation

Published: 01 March 2014

ABSTRACT

General-purpose graphics processing units (GPGPUs) offer an opportunity to improve the performance of many applications. However, exploiting parallelism in current programming frameworks such as CUDA and OpenCL is unproductive: programmers must consider and handle many details of the GPGPU architecture, so trading off programmability against the efficiency of performance tuning is a challenge.

Parallel Repacking (PR) is a popular performance-tuning approach for GPGPU applications that improves performance by changing the parallel granularity. Existing code transformation algorithms based on PR increase productivity, but they cover too few code patterns and provide no effective detection of code errors. In this paper, we propose a novel parallel repacking algorithm (APR) that covers a wide range of code patterns and improves efficiency. We develop a code model that expresses a GPGPU program as a recursive statement sequence and introduces the concept of a singular statement. Building on this model, APR applies distinct transformation rules to singular and non-singular statements to generate the repacked code, performing a recursive transformation when it encounters a branching or loop singular statement. Singular statements also unify the transformation of barriers and data sharing, and enable APR to detect barrier errors. Experimental results from a prototype show that our proposed APR covers more code patterns than existing solutions such as the automatic thread coarsening in Crest, and that code repacked with APR achieves speedups of up to 3.28X, in some cases exceeding manually tuned repacked code.



Published in:
GPGPU-7: Proceedings of Workshop on General Purpose Processing Using GPUs
March 2014, 110 pages
ISBN: 9781450327664
DOI: 10.1145/2588768
Copyright © 2014 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Qualifiers: research-article, refereed limited

Acceptance Rates
GPGPU-7 paper acceptance rate: 12 of 27 submissions, 44%. Overall acceptance rate: 57 of 129 submissions, 44%.
