DOI: 10.1145/3152821.3152877
research-article

Autotuning of OpenCL Kernels with Global Optimizations

Published: 09 September 2017

ABSTRACT

Autotuning is an important method for automatically exploring code optimizations. It may target low-level code optimizations, such as memory blocking, loop unrolling, or memory prefetching, as well as high-level optimizations, such as placing computation kernels on suitable hardware devices or optimizing memory transfers between nodes or between accelerators and main memory.
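As a rough illustration of what exploring low-level optimizations means in practice, the sketch below enumerates every combination of two tuning parameters and keeps the fastest one. The parameter names (`block_size`, `unroll_factor`) and the `synthetic_runtime` cost model are assumptions made for this example; a real autotuner would compile and time an actual kernel instead.

```python
from itertools import product

# Hypothetical tuning space; names and values are illustrative only.
TUNING_SPACE = {
    "block_size": [32, 64, 128, 256],
    "unroll_factor": [1, 2, 4],
}


def synthetic_runtime(config):
    """Stand-in for compiling and timing a kernel; NOT a real measurement."""
    # Toy cost model whose minimum is at block_size=128, unroll_factor=2.
    return (
        abs(config["block_size"] - 128) / 32
        + abs(config["unroll_factor"] - 2)
        + 1.0
    )


def exhaustive_search(space, measure):
    """Evaluate every configuration in the space and return the fastest."""
    names = list(space)
    best_config, best_time = None, float("inf")
    for values in product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        elapsed = measure(config)
        if elapsed < best_time:
            best_config, best_time = config, elapsed
    return best_config, best_time


best_config, best_time = exhaustive_search(TUNING_SPACE, synthetic_runtime)
print(best_config, best_time)
```

Exhaustive search is used here only because the space is tiny; larger spaces generally call for a sampling or model-guided search strategy.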

In this paper, we introduce an autotuning method that extends state-of-the-art low-level tuning of OpenCL or CUDA kernels towards more complex optimizations. More precisely, we introduce the Kernel Tuning Toolkit (KTT), which implements inter-kernel global optimizations, allowing the tuning of parameters that affect multiple kernels or the host code. We demonstrate on practical examples that global kernel optimizations let us explore tuning options that are not available when kernels are tuned separately. Moreover, our tuning strategies can take into account numerical accuracy across multiple kernel invocations and search for implementations within specific numerical error bounds.
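The idea of a parameter shared by several kernels, combined with an error bound, can be sketched as follows. This is not KTT's API: the parameter names (`precision`, `wg_a`, `wg_b`), the timing factors, and the per-call error values are all made up for illustration. The point is only that `precision` must be chosen once for both kernels, and that the accumulated error over both invocations decides which configurations are admissible.

```python
from itertools import product

# Hypothetical parameters; none of these names come from KTT.
SHARED = {"precision": ["half", "single", "double"]}  # affects both kernels
KERNEL_A = {"wg_a": [64, 128]}  # work-group size of kernel A
KERNEL_B = {"wg_b": [64, 128]}  # work-group size of kernel B

# Made-up cost and error tables standing in for real measurements.
TIME_FACTOR = {"half": 1.0, "single": 2.0, "double": 4.0}
ERROR_PER_CALL = {"half": 1e-3, "single": 1e-7, "double": 1e-16}


def run_pipeline(config):
    """Return (total_time, accumulated_error) for kernel A followed by B."""
    precision = config["precision"]
    time_a = TIME_FACTOR[precision] * (1.0 if config["wg_a"] == 128 else 1.5)
    time_b = TIME_FACTOR[precision] * (1.0 if config["wg_b"] == 64 else 1.25)
    error = 2 * ERROR_PER_CALL[precision]  # one contribution per kernel call
    return time_a + time_b, error


def tune(max_error):
    """Search the joint space; skip configurations exceeding max_error."""
    space = {**SHARED, **KERNEL_A, **KERNEL_B}
    names = list(space)
    best = None
    for values in product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        total_time, error = run_pipeline(config)
        if error > max_error:
            continue
        if best is None or total_time < best[1]:
            best = (config, total_time, error)
    return best


loose = tune(max_error=1e-2)   # low precision is acceptable
strict = tune(max_error=1e-6)  # low precision is rejected
print(loose)
print(strict)
```

Tuning each kernel in isolation could not make this trade-off, because the shared `precision` choice and the accumulated error are only visible when both kernels are evaluated together.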


  • Published in

    ANDARE '17: Proceedings of the 1st Workshop on AutotuniNg and aDaptivity AppRoaches for Energy efficient HPC Systems
    September 2017
    35 pages
    ISBN: 9781450353632
    DOI: 10.1145/3152821

    Copyright © 2017 ACM


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    ANDARE '17 Paper Acceptance Rate: 3 of 4 submissions, 75%. Overall Acceptance Rate: 3 of 4 submissions, 75%.
