Autotuning of OpenCL Kernels with Global Optimizations

Authors:
Jiří Filipovič (Masaryk University, University of Vienna)
Filip Petrovič (Masaryk University)
Siegfried Benkner (University of Vienna)
ANDARE '17: Proceedings of the 1st Workshop on AutotuniNg and aDaptivity AppRoaches for Energy efficient HPC Systems, September 2017, Article No. 2, Pages 1–6
https://doi.org/10.1145/3152821.3152877

Published: 09 September 2017

ABSTRACT

Autotuning is an important method for automatically exploring code optimizations. It may target low-level code optimizations, such as memory blocking, loop unrolling, or memory prefetching, as well as high-level optimizations, such as placing computation kernels on appropriate hardware devices or optimizing memory transfers between nodes or between accelerators and main memory.

In this paper, we introduce an autotuning method that extends state-of-the-art low-level tuning of OpenCL or CUDA kernels towards more complex optimizations. More precisely, we introduce the Kernel Tuning Toolkit (KTT), which implements inter-kernel global optimizations and so allows tuning of parameters that affect multiple kernels or the host code. We demonstrate on practical examples that global kernel optimizations let us explore tuning options that are unavailable when kernels are tuned separately. Moreover, our tuning strategies can take numerical accuracy across multiple kernel invocations into account and search for implementations within specified numerical error bounds.
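The difference between tuning kernels separately and tuning them globally can be illustrated with a toy model. The sketch below is not KTT's API and does not reproduce the paper's experiments: the two kernels (`producer`, `consumer`), the shared `layout` parameter, the work-group sizes, and all runtimes are hypothetical values chosen only to show the effect of a parameter that both kernels must agree on.

```python
from itertools import product

# Synthetic runtimes in milliseconds (made-up numbers, not measurements).
# Keys are (layout, work_group_size); "layout" is the parameter shared
# by both kernels, because the consumer reads what the producer writes.
PRODUCER_MS = {
    ("aos", 64): 4.0, ("aos", 128): 3.0,
    ("soa", 64): 6.0, ("soa", 128): 5.0,
}
CONSUMER_MS = {
    ("aos", 64): 9.0, ("aos", 128): 8.0,
    ("soa", 64): 3.0, ("soa", 128): 2.0,
}
LAYOUTS = ("aos", "soa")
WORK_GROUP_SIZES = (64, 128)


def tune_separately():
    """Pick the fastest configuration of each kernel in isolation."""
    best_producer = min(PRODUCER_MS, key=PRODUCER_MS.get)
    best_consumer = min(CONSUMER_MS, key=CONSUMER_MS.get)
    return best_producer, best_consumer


def tune_globally():
    """Exhaustively search the joint space with one layout for both kernels."""
    best_config, best_time = None, float("inf")
    for layout, wg_p, wg_c in product(LAYOUTS, WORK_GROUP_SIZES, WORK_GROUP_SIZES):
        total = PRODUCER_MS[(layout, wg_p)] + CONSUMER_MS[(layout, wg_c)]
        if total < best_time:
            best_config, best_time = (layout, wg_p, wg_c), total
    return best_config, best_time


if __name__ == "__main__":
    print("separate:", tune_separately())
    print("global:  ", tune_globally())
```

In this model, separate tuning selects a different layout for each kernel, which cannot be combined without an extra conversion step, whereas the joint search finds the best configuration in which both kernels use the same layout. A real tuner would replace the lookup tables with measured kernel runs and, for large spaces, the exhaustive loop with a smarter search strategy.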

Published in

ANDARE '17: Proceedings of the 1st Workshop on AutotuniNg and aDaptivity AppRoaches for Energy efficient HPC Systems
September 2017
35 pages
ISBN:9781450353632
DOI:10.1145/3152821

Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- research-article
- Research
- Refereed limited

Acceptance Rates
ANDARE '17 Paper Acceptance Rate: 3 of 4 submissions, 75%
Overall Acceptance Rate: 3 of 4 submissions, 75%