ABSTRACT
Autotuning is an important method for automatically exploring code optimizations. It may target low-level code optimizations, such as memory blocking, loop unrolling, or memory prefetching, as well as high-level optimizations, such as placing computation kernels on suitable hardware devices or optimizing memory transfers between nodes or between accelerators and main memory.
In this paper, we introduce an autotuning method that extends state-of-the-art low-level tuning of OpenCL or CUDA kernels towards more complex optimizations. More precisely, we introduce the Kernel Tuning Toolkit (KTT), which implements inter-kernel global optimizations, allowing users to tune parameters that affect multiple kernels or the host code. We demonstrate on practical examples that global kernel optimizations let us explore tuning options that are not available when kernels are tuned separately. Moreover, our tuning strategies can take into account numerical accuracy across multiple kernel invocations and search for implementations within specific numerical error bounds.
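The idea of tuning parameters shared by several kernels under an error bound can be illustrated with a minimal sketch. It does not use KTT's API: the two "kernels" are plain Python functions, and the names `scale_kernel`, `sum_kernel`, `tune`, `block`, and `single` are hypothetical. The exhaustive search is likewise an assumption made for illustration, not necessarily the strategy the paper uses.

```python
import itertools
import struct
import time


def to_f32(x):
    """Round a Python float to single precision (stands in for a float32 kernel)."""
    return struct.unpack("f", struct.pack("f", x))[0]


def scale_kernel(data, block, single):
    """First 'kernel': scale each element, processed in chunks of `block`."""
    out = []
    for start in range(0, len(data), block):
        for x in data[start:start + block]:
            y = x * 0.1
            out.append(to_f32(y) if single else y)
    return out


def sum_kernel(data, block, single):
    """Second 'kernel': blocked reduction driven by the same shared parameters."""
    total = 0.0
    for start in range(0, len(data), block):
        partial = sum(data[start:start + block])
        total = to_f32(total + partial) if single else total + partial
    return total


def tune(data, space, max_error):
    """Search the joint space exhaustively; keep the fastest valid configuration."""
    # Reference result: double precision, a single block covering all data.
    reference = sum_kernel(scale_kernel(data, len(data), False), len(data), False)
    best = None
    for block, single in itertools.product(space["block"], space["single"]):
        start = time.perf_counter()
        # Both kernels receive the same parameter values, so they are tuned jointly.
        result = sum_kernel(scale_kernel(data, block, single), block, single)
        elapsed = time.perf_counter() - start
        error = abs(result - reference)
        if error > max_error:
            continue  # reject configurations outside the error bound
        if best is None or elapsed < best["time"]:
            best = {"block": block, "single": single, "time": elapsed, "error": error}
    return best


if __name__ == "__main__":
    data = [float(i) for i in range(1000)]
    space = {"block": [16, 64, 256], "single": [False, True]}
    print(tune(data, space, max_error=1e-6))
```

Because the error is measured on the output of the whole two-kernel pipeline, a configuration is accepted or rejected based on accuracy accumulated across both invocations, which tuning each kernel separately could not capture.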