Abstract
Automatic performance tuning (autotuning) is an increasingly critical technique for achieving high, portable performance in exascale applications. However, constructing an autotuner from scratch remains challenging, even for domain experts. In this work, we propose PAK, a performance tuning and knowledge management suite that helps developers rapidly build autotuners. To accommodate existing autotuning techniques, we present an autotuning protocol composed of five components: an extractor, a producer, an optimizer, an evaluator, and a learner. To achieve modularity and reusability, we define programming interfaces for each protocol component as the fundamental infrastructure, which also provides a customizable mechanism for deploying knowledge mining on the performance database. We demonstrate PAK's usability on two important computational kernels: stencil computation and sparse matrix-vector multiplication (SpMV). Autotuners built on PAK achieve performance comparable to that of traditional autotuners with higher productivity, requiring only a few tens of lines of code written against our autotuning protocol.
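The five-component protocol described above can be sketched as a set of pluggable interfaces driven by a tuning loop. This is a minimal illustrative sketch, not PAK's actual API: all class names, method signatures, and the synthetic cost model are assumptions made for the example.

```python
# Hypothetical sketch of the five-component autotuning protocol
# (extractor, producer, optimizer, evaluator, learner).
# Names and signatures are illustrative, not PAK's real interfaces.
from abc import ABC, abstractmethod


class Extractor(ABC):
    """Extracts tunable parameters and their value ranges from a kernel."""
    @abstractmethod
    def extract(self): ...


class Producer(ABC):
    """Produces a concrete code variant for one parameter configuration."""
    @abstractmethod
    def produce(self, config): ...


class Optimizer(ABC):
    """Proposes the next configuration to try in the search space."""
    @abstractmethod
    def next_config(self, space, history): ...


class Evaluator(ABC):
    """Runs a variant and measures its cost (e.g., execution time)."""
    @abstractmethod
    def evaluate(self, variant): ...


class Learner(ABC):
    """Records results in a performance database for knowledge mining."""
    @abstractmethod
    def record(self, config, cost): ...


class Autotuner:
    """Drives the loop: extract -> (optimize -> produce -> evaluate -> learn)*."""
    def __init__(self, extractor, producer, optimizer, evaluator, learner):
        self.parts = (extractor, producer, optimizer, evaluator, learner)

    def tune(self, trials):
        ext, prod, opt, ev, learn = self.parts
        space = ext.extract()
        history, best = [], (None, float("inf"))
        for _ in range(trials):
            cfg = opt.next_config(space, history)
            cost = ev.evaluate(prod.produce(cfg))
            learn.record(cfg, cost)
            history.append((cfg, cost))
            if cost < best[1]:
                best = (cfg, cost)
        return best


# Toy instantiation: exhaustively sweep a tile-size parameter against a
# synthetic cost function standing in for real kernel measurements.
class TileExtractor(Extractor):
    def extract(self):
        return {"tile": [8, 16, 32, 64, 128]}


class IdentityProducer(Producer):
    def produce(self, config):
        return config  # a real producer would generate a code variant


class SweepOptimizer(Optimizer):
    def next_config(self, space, history):
        return {"tile": space["tile"][len(history) % len(space["tile"])]}


class SyntheticEvaluator(Evaluator):
    def evaluate(self, variant):
        return abs(variant["tile"] - 32)  # pretend tile=32 is optimal


class DictLearner(Learner):
    def __init__(self):
        self.db = {}

    def record(self, config, cost):
        self.db[config["tile"]] = cost


tuner = Autotuner(TileExtractor(), IdentityProducer(), SweepOptimizer(),
                  SyntheticEvaluator(), DictLearner())
best_cfg, best_cost = tuner.tune(trials=5)
print(best_cfg, best_cost)  # -> {'tile': 32} 0
```

Because each component is an interface, an existing search strategy (e.g., a model-driven or machine-learning optimizer) can be swapped in by replacing one class, which is the modularity and reusability property the protocol is designed around.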
Index Terms
- An Autotuning Protocol to Rapidly Build Autotuners