research-article

An Autotuning Protocol to Rapidly Build Autotuners

Published: 04 January 2019

Abstract

Automatic performance tuning (autotuning) is an increasingly critical technique for achieving high, portable performance in exascale applications. However, building an autotuner from scratch remains challenging, even for domain experts. In this work, we propose a performance tuning and knowledge management suite (PAK) that helps developers rapidly build autotuners. To accommodate existing autotuning techniques, we present an autotuning protocol composed of five components: an extractor, a producer, an optimizer, an evaluator, and a learner. To achieve modularity and reusability, we also define programming interfaces for each protocol component as the fundamental infrastructure, which provides a customizable mechanism for deploying knowledge mining in the performance database. We demonstrate PAK’s usability on two important computational kernels: stencil computation and sparse matrix-vector multiplication (SpMV). Autotuners built on PAK achieve performance comparable to that of traditional autotuners with higher productivity, requiring only a few tens of lines of code written against our autotuning protocol.
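The five-component protocol described above (extractor, producer, optimizer, evaluator, learner) can be sketched as a set of interfaces plus a driver loop. This is a minimal illustration of the idea, not PAK’s actual API; every class and method name below is an assumption made for the example.

```python
from abc import ABC, abstractmethod

# Hypothetical interfaces for the five protocol components.
# Names are illustrative only; they do not reflect PAK's real API.

class Extractor(ABC):
    @abstractmethod
    def extract(self, kernel):
        """Extract the tunable-parameter search space from a kernel."""

class Producer(ABC):
    @abstractmethod
    def produce(self, config):
        """Generate a code variant for one parameter configuration."""

class Optimizer(ABC):
    @abstractmethod
    def next_config(self, space, history):
        """Propose the next configuration to evaluate."""

class Evaluator(ABC):
    @abstractmethod
    def evaluate(self, variant):
        """Run a code variant and return its measured cost (lower is better)."""

class Learner(ABC):
    @abstractmethod
    def update(self, config, score):
        """Record a (configuration, performance) pair in the knowledge base."""

def autotune(kernel, ex, pr, op, ev, le, trials=10):
    """Drive the extractor -> producer -> optimizer -> evaluator -> learner loop."""
    space = ex.extract(kernel)
    history = []
    for _ in range(trials):
        cfg = op.next_config(space, history)  # pick a point in the search space
        variant = pr.produce(cfg)             # generate code for that point
        score = ev.evaluate(variant)          # measure its performance
        le.update(cfg, score)                 # persist to the performance database
        history.append((cfg, score))
    return min(history, key=lambda t: t[1])   # best (config, cost) seen so far
```

Because each component sits behind its own interface, an exhaustive-sweep optimizer can be swapped for a model-driven one, or a flat-file learner for a database-backed one, without touching the driver loop; this is the modularity and reusability the protocol is after.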



• Published in

  ACM Transactions on Parallel Computing, Volume 5, Issue 2
  June 2018
  113 pages
  ISSN: 2329-4949
  EISSN: 2329-4957
  DOI: 10.1145/3299751
  • Editor: David Bader

        Copyright © 2019 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 4 January 2019
        • Accepted: 1 June 2018
        • Revised: 1 January 2018
        • Received: 1 January 2017
Published in TOPC Volume 5, Issue 2


        Qualifiers

        • research-article
        • Research
        • Refereed
