ABSTRACT
When a scientific application is migrated to a new HPC system, its code usually has to be re-tuned to achieve the best possible performance. Auto-tuning techniques are a promising approach to supporting performance portability. Often, a large pool of possible implementation variants exists from which the most efficient variant needs to be selected. Ideally, auto-tuning approaches should perform this selection efficiently for a new HPC system and for new characteristics of the input data by applying suitable analytic models and program transformations.
In this article, we discuss a performance prediction methodology for multi-core cluster applications that supports this selection process by significantly reducing the effort compared to in-depth runtime tests. The proposed methodology extends an analytical performance prediction model for shared-memory applications introduced in our previous work. It is based on the execution-cache-memory (ECM) performance model combined with estimates of intra-node and inter-node communication costs, which we apply to numerical solution methods for ordinary differential equations (ODEs). In particular, we investigate whether accurate performance predictions can be obtained for hybrid MPI/OpenMP implementation variants in order to support variant selection. We demonstrate that our approach reliably selects a set of efficient variants for a given configuration (ODE system, solver, and hardware platform) and thus narrows down the search space for possible later empirical tuning.
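To illustrate the idea, the following is a minimal sketch (not the authors' actual tool) of how an ECM-style single-core estimate can be combined with a latency-bandwidth communication model to rank implementation variants. All names, parameter values, and the simple saturation/Hockney-style formulas are assumptions for illustration only; the ECM cycle counts per variant are taken as given inputs.

```python
# Hypothetical variant-ranking sketch: predicted time per time step =
# ECM-based compute estimate + latency/bandwidth halo-exchange estimate.
from dataclasses import dataclass

@dataclass
class Variant:
    name: str
    t_core_cy: float   # assumed ECM in-core time per iteration (cycles)
    t_mem_cy: float    # assumed ECM memory-transfer time per iteration (cycles)
    halo_bytes: float  # bytes exchanged per step between MPI processes

def ecm_multicore_cy(v: Variant, threads: int) -> float:
    # ECM scaling assumption: cores scale until the memory bottleneck
    # saturates, so time per iteration is bounded below by t_mem_cy.
    return max((v.t_core_cy + v.t_mem_cy) / threads, v.t_mem_cy)

def comm_time_s(v: Variant, latency_s: float, bandwidth_Bps: float) -> float:
    # Simple Hockney-style model: startup latency plus transfer time.
    return latency_s + v.halo_bytes / bandwidth_Bps

def predicted_step_time_s(v, iters, threads, freq_hz, latency_s, bandwidth_Bps):
    compute = iters * ecm_multicore_cy(v, threads) / freq_hz
    return compute + comm_time_s(v, latency_s, bandwidth_Bps)

def shortlist(variants, k=2, **model):
    # Rank all variants by predicted time and keep the k fastest,
    # narrowing the search space for later empirical tuning.
    ranked = sorted(variants, key=lambda v: predicted_step_time_s(v, **model))
    return [v.name for v in ranked[:k]]
```

In this sketch, the shortlist returned by the model replaces exhaustive runtime tests: only the top-ranked candidates would be benchmarked empirically on the target platform.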