ABSTRACT
When a scientific application is migrated to a new HPC system, its code usually has to be re-tuned to achieve the best possible performance. Auto-tuning techniques are a promising approach to supporting performance portability. Often, a large pool of possible implementation variants exists from which the most efficient variant needs to be selected. Ideally, auto-tuning approaches should perform this selection efficiently for a new HPC system and for new characteristics of the input data by applying suitable analytic models and program transformations.
In this article, we discuss a performance prediction methodology for multi-core cluster applications that supports this selection process by significantly reducing the effort compared to in-depth runtime tests. The proposed methodology extends an analytical performance prediction model for shared-memory applications introduced in our previous work. It is based on the execution-cache-memory (ECM) performance model combined with estimates of intra-node and inter-node communication costs, which we apply to numerical solution methods for ordinary differential equations (ODEs). In particular, we investigate whether accurate performance predictions can be obtained for hybrid MPI/OpenMP implementation variants in order to support variant selection. We demonstrate that our approach reliably selects a set of efficient variants for a given configuration (ODE system, solver, and hardware platform) and thus narrows down the search space for possible later empirical tuning.
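To illustrate the idea, the following is a minimal sketch (not the authors' actual tool) of how an ECM-style single-core estimate can be combined with a latency-bandwidth communication model to rank implementation variants. All names, parameter values, and the simple saturation/Hockney-style formulas are assumptions for illustration only; the ECM cycle counts per variant are taken as given inputs.

```python
# Hypothetical variant-ranking sketch: predicted time per time step =
# ECM-based compute estimate + latency/bandwidth halo-exchange estimate.
from dataclasses import dataclass

@dataclass
class Variant:
    name: str
    t_core_cy: float   # assumed ECM in-core time per iteration (cycles)
    t_mem_cy: float    # assumed ECM memory-transfer time per iteration (cycles)
    halo_bytes: float  # bytes exchanged per step between MPI processes

def ecm_multicore_cy(v: Variant, threads: int) -> float:
    # ECM scaling assumption: cores scale until the memory bottleneck
    # saturates, so time per iteration is bounded below by t_mem_cy.
    return max((v.t_core_cy + v.t_mem_cy) / threads, v.t_mem_cy)

def comm_time_s(v: Variant, latency_s: float, bandwidth_Bps: float) -> float:
    # Simple Hockney-style model: startup latency plus transfer time.
    return latency_s + v.halo_bytes / bandwidth_Bps

def predicted_step_time_s(v, iters, threads, freq_hz, latency_s, bandwidth_Bps):
    compute = iters * ecm_multicore_cy(v, threads) / freq_hz
    return compute + comm_time_s(v, latency_s, bandwidth_Bps)

def shortlist(variants, k=2, **model):
    # Rank all variants by predicted time and keep the k fastest,
    # narrowing the search space for later empirical tuning.
    ranked = sorted(variants, key=lambda v: predicted_step_time_s(v, **model))
    return [v.name for v in ranked[:k]]
```

In this sketch, the shortlist returned by the model replaces exhaustive runtime tests: only the top-ranked candidates would be benchmarked empirically on the target platform.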