DOI: 10.1145/3297663.3310306
research-article

Performance Prediction of Explicit ODE Methods on Multi-Core Cluster Systems

Published: 04 April 2019

ABSTRACT

When migrating a scientific application to a new HPC system, the program code usually has to be re-tuned to achieve the best possible performance. Auto-tuning techniques are a promising approach to support the portability of performance. Often, a large pool of possible implementation variants exists from which the most efficient variant needs to be selected. Ideally, auto-tuning approaches should be capable of undertaking this task in an efficient manner for a new HPC system and new characteristics of the input data by applying suitable analytic models and program transformations.

In this article, we discuss a performance prediction methodology for multi-core cluster applications, which can assist this selection process by significantly reducing the selection effort compared to in-depth runtime tests. The methodology proposed is an extension of an analytical performance prediction model for shared-memory applications introduced in our previous work. Our methodology is based on the execution-cache-memory (ECM) performance model and estimations of intra-node and inter-node communication costs, which we apply to numerical solution methods for ordinary differential equations (ODEs). In particular, we investigate whether it is possible to obtain accurate performance predictions for hybrid MPI/OpenMP implementation variants in order to support the variant selection. We demonstrate that our approach is able to reliably select a set of efficient variants for a given configuration (ODE system, solver and hardware platform) and, thus, to narrow down the search space for possible later empirical tuning.
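The selection step described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `VariantPrediction` and `shortlist`, the additive combination of node-level and communication time (that is, no overlap of computation and communication), the 10% tolerance, and all numeric values are hypothetical placeholders. In the methodology itself, the node-level term would come from an ECM-based prediction and the communication term from estimates of intra-node and inter-node communication costs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantPrediction:
    """Predicted cost of one implementation variant (hypothetical structure)."""
    name: str
    compute_s: float  # assumed node-level prediction per time step, in seconds
    comm_s: float     # assumed communication estimate per time step, in seconds

    @property
    def total_s(self) -> float:
        # Assumption: computation and communication do not overlap,
        # so the two contributions are simply added.
        return self.compute_s + self.comm_s


def shortlist(predictions, tolerance=0.10):
    """Return the variants whose predicted time lies within `tolerance`
    (relative) of the best prediction, fastest first."""
    if not predictions:
        return []
    ranked = sorted(predictions, key=lambda p: p.total_s)
    cutoff = ranked[0].total_s * (1.0 + tolerance)
    return [p for p in ranked if p.total_s <= cutoff]


if __name__ == "__main__":
    # Made-up placeholder values, not measurements from the paper.
    candidates = [
        VariantPrediction("variant_A", compute_s=1.0, comm_s=0.2),
        VariantPrediction("variant_B", compute_s=0.9, comm_s=0.35),
        VariantPrediction("variant_C", compute_s=1.5, comm_s=0.1),
    ]
    for p in shortlist(candidates):
        print(p.name)
```

Returning a shortlist rather than a single winner matches the stated goal of narrowing the search space: the surviving variants can then be compared by empirical tuning, which absorbs some of the prediction error.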

  • Published in

    ICPE '19: Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering
    April 2019
    348 pages
    ISBN: 9781450362399
    DOI: 10.1145/3297663

    Copyright © 2019 ACM

    Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 4 April 2019

    Qualifiers

    • research-article

    Acceptance Rates

    ICPE '19 Paper Acceptance Rate: 13 of 71 submissions, 18%
    Overall Acceptance Rate: 252 of 851 submissions, 30%
