ABSTRACT
MPI implementations are becoming increasingly complex and highly tunable, so scalability limitations can arise from numerous sources. The MPI Tool Information Interface (MPI_T), introduced as part of the MPI 3.0 standard, gives performance tools and external software an opportunity to introspect and understand MPI runtime behavior at a deeper level and to detect scalability issues. The interface also provides a mechanism to reconfigure the MPI library dynamically at runtime to fine-tune performance. In this paper, we propose an infrastructure that extends existing components (TAU, MVAPICH2, and BEACON) to take advantage of the MPI_T interface and offer runtime introspection, online monitoring, recommendation generation, and autotuning capabilities. We validate our design by developing optimizations for a combination of production and synthetic applications. Using this infrastructure, we implement an autotuning policy for AmberMD [1] that monitors and reduces the internal memory footprint of the MVAPICH2 library by 20% without affecting performance. For applications whose collective communication is latency-sensitive, such as MiniAMR [2], our infrastructure generates recommendations to enable the hardware offloading of collectives supported by MVAPICH2. Implementing this recommendation yields a 5% improvement in application runtime.
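To illustrate the kind of introspection MPI_T enables, the sketch below enumerates the control variables (cvars) an MPI library exposes, much as a tool such as TAU would before tuning any of them. It uses only routines defined in the MPI 3.1 standard; which cvars actually appear is implementation-specific (MVAPICH2 exports its own set), and the sketch assumes an installed MPI library to compile and run against.

```c
/* Sketch: list the MPI_T control variables (cvars) exposed by the linked
 * MPI library. The MPI_T calls are standard; the resulting variable names
 * and their meanings are implementation-specific. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, num_cvars;

    /* MPI_T has its own lifetime and may be initialized before MPI itself. */
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);

    MPI_T_cvar_get_num(&num_cvars);
    for (int i = 0; i < num_cvars; i++) {
        char name[256], desc[1024];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, bind, scope;
        MPI_Datatype datatype;
        MPI_T_enum enumtype;

        /* name_len/desc_len are in-out: pass buffer sizes, get string lengths. */
        if (MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &datatype,
                                &enumtype, desc, &desc_len, &bind,
                                &scope) == MPI_SUCCESS)
            printf("cvar %3d: %s (scope %d)\n", i, name, scope);
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}
```

A tuning tool would follow this enumeration with MPI_T_cvar_handle_alloc and MPI_T_cvar_write on a writable cvar of interest; the analogous MPI_T_pvar_* routines provide the read side used for the runtime monitoring described above. No test is attached because the program requires an MPI installation to build and run.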
- David A Case, Thomas E Cheatham, Tom Darden, Holger Gohlke, Ray Luo, Kenneth M Merz, Alexey Onufriev, Carlos Simmerling, Bing Wang, and Robert J Woods. The Amber biomolecular simulation programs. Journal of Computational Chemistry, 26(16):1668--1688, 2005. http://ambermd.org/.
- Michael A Heroux, Douglas W Doerfler, Paul S Crozier, James M Willenbring, H Carter Edwards, Alan Williams, Mahesh Rajan, Eric R Keiter, Heidi K Thornquist, and Robert W Numrich. Improving performance via mini-applications. Sandia National Laboratories, Tech. Rep. SAND2009-5574, 3, 2009. https://mantevo.org/.
- MPI Forum. MPI: A Message-Passing Interface Standard, Version 3.1. June 2015. http://mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf.
- Sameer S. Shende and Allen D. Malony. The TAU Parallel Performance System. Int. J. High Perform. Comput. Appl., 20(2):287--311, May 2006. http://tau.uoregon.edu.
- Jiuxing Liu, Jiesheng Wu, Sushmitha P Kini, Pete Wyckoff, and Dhabaleswar K Panda. High performance RDMA-based MPI implementation over InfiniBand. In Proceedings of the 17th Annual International Conference on Supercomputing, pages 295--304. ACM, 2003.
- Edgar Gabriel, Graham E Fagg, George Bosilca, Thara Angskun, Jack J Dongarra, Jeffrey M Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, et al. Open MPI: Goals, concept, and design of a next generation MPI implementation. In European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting, pages 97--104. Springer, 2004.
- William Gropp, Ewing Lusk, Nathan Doss, and Anthony Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22(6):789--828, 1996.
- Marc Pérache, Hervé Jourdren, and Raymond Namyst. MPC: A Unified Parallel Runtime for Clusters of NUMA Machines. In Proceedings of the 14th International Euro-Par Conference on Parallel Processing, Euro-Par '08, pages 78--88, Berlin, Heidelberg, 2008. Springer-Verlag.
- Rainer Keller, George Bosilca, Graham Fagg, Michael Resch, and Jack J. Dongarra. Implementation and Usage of the PERUSE-Interface in Open MPI. In Proceedings, 13th European PVM/MPI Users' Group Meeting, Lecture Notes in Computer Science, Bonn, Germany, September 2006. Springer-Verlag.
- Tanzima Islam, Kathryn Mohror, and Martin Schulz. Exploring the Capabilities of the New MPI_T Interface. In Proceedings of the 21st European MPI Users' Group Meeting, EuroMPI/ASIA '14, pages 91:91--91:96, New York, NY, USA, 2014. ACM. https://computation.llnl.gov/projects/mpi_t/gyan.
- Esthela Gallardo, Jerome Vienne, Leonardo Fialho, Patricia Teller, and James Browne. MPI Advisor: A Minimal Overhead Tool for MPI Library Performance Tuning. In Proceedings of the 22nd European MPI Users' Group Meeting, EuroMPI '15, pages 6:1--6:10, New York, NY, USA, 2015. ACM.
- Esthela Gallardo, Jérôme Vienne, Leonardo Fialho, Patricia Teller, and James Browne. Employing MPI_T in MPI Advisor to optimize application performance. The International Journal of High Performance Computing Applications, 2016. Online first.
- Jeffrey Vetter and Chris Chambreau. mpiP: Lightweight, Scalable MPI Profiling. 2005. http://mpip.sourceforge.net.
- Mohamad Chaarawi, Jeffrey M. Squyres, Edgar Gabriel, and Saber Feki. A Tool for Optimizing Runtime Parameters of Open MPI, pages 210--217. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008. https://www.open-mpi.org/projects/otpo/.
- M. Gerndt and M. Ott. Automatic Performance Analysis with Periscope. Concurr. Comput.: Pract. Exper., 22(6):736--748, April 2010. http://periscope.in.tum.de/.
- Anna Sikora, Eduardo César, Isaías Comprés, and Michael Gerndt. Autotuning of MPI Applications Using PTF. In Proceedings of the ACM Workshop on Software Engineering Methods for Parallel and High Performance Applications, SEM4HPC '16, pages 31--38, New York, NY, USA, 2016. ACM.
- Simone Pellegrini, Thomas Fahringer, Herbert Jordan, and Hans Moritsch. Automatic Tuning of MPI Runtime Parameter Settings by Using Machine Learning. In Proceedings of the 7th ACM International Conference on Computing Frontiers, CF '10, pages 115--116, New York, NY, USA, 2010. ACM.
- Kevin Huck, Sameer Shende, Allen Malony, Hartmut Kaiser, Allan Porterfield, Rob Fowler, and Ron Brightwell. An Early Prototype of an Autonomic Performance Environment for Exascale. In Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers, ROSS '13, pages 8:1--8:8, New York, NY, USA, 2013. ACM. http://khuck.github.io/xpress-apex/.
- Swann Perarnau, Rinku Gupta, Pete Beckman, et al. Argo: An Exascale Operating System and Runtime, 2015. http://sc15.supercomputing.org/sites/all/themes/SC15images/tech_poster/poster_files/post298s2-file2.pdf.
- Swann Perarnau, Rajeev Thakur, Kamil Iskra, Ken Raffenetti, Franck Cappello, Rinku Gupta, Pete Beckman, Marc Snir, Henry Hoffmann, Martin Schulz, and Barry Rountree. Distributed Monitoring and Management of Exascale Systems in the Argo Project. In Proceedings of the 15th IFIP WG 6.1 International Conference on Distributed Applications and Interoperable Systems - Volume 9038, pages 173--178, New York, NY, USA, 2015. Springer-Verlag New York, Inc.
- TACC Stampede cluster. The University of Texas at Austin: http://www.tacc.utexas.edu.
- Richard L Graham, Devendar Bureddy, Pak Lui, Hal Rosenstock, Gilad Shainer, Gil Bloch, Dror Goldenberg, Mike Dubman, Sasha Kotchubievsky, Vladimir Koushnir, et al. Scalable hierarchical aggregation protocol (SHArP): a hardware architecture for efficient data reduction. In Proceedings of the First Workshop on Optimization of Communication in HPC, pages 1--10. IEEE Press, 2016.
- Andreas Knüpfer, Holger Brunst, Jens Doleschal, Matthias Jurenz, Matthias Lieber, Holger Mickler, Matthias S Müller, and Wolfgang E Nagel. The Vampir performance analysis tool-set. In Tools for High Performance Computing, pages 139--155. Springer, 2008. www.vampir.eu.
Index Terms: MPI performance engineering with the MPI tool interface: the integration of MVAPICH and TAU