ABSTRACT
This paper provides an in-depth analysis of the software overheads in the MPI performance-critical path and exposes performance overheads that are mandated by the MPI-3.1 specification. We first present a highly optimized implementation of the MPI-3.1 standard in which the communication stack---all the way from the application to the low-level network communication API---takes only a few tens of instructions. We carefully study these instructions and trace the root causes of the overheads to specific requirements of the MPI standard that cannot be avoided under the current specification. We then recommend potential changes to the MPI standard that can minimize these overheads. Our experimental results on a variety of network architectures and applications demonstrate significant benefits from the proposed changes.
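As context for the critical-path analysis summarized above, the following is a minimal sketch (not taken from the paper's measurement harness) of the kind of point-to-point exchange whose per-message software path---from the MPI call down to the network API---is the subject of the analysis; the ping-pong structure, buffer sizes, and tags are illustrative assumptions.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal ping-pong between ranks 0 and 1. Each MPI_Isend/MPI_Irecv/MPI_Waitall
 * call traverses the MPI software stack (argument checking, communicator and
 * datatype resolution, message matching, request and completion tracking)
 * before reaching the low-level network API -- the per-message path whose
 * instruction count the paper analyzes. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char sbuf[8] = "ping", rbuf[8];
    MPI_Request reqs[2];

    if (size >= 2 && rank < 2) {
        int peer = 1 - rank;                 /* rank 0 <-> rank 1 */
        MPI_Irecv(rbuf, sizeof rbuf, MPI_CHAR, peer, /*tag=*/0,
                  MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sbuf, sizeof sbuf, MPI_CHAR, peer, /*tag=*/0,
                  MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        printf("rank %d received \"%s\"\n", rank, rbuf);
    }

    MPI_Finalize();
    return 0;
}
```

Per-call costs such as tag and communicator matching, datatype resolution, and request allocation in this exchange are representative of the overheads that the paper ties back to specific requirements of the MPI-3.1 specification.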