Research article
DOI: 10.1145/2063384.2063487

Improving communication performance in dense linear algebra via topology aware collectives

Published: 12 November 2011

Abstract

Recent results have shown that topology aware mapping reduces network contention in communication-intensive kernels on massively parallel machines. We demonstrate that on mesh interconnects, topology aware mapping also allows for the utilization of highly-efficient topology aware collectives. We map novel 2.5D dense linear algebra algorithms to exploit rectangular collectives on cuboid partitions allocated by a Blue Gene/P supercomputer. Our mappings allow the algorithms to exploit optimized line multicasts and reductions. Commonly used 2D algorithms cannot be mapped in this fashion. On 16,384 nodes (65,536 cores) of Blue Gene/P, 2.5D algorithms that exploit rectangular collectives are significantly faster than 2D matrix multiplication (MM) and LU factorization, up to 8.7x and 2.1x, respectively. These speed-ups are due to communication reduction (up to 95.6% for 2.5D MM with respect to 2D MM). We also derive novel LogP-based performance models for rectangular broadcasts and reductions. Using these models, we model the performance of matrix multiplication and LU factorization on a hypothetical exascale architecture.
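To make the communication-reduction figures concrete: a 2D algorithm such as SUMMA moves Θ(n²/√p) words per processor, while a 2.5D algorithm that keeps c replicas of the matrices on a √(p/c) × √(p/c) × c processor grid moves Θ(n²/√(cp)) words, a factor of √c less. The sketch below (plain Python; the matrix dimension and node count are hypothetical) illustrates these well-known asymptotic costs; it is not the paper's own model or its measured results.

```python
import math

def words_2d(n, p):
    """Per-processor bandwidth cost of 2D (SUMMA-style) matrix
    multiplication: Theta(n^2 / sqrt(p)) words moved."""
    return n * n / math.sqrt(p)

def words_25d(n, p, c):
    """Per-processor bandwidth cost of 2.5D matrix multiplication with
    c-fold replication on a sqrt(p/c) x sqrt(p/c) x c processor grid:
    Theta(n^2 / sqrt(c*p)), a factor of sqrt(c) less than 2D."""
    assert c >= 1 and c ** 3 <= p, "replication factor is limited to p^(1/3)"
    return n * n / math.sqrt(c * p)

if __name__ == "__main__":
    n, p = 65536, 16384  # hypothetical matrix dimension and node count
    for c in (1, 4, 16):
        saving = 1.0 - words_25d(n, p, c) / words_2d(n, p)
        print(f"c = {c:2d}: predicted bandwidth saving {saving:.0%}")
```

At c = 16 this asymptotic model predicts a 75% reduction; the 95.6% figure reported above is larger in part because it is measured end-to-end and also reflects the efficiency of the rectangular collectives themselves, not just the replication factor.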


Published In

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
November 2011, 866 pages
ISBN: 9781450307710
DOI: 10.1145/2063384

Publisher

Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. communication
      2. exascale
      3. interconnect topology
      4. mapping
      5. performance


Conference

SC '11

Acceptance Rates

SC '11 paper acceptance rate: 74 of 352 submissions, 21%
Overall acceptance rate: 1,516 of 6,373 submissions, 24%
