Research article
DOI: 10.1145/2063384.2063487

Improving communication performance in dense linear algebra via topology aware collectives

Published: 12 November 2011

Abstract

Recent results have shown that topology aware mapping reduces network contention in communication-intensive kernels on massively parallel machines. We demonstrate that on mesh interconnects, topology aware mapping also allows for the utilization of highly-efficient topology aware collectives. We map novel 2.5D dense linear algebra algorithms to exploit rectangular collectives on cuboid partitions allocated by a Blue Gene/P supercomputer. Our mappings allow the algorithms to exploit optimized line multicasts and reductions. Commonly used 2D algorithms cannot be mapped in this fashion. On 16,384 nodes (65,536 cores) of Blue Gene/P, 2.5D algorithms that exploit rectangular collectives are significantly faster than 2D matrix multiplication (MM) and LU factorization, up to 8.7x and 2.1x, respectively. These speed-ups are due to communication reduction (up to 95.6% for 2.5D MM with respect to 2D MM). We also derive novel LogP-based performance models for rectangular broadcasts and reductions. Using these models, we model the performance of matrix multiplication and LU factorization on a hypothetical exascale architecture.
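To make the communication-reduction figures concrete: a 2D algorithm such as SUMMA moves Θ(n²/√p) words per processor, while a 2.5D algorithm that keeps c replicas of the matrices on a √(p/c) × √(p/c) × c processor grid moves Θ(n²/√(cp)) words, a factor of √c less. The sketch below (plain Python; the matrix dimension and node count are hypothetical) illustrates these well-known asymptotic costs; it is not the paper's own model or its measured results.

```python
import math

def words_2d(n, p):
    """Per-processor bandwidth cost of 2D (SUMMA-style) matrix
    multiplication: Theta(n^2 / sqrt(p)) words moved."""
    return n * n / math.sqrt(p)

def words_25d(n, p, c):
    """Per-processor bandwidth cost of 2.5D matrix multiplication with
    c-fold replication on a sqrt(p/c) x sqrt(p/c) x c processor grid:
    Theta(n^2 / sqrt(c*p)), a factor of sqrt(c) less than 2D."""
    assert c >= 1 and c ** 3 <= p, "replication factor is limited to p^(1/3)"
    return n * n / math.sqrt(c * p)

if __name__ == "__main__":
    n, p = 65536, 16384  # hypothetical matrix dimension and node count
    for c in (1, 4, 16):
        saving = 1.0 - words_25d(n, p, c) / words_2d(n, p)
        print(f"c = {c:2d}: predicted bandwidth saving {saving:.0%}")
```

At c = 16 this asymptotic model predicts a 75% reduction; the 95.6% figure reported above is larger in part because it is measured end-to-end and also reflects the efficiency of the rectangular collectives themselves, not just the replication factor.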


Published In

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
November 2011, 866 pages
ISBN: 9781450307710
DOI: 10.1145/2063384

Publisher

Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. communication
      2. exascale
      3. interconnect topology
      4. mapping
      5. performance


Conference

SC '11

Acceptance Rates

SC '11 paper acceptance rate: 74 of 352 submissions, 21%
Overall acceptance rate: 1,516 of 6,373 submissions, 24%
