ABSTRACT
We propose a novel approach to iterated sparse-matrix dense-matrix multiplication, a fundamental computational kernel in scientific computing and graph neural network training. When the matrices exceed the memory of a single compute node, data transfer becomes a bottleneck. Approaches based on dense matrix multiplication algorithms scale sub-optimally and fail to exploit the sparsity in the problem. To address these challenges, we propose decomposing the sparse matrix into a small number of highly structured matrices called arrow matrices, which are connected by permutations. Our approach enables communication-avoiding multiplications, achieving a polynomial reduction in communication volume per iteration for matrices corresponding to planar graphs and other minor-excluded families of graphs. Our evaluation demonstrates that our approach outperforms a state-of-the-art method for sparse matrix multiplication on matrices with hundreds of millions of rows, offering near-linear strong and weak scaling.
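To picture the structure the abstract refers to: an arrow matrix confines its nonzeros to the first few rows, the first few columns, and a band around the diagonal, so a sparse-times-dense product only has to touch those regions. The following is a minimal NumPy sketch under that definition; the half-width `b` and the names `arrow_mask` and `arrow_spmm` are illustrative assumptions, not the paper's API, and the dense-array representation is for clarity only.

```python
import numpy as np

def arrow_mask(n, b):
    """Boolean mask of an n x n arrow matrix of half-width b: nonzeros
    allowed only in the first b rows, the first b columns, and a
    diagonal band of half-width b."""
    i, j = np.indices((n, n))
    return (i < b) | (j < b) | (np.abs(i - j) < b)

def arrow_spmm(A, X, b):
    """Multiply an arrow matrix A (half-width b) with a dense n x k
    matrix X, touching only the structurally nonzero region of A."""
    n, k = X.shape
    Y = np.empty((n, k))
    Y[:b] = A[:b] @ X  # "arrowhead" rows are dense across the row
    for i in range(b, n):
        # contribution of the arrow columns (j < b) ...
        Y[i] = A[i, :b] @ X[:b]
        # ... plus the diagonal band restricted to j >= b,
        # so the two terms never overlap
        lo, hi = max(b, i - b + 1), min(n, i + b)
        Y[i] += A[i, lo:hi] @ X[lo:hi]
    return Y
```

Because each band row reads only O(b) entries of `X`, a distributed version needs to exchange only the arrowhead block and a b-wide halo per iteration, which is the intuition behind the communication savings the abstract claims.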
Index Terms
- Arrow Matrix Decomposition: A Novel Approach for Communication-Efficient Sparse Matrix Multiplication