ABSTRACT
We propose a novel approach to iterated sparse-matrix dense-matrix multiplication, a fundamental computational kernel in scientific computing and graph neural network training. When the matrices exceed the memory of a single compute node, data transfer becomes a bottleneck. Approaches based on dense matrix multiplication algorithms scale sub-optimally and fail to exploit the sparsity in the problem. To address these challenges, we propose decomposing the sparse matrix into a small number of highly structured matrices called arrow matrices, which are connected by permutations. Our approach enables communication-avoiding multiplications, achieving a polynomial reduction in communication volume per iteration for matrices corresponding to planar graphs and other minor-excluded families of graphs. Our evaluation demonstrates that our approach outperforms a state-of-the-art method for sparse matrix multiplication on matrices with hundreds of millions of rows, offering near-linear strong and weak scaling.
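To picture the structure the abstract refers to: an arrow matrix confines its nonzeros to the first few rows, the first few columns, and a band around the diagonal, so a sparse-times-dense product only has to touch those regions. The following is a minimal NumPy sketch under that definition; the half-width `b` and the names `arrow_mask` and `arrow_spmm` are illustrative assumptions, not the paper's API, and the dense-array representation is for clarity only.

```python
import numpy as np

def arrow_mask(n, b):
    """Boolean mask of an n x n arrow matrix of half-width b: nonzeros
    allowed only in the first b rows, the first b columns, and a
    diagonal band of half-width b."""
    i, j = np.indices((n, n))
    return (i < b) | (j < b) | (np.abs(i - j) < b)

def arrow_spmm(A, X, b):
    """Multiply an arrow matrix A (half-width b) with a dense n x k
    matrix X, touching only the structurally nonzero region of A."""
    n, k = X.shape
    Y = np.empty((n, k))
    Y[:b] = A[:b] @ X  # "arrowhead" rows are dense across the row
    for i in range(b, n):
        # contribution of the arrow columns (j < b) ...
        Y[i] = A[i, :b] @ X[:b]
        # ... plus the diagonal band restricted to j >= b,
        # so the two terms never overlap
        lo, hi = max(b, i - b + 1), min(n, i + b)
        Y[i] += A[i, lo:hi] @ X[lo:hi]
    return Y
```

Because each band row reads only O(b) entries of `X`, a distributed version needs to exchange only the arrowhead block and a b-wide halo per iteration, which is the intuition behind the communication savings the abstract claims.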
Index Terms
- Arrow Matrix Decomposition: A Novel Approach for Communication-Efficient Sparse Matrix Multiplication