skip to main content
10.1145/3627535.3638496acmconferencesArticle/Chapter ViewAbstractPublication PagesppoppConference Proceedingsconference-collections
research-article

Arrow Matrix Decomposition: A Novel Approach for Communication-Efficient Sparse Matrix Multiplication

Authors Info & Claims
Published:20 February 2024Publication History

ABSTRACT

We propose a novel approach to iterated sparse matrix dense matrix multiplication, a fundamental computational kernel in scientific computing and graph neural network training. In cases where matrix sizes exceed the memory of a single compute node, data transfer becomes a bottleneck. An approach based on dense matrix multiplication algorithms leads to sub-optimal scalability and fails to exploit the sparsity in the problem. To address these challenges, we propose decomposing the sparse matrix into a small number of highly structured matrices called arrow matrices, which are connected by permutations. Our approach enables communication-avoiding multiplications, achieving a polynomial reduction in communication volume per iteration for matrices corresponding to planar graphs and other minor-excluded families of graphs. Our evaluation demonstrates that our approach outperforms a state-of-the-art method for sparse matrix multiplication on matrices with hundreds of millions of rows, offering near-linear strong and weak scaling.

Skip Supplemental Material Section

Supplemental Material

References

  1. Acer, S., Selvitopi, R. O., and Aykanat, C. Improving performance of sparse matrix dense matrix multiplication on large-scale parallel systems. Parallel Comput. 59 (2016), 71--96.Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Albert, R., and Barabási, A.-L. Statistical mechanics of complex networks. Rev. Mod. Phys. 74 (Jan 2002), 47--97.Google ScholarGoogle ScholarCross RefCross Ref
  3. Albert, R., Jeong, H., and Barabási, A. Diameter of the world-wide web. 130--131.Google ScholarGoogle Scholar
  4. Alemany-Puig, L., Esteban, J. L., and Ferrer-i-Cancho, R. Minimum projective linearizations of trees in linear time. Inf. Process. Lett. 174 (2022), 106204.Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Balakrishnan, N., and Nevzorov, V. B. A primer on statistical distributions. John Wiley & Sons, 2004.Google ScholarGoogle Scholar
  6. Ballard, G., Druinsky, A., Knight, N., and Schwartz, O. Hypergraph partitioning for sparse matrix-matrix multiplication. ACM Trans. Parallel Comput. 3, 3 (2016), 18:1--18:34.Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Boamah-Addo, K., Kozubowski, T., and Panorska, A. A discrete truncated zipf distribution. Statistica Neerlandica (09 2022).Google ScholarGoogle Scholar
  8. Boman, E. G., Devine, K. D., and Rajamanickam, S. Scalable matrix computations on large scale-free graphs using 2d graph partitioning. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC'13, Denver, CO, USA - November 17 -- 21, 2013 (2013), W. Gropp and S. Matsuoka, Eds., ACM, pp. 50:1--50:12.Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Borstnik, U., VandeVondele, J., Weber, V., and Hutter, J. Sparse matrix multiplication: The distributed block-compressed sparse row library. Parallel Comput. 40, 5--6 (2014), 47--58.Google ScholarGoogle ScholarCross RefCross Ref
  10. Böttcher, J., Pruessmann, K. P., Taraz, A., and Würfl, A. Bandwidth, expansion, treewidth, separators and universality for bounded-degree graphs. Eur. J. Comb. 31, 5 (2010), 1217--1227.Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Buono, D., Petrini, F., Checconi, F., Liu, X., Que, X., Long, C., and Tuan, T. Optimizing sparse matrix-vector multiplication for large-scale data analytics. In Proceedings of the 2016 International Conference on Supercomputing, ICS 2016, Istanbul, Turkey, June 1--3, 2016 (2016), O. Ozturk, K. Ebcioglu, M. T. Kandemir, and O. Mutlu, Eds., ACM, pp. 37:1--37:12.Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Çatalyürek, Ü. V., and Aykanat, C. Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication. IEEE Trans. Parallel Distributed Syst. 10, 7 (1999), 673--693.Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Chan, E., Heimlich, M., Purkayastha, A., and van de Geijn, R. A. Collective communication: theory, practice, and experience. Concurr. Comput. Pract. Exp. 19, 13 (2007), 1749--1783.Google ScholarGoogle ScholarCross RefCross Ref
  14. Charikar, M., Hajiaghayi, M. T., Karloff, H. J., and Rao, S. l22 spreading metrics for vertex ordering problems. Algorithmica 56, 4 (2010), 577--604.Google ScholarGoogle ScholarCross RefCross Ref
  15. Chinn, P. Z., Chvatalova, J., Dewdney, A. K., and Gibbs, N. E. The bandwidth problem for graphs and matrices - a survey. J. Graph Theory 6, 3 (1982), 223--254.Google ScholarGoogle ScholarCross RefCross Ref
  16. Cuthill, E. H., and McKee, J. Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 24th national conference, ACM 1969, USA, 1969 (1969), S. L. Pollack, T. R. Dines, W. C. Sangren, N. R. Nielsen, W. G. Gerkin, A. E. Corduan, L. Nowak, J. L. Mueller, J. H. III, P. S. T. Yuen, J. Stein, and M. M. Mueller, Eds., ACM, pp. 157--172.Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Dalcín, L., Paz, R., and Storti, M. Mpi for python. Journal of Parallel and Distributed Computing 65, 9 (2005), 1108--1115.Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Davis, T. A., and Hu, Y. The university of florida sparse matrix collection. ACM Trans. Math. Softw. 38, 1 (dec 2011).Google ScholarGoogle Scholar
  19. Devine, K. D., Boman, E. G., Heaphy, R. T., Bisseling, R. H., and Çatalyürek, Ü. V. Parallel hypergraph partitioning for scientific computing. In 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), Proceedings, 25--29 April 2006, Rhodes Island, Greece (2006), IEEE.Google ScholarGoogle ScholarCross RefCross Ref
  20. Eikel, M., Scheideler, C., and Setzer, A. Minimum linear arrangement of series-parallel graphs. In Approximation and Online Algorithms - 12th International Workshop, WAOA 2014, Wrocław, Poland, September 11--12, 2014, Revised Selected Papers (2014), E. Bampis and O. Svensson, Eds., vol. 8952 of Lecture Notes in Computer Science, Springer, pp. 168--180.Google ScholarGoogle Scholar
  21. Feige, U. Approximating the bandwidth via volume respecting em-beddings. J. Comput. Syst. Sci. 60, 3 (2000), 510--539.Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Feige, U. Coping with the np-hardness of the graph bandwidth problem. In Algorithm Theory - SWAT 2000, 7th Scandinavian Workshop on Algorithm Theory, Bergen, Norway, July 5--7, 2000, Proceedings (2000), M. M. Halldörsson, Ed., vol. 1851 of Lecture Notes in Computer Science, Springer, pp. 10--19.Google ScholarGoogle Scholar
  23. Geng, T., Wu, C., Zhang, Y., Tan, C., Xie, C., You, H., Herbordt, M. C., Lin, Y., and Li, A. I-GCN: A graph convolutional network accelerator with runtime locality enhancement through islandization. In MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture, Virtual Event, Greece, October 18--22, 2021 (2021), ACM, pp. 1051--1063.Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Golub, G. H., and Van Loan, C. F. Matrix computations. JHU press, 2013.Google ScholarGoogle ScholarCross RefCross Ref
  25. Graham, R. L., Knuth, D. E., and Patashnik, O. Concrete Mathematics: A Foundation for Computer Science, second ed. Addison-Wesley, Reading, MA, 1994.Google ScholarGoogle Scholar
  26. Gravvanis, G. A. An approximate inverse matrix technique for arrowhead matrices. Int. J. Comput. Math. 70, 1 (1998), 35--45.Google ScholarGoogle ScholarCross RefCross Ref
  27. Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., Gérard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., and Oliphant, T. E. Array programming with NumPy. Nature 585, 7825 (Sept. 2020), 357--362.Google ScholarGoogle Scholar
  28. Kawarabayashi, K., and Reed, B. A. A separator theorem in minor-closed classes. In 51th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2010, October 23--26, 2010, Las Vegas, Nevada, USA (2010), IEEE Computer Society, pp. 153--162.Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Kaya, K., Uçar, B., and Çatalyürek, Ü. V. Analysis of partitioning models and metrics in parallel sparse matrix-vector multiplication. In Parallel Processing and Applied Mathematics - 10th International Conference, PPAM 2013, Warsaw, Poland, September 8--11, 2013, Revised Selected Papers, Part II (2013), R. Wyrzykowski, J. J. Dongarra, K. Karczewski, and J. Wasniewski, Eds., vol. 8385 of Lecture Notes in Computer Science, Springer, pp. 174--184.Google ScholarGoogle Scholar
  30. Koanantakool, P., Azad, A., Buluç, A., Morozov, D., Oh, S., Oliker, L., and Yelick, K. A. Communication-avoiding parallel sparse-dense matrix-matrix multiplication. In 2016 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2016, Chicago, IL, USA, May 23--27, 2016 (2016), IEEE Computer Society, pp. 842--853.Google ScholarGoogle ScholarCross RefCross Ref
  31. Kozubowski, T. J., Panorska, A. K., and Forister, M. L. A discrete truncated pareto distribution. Statistical Methodology 26 (2015), 135--150.Google ScholarGoogle ScholarCross RefCross Ref
  32. Lanczos, C. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. Journal of Research of the National Bureau of Standards 45, 4 (Oct. 1950).Google ScholarGoogle ScholarCross RefCross Ref
  33. Lipton, R. J., and Tarjan, R. E. A separator theorem for planar graphs. SIAM Journal on Applied Mathematics 36, 2 (1979), 177--189.Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Mayer, C., Mayer, R., Bhowmik, S., Epple, L., and Rothermel, K. HYPE: massive hypergraph partitioning with neighborhood expansion. In IEEE International Conference on Big Data (IEEE BigData 2018), Seattle, WA, USA, December 10--13, 2018 (2018), N. Abe, H. Liu, C. Pu, X. Hu, N. K. Ahmed, M. Qiao, Y. Song, D. Kossmann, B. Liu, K. Lee, J. Tang, J. He, and J. S. Saltz, Eds., IEEE, pp. 458--467.Google ScholarGoogle ScholarCross RefCross Ref
  35. Muradian, D. The bandwidth minimization problem for cyclic caterpillars with hair length 1 is np-complete. Theor. Comput. Sci. 307, 3 (2003), 567--572.Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Okuta, R., Unno, Y., Nishino, D., Hido, S., and Loomis, C. Cupy: A numpy-compatible library for nvidia gpu calculations. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Thirty-first Annual Conference on Neural Information Processing Systems (NIPS) (2017).Google ScholarGoogle Scholar
  37. Page, B. A., and Kogge, P. M. Scalability of hybrid spmv with hyper-graph partitioning and vertex delegation for communication avoidance. In International Conference on High Performance Computing & Simulation (HPCS 2020) (2021).Google ScholarGoogle Scholar
  38. PAIGE, C. C. Computational Variants of the Lanczos Method for the Eigenproblem. IMA Journal of Applied Mathematics 10, 3 (12 1972), 373--381.Google ScholarGoogle Scholar
  39. Papadimitriou, C. H. The np-completeness of the bandwidth minimization problem. Computing 16, 3 (1976), 263--270.Google ScholarGoogle ScholarCross RefCross Ref
  40. Plotkin, S. A., Rao, S., and Smith, W. D. Shallow excluded minors and improved graph decompositions. In Proceedings of the Fifth Annual ACM-SIAM Symposium on Discrete Algorithms. 23--25 January 1994, Arlington, Virginia, USA (1994), D. D. Sleator, Ed., ACM/SIAM, pp. 462--470.Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Rao, S., and Richa, A. W. New approximation techniques for some linear ordering problems. SIAM J. Comput. 34, 2 (2004), 388--404.Google ScholarGoogle Scholar
  42. Raoufi, P., Rostami, H., and Bagherinezhad, H. An optimal time algorithm for minimum linear arrangement of chord graphs. Inf. Sci. 238 (2013), 212--220.Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Robertson, N., and Seymour, P. Graph minors. vi. disjoint paths across a disc. Journal of Combinatorial Theory, Series B 41, 1 (1986), 115--138.Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Schatz, M. D., van de Geijn, R. A., and Poulson, J. Parallel matrix multiplication: A systematic journey. SIAM J. Sci. Comput. 38, 6 (2016).Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Selvitopi, O., Brock, B., Nisa, I., Tripathy, A., Yelick, K. A., and Buluç, A. Distributed-memory parallel algorithms for sparse times tall-skinny-dense matrix multiplication. In ICS '21: 2021 International Conference on Supercomputing, Virtual Event, USA, June 14--17, 2021 (2021), H. Zhou, J. Moreira, F. Mueller, and Y. Etsion, Eds., ACM, pp. 431--442.Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Sun, J., Vandierendonck, H., and Nikolopoulos, D. S. Graphgrind: addressing load imbalance of graph partitioning. In Proceedings of the International Conference on Supercomputing, ICS 2017, Chicago, IL, USA, June 14--16, 2017 (2017), W. D. Gropp, P. Beckman, Z. Li, and F. J. Cazorla, Eds., ACM, pp. 16:1--16:10.Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Tripathy, A., Yelick, K. A., and Buluç, A. Reducing communication in graph neural network training. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020, Virtual Event / Atlanta, Georgia, USA, November 9--19, 2020 (2020), C. Cuicchi, I. Qualters, and W. T. Kramer, Eds., IEEE/ACM, p. 70.Google ScholarGoogle ScholarCross RefCross Ref
  48. Turner, J. S. On the probable performance of heuristics for bandwidth minimization. SIAM J. Comput. 15, 2 (1986), 561--580.Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., Carey, C. J., Polat, İ., Feng, Y., Moore, E. W., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E. A., Harris, C. R., Archibald, A. M., Ribeiro, A. H., Pedregosa, F., van Mulbregt, P., and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods 17 (2020), 261--272.Google ScholarGoogle Scholar
  50. Zheng, D., Mhembere, D., Lyzinski, V., Vogelstein, J. T., Priebe, C. E., and Burns, R. C. Semi-external memory sparse matrix multiplication for billion-node graphs. IEEE Trans. Parallel Distributed Syst. 28, 5 (2017), 1470--1483.Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Zipf, G. K. Human behavior and the principle of least effort: An introduction to human ecology. Addison-Wesley, Cambridge, Massachusetts, 1949.Google ScholarGoogle Scholar

Index Terms

  1. Arrow Matrix Decomposition: A Novel Approach for Communication-Efficient Sparse Matrix Multiplication

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        PPoPP '24: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
        March 2024
        498 pages
        ISBN:9798400704352
        DOI:10.1145/3627535

        Copyright © 2024 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 20 February 2024

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        Overall Acceptance Rate230of1,014submissions,23%
      • Article Metrics

        • Downloads (Last 12 months)458
        • Downloads (Last 6 weeks)80

        Other Metrics

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader