skip to main content
10.1145/2925426.2926273acmconferencesArticle/Chapter ViewAbstractPublication PagesicsConference Proceedingsconference-collections
research-article

Balanced Hashing and Efficient GPU Sparse General Matrix-Matrix Multiplication

Published: 01 June 2016 Publication History

Abstract

General sparse matrix-matrix multiplication (SpGEMM) is a core component of many algorithms. A number of recent works have used high throughput graphics processing units (GPUs) to accelerate SpGEMM. However, exploiting the power of GPUs for SpGEMM requires addressing a number of challenges, including highly imbalanced workloads and large numbers of inefficient random global memory accesses. This paper presents a SpGEMM algorithm which uses several novel techniques to overcome these problems. We first propose two low cost methods to achieve perfect load balancing during the most expensive step in SpGEMM. Next, we show how to eliminate nearly all random global memory accesses using shared memory based hash tables. To optimize the performance of the hash tables, we propose a lightweight method to estimate the number of nonzeros in the output matrix. We compared our algorithm to the CUSP, CUSPARSE and the state-of-the-art BHSPARSE GPU SpGEMM algorithms, and show that it performs 5.6x, 2.4x and 1.5x better on average, and up to 11.8x, 9.5x and 2.5x better in the best case, respectively. Furthermore, we show that our algorithm performs especially well on highly imbalanced and unstructured matrices.

References

[1]
Nivida cusparse library. https://developer.nvidia.com/cusparse.
[2]
R. R. Amossen, A. Campagna, and R. Pagh. Better size estimation for sparse matrix products. In Proceedings of the 13th International Conference on Approximation, and 14 the International Conference on Randomization, and Combinatorial Optimization: Algorithms and Techniques, APPROX/RANDOM'10, pages 406--419, Berlin, Heidelberg, 2010. Springer-Verlag.
[3]
N. Bell, S. Dalton, and L. N. Olson. Exposing fine-grained parallelism in algebraic multigrid methods. SIAM Journal on Scientific Computing, 34(4):C123--C152, 2012.
[4]
N. Bell and M. Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, page 18. ACM, 2009.
[5]
T. M. Chan. More algorithms for all-pairs shortest paths in weighted graphs. In Proceedings of the Thirty-ninth Annual ACM Symposium on Theory of Computing, STOC '07, pages 590--598, New York, NY, USA, 2007. ACM.
[6]
S. Dalton, N. Bell, L. Olson, and M. Garland. Cusp: Generic parallel algorithms for sparse matrix and graph computations, 2014. Version 0.5.0.
[7]
S. Dalton, L. Olson, and N. Bell. Optimizing sparse matrix-matrix multiplication for the gpu. ACM Trans. Math. Softw., 41(4):25:1--25:20, Oct. 2015.
[8]
T. A. Davis and Y. Hu. The university of florida sparse matrix collection. ACM Trans. Math. Softw., 38(1):1:1--1:25, Dec. 2011.
[9]
J. Demouth. Sparse matrix-matrix multiplication on the gpu. In Proceedings of the GPU Technology Conference, 2012.
[10]
J. R. Gilbert, S. Reinhardt, and V. B. Shah. High-performance graph algorithms from parallel sparse matrices. In B. KÃěgstrÃűm, E. Elmroth, J. Dongarra, and J. WaÅŻniewski, editors, Applied Parallel Computing. State of the Art in Scientific Computing, volume 4699 of Lecture Notes in Computer Science, pages 260--269. Springer Berlin Heidelberg, 2007.
[11]
O. Green, R. McColl, and D. A. Bader. Gpu merge path: a gpu merging algorithm. In Proceedings of the 26th ACM international conference on Supercomputing, pages 331--340. ACM, 2012.
[12]
M. Harris, S. Sengupta, and J. D. Owens. Parallel prefix sum (scan) with cuda. GPU gems, 3(39):851--876, 2007.
[13]
S. Hong, S. K. Kim, T. Oguntebi, and K. Olukotun. Accelerating cuda graph algorithms at maximum warp. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP '11, pages 267--276, New York, NY, USA, 2011. ACM.
[14]
H. Kaplan, M. Sharir, and E. Verbin. Colored intersection searching via sparse rectangular matrix multiplication. In Proceedings of the Twenty-second Annual Symposium on Computational Geometry, SCG '06, pages 52--60, New York, NY, USA, 2006. ACM.
[15]
W. Liu and B. Vinter. An efficient gpu general sparse matrix-matrix multiplication for irregular data. In Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, pages 370--381, May 2014.
[16]
D. Merrill, M. Garland, and A. Grimshaw. Scalable gpu graph traversal. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '12, pages 117--128, New York, NY, USA, 2012. ACM.
[17]
D. Merrill and A. Grimshaw. High performance and scalable radix sorting: A case study of implementing dynamic parallelism for GPU computing. Parallel Processing Letters, 21(02):245--272, 2011.
[18]
S. Sengupta, M. Harris, and M. Garland. Efficient parallel scan algorithms for gpus. NVIDIA, Santa Clara, CA, Tech. Rep. NVR-2008-003, (1):1--17, 2008.
[19]
V. Vassilevska, R. Williams, and R. Yuster. Finding heaviest h-subgraphs in real weighted graphs, with applications. ACM Trans. Algorithms, 6(3):44:1--44:23, July 2010.

Cited By

View all
  • (2024)HAM-SpMSpV: an Optimized Parallel Algorithm for Masked Sparse Matrix-Sparse Vector Multiplications on multi-core CPUsProceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing10.1145/3625549.3658680(160-173)Online publication date: 3-Jun-2024
  • (2024)Optimizing sparse general matrix–matrix multiplication for DCUsThe Journal of Supercomputing10.1007/s11227-024-06234-280:14(20176-20200)Online publication date: 30-May-2024
  • (2023)A Survey of Accelerating Parallel Sparse Linear AlgebraACM Computing Surveys10.1145/360460656:1(1-38)Online publication date: 28-Aug-2023
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
ICS '16: Proceedings of the 2016 International Conference on Supercomputing
June 2016
547 pages
ISBN:9781450343619
DOI:10.1145/2925426
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 2016

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. GPU
  2. Performance optimization
  3. Sparse matrix-matrix multiplication

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICS '16
Sponsor:

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)42
  • Downloads (Last 6 weeks)3
Reflects downloads up to 05 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2024)HAM-SpMSpV: an Optimized Parallel Algorithm for Masked Sparse Matrix-Sparse Vector Multiplications on multi-core CPUsProceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing10.1145/3625549.3658680(160-173)Online publication date: 3-Jun-2024
  • (2024)Optimizing sparse general matrix–matrix multiplication for DCUsThe Journal of Supercomputing10.1007/s11227-024-06234-280:14(20176-20200)Online publication date: 30-May-2024
  • (2023)A Survey of Accelerating Parallel Sparse Linear AlgebraACM Computing Surveys10.1145/360460656:1(1-38)Online publication date: 28-Aug-2023
  • (2023)Efficient Execution of SpGEMM on Long Vector ArchitecturesProceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing10.1145/3588195.3593000(101-113)Online publication date: 7-Aug-2023
  • (2023)A Systematic Survey of General Sparse Matrix-matrix MultiplicationACM Computing Surveys10.1145/357115755:12(1-36)Online publication date: 2-Mar-2023
  • (2023)Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric BehaviorsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2022.321782434:1(246-261)Online publication date: 1-Jan-2023
  • (2023)Predicting the Output Structure of Sparse Matrix Multiplication with Sampled Compression Ratio2022 IEEE 28th International Conference on Parallel and Distributed Systems (ICPADS)10.1109/ICPADS56603.2022.00069(483-490)Online publication date: Jan-2023
  • (2023)Orchestrating Large-Scale SpGEMMs using Dynamic Block Distribution and Data Transfer Minimization on Heterogeneous Systems2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00189(2456-2459)Online publication date: Apr-2023
  • (2023)DeltaSPARSE: High-Performance Sparse General Matrix-Matrix Multiplication on Multi-GPU Systems2023 IEEE 30th International Conference on High Performance Computing, Data, and Analytics (HiPC)10.1109/HiPC58850.2023.00037(194-202)Online publication date: 18-Dec-2023
  • (2022)TileSpGEMMProceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming10.1145/3503221.3508431(90-106)Online publication date: 2-Apr-2022
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media