Balanced Hashing and Efficient GPU Sparse General Matrix-Matrix Multiplication

Authors: Pham Nguyen Quang Anh, Yonggang Wen

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

Article No.: 36, Pages 1 - 12

https://doi.org/10.1145/2925426.2926273

Published: 01 June 2016

Abstract

General sparse matrix-matrix multiplication (SpGEMM) is a core component of many algorithms. A number of recent works have used high-throughput graphics processing units (GPUs) to accelerate SpGEMM. However, exploiting the power of GPUs for SpGEMM requires addressing a number of challenges, including highly imbalanced workloads and large numbers of inefficient random global memory accesses. This paper presents an SpGEMM algorithm that uses several novel techniques to overcome these problems. We first propose two low-cost methods to achieve perfect load balancing during the most expensive step in SpGEMM. Next, we show how to eliminate nearly all random global memory accesses using shared-memory-based hash tables. To optimize the performance of the hash tables, we propose a lightweight method to estimate the number of nonzeros in the output matrix. We compare our algorithm to the CUSP, CUSPARSE, and state-of-the-art BHSPARSE GPU SpGEMM algorithms and show that it performs 5.6x, 2.4x, and 1.5x better on average, and up to 11.8x, 9.5x, and 2.5x better in the best case, respectively. Furthermore, we show that our algorithm performs especially well on highly imbalanced and unstructured matrices.
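
To make the ideas in the abstract concrete, the following is a minimal CPU sketch of row-wise SpGEMM with a per-row hash table. It is not the authors' GPU algorithm. The CSR tuple layout and the names `row_work` and `spgemm_hash` are assumptions made for illustration. `row_work` counts the intermediate products per output row, which both measures how imbalanced the rows are and gives a simple upper bound on each row's nonzero count. `spgemm_hash` accumulates each output row in a Python dict, standing in for the shared-memory hash table a GPU implementation would use.

```python
from typing import Dict, List, Tuple

# Assumed CSR layout for this sketch: (row_ptr, col_idx, values)
CSR = Tuple[List[int], List[int], List[float]]


def row_work(a: CSR, b: CSR) -> List[int]:
    """Intermediate products per output row of A*B.

    This is an upper bound on the row's nonzero count, since products
    that land in the same column are merged.
    """
    a_ptr, a_col, _ = a
    b_ptr, _, _ = b
    work = []
    for i in range(len(a_ptr) - 1):
        total = 0
        for k in range(a_ptr[i], a_ptr[i + 1]):
            j = a_col[k]
            total += b_ptr[j + 1] - b_ptr[j]  # length of row j of B
        work.append(total)
    return work


def spgemm_hash(a: CSR, b: CSR) -> CSR:
    """Row-wise SpGEMM; each output row is accumulated in a hash table."""
    a_ptr, a_col, a_val = a
    b_ptr, b_col, b_val = b
    c_ptr: List[int] = [0]
    c_col: List[int] = []
    c_val: List[float] = []
    for i in range(len(a_ptr) - 1):
        acc: Dict[int, float] = {}  # column index -> partial sum
        for k in range(a_ptr[i], a_ptr[i + 1]):
            j = a_col[k]
            for t in range(b_ptr[j], b_ptr[j + 1]):
                col = b_col[t]
                acc[col] = acc.get(col, 0.0) + a_val[k] * b_val[t]
        for col in sorted(acc):  # emit the row with sorted column indices
            c_col.append(col)
            c_val.append(acc[col])
        c_ptr.append(len(c_col))
    return c_ptr, c_col, c_val


# A = [[1, 0, 2], [0, 0, 0], [0, 3, 0]], B = [[0, 4, 0], [5, 0, 0], [0, 6, 7]]
A = ([0, 2, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0])
B = ([0, 1, 2, 4], [1, 0, 1, 2], [4.0, 5.0, 6.0, 7.0])
print(row_work(A, B))     # [3, 0, 1]
print(spgemm_hash(A, B))  # ([0, 2, 2, 3], [1, 2, 0], [16.0, 14.0, 15.0])
```

In this small example, row 0 produces three intermediate products but only two nonzeros, because two products collide in column 1. The gap between these two counts is what a nonzero estimate would need to capture in order to size the hash tables well.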

Published In

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

June 2016

547 pages

ISBN:9781450343619

DOI:10.1145/2925426

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICS '16

Sponsor:

SIGARCH

ICS '16: 2016 International Conference on Supercomputing

June 1 - 3, 2016

Istanbul, Turkey

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
