A Cross-Platform SpMV Framework on Many-Core Architectures

Authors:
Yunquan Zhang

State Key Laboratory of Computer Architecture, Institute of Computing Technologies, Chinese Academy of Sciences, Beijing, China

State Key Laboratory of Computer Architecture, Institute of Computing Technologies, Chinese Academy of Sciences, Beijing, China
View Profile

,
Shigang Li

State Key Laboratory of Computer Architecture, Institute of Computing Technologies, Chinese Academy of Sciences, Beijing, China

State Key Laboratory of Computer Architecture, Institute of Computing Technologies, Chinese Academy of Sciences, Beijing, China
View Profile

,
Shengen Yan

SenseTime Group Limited, Department of Information Engineering, Chinese University of Hong Kong

SenseTime Group Limited, Department of Information Engineering, Chinese University of Hong Kong
View Profile

,
Huiyang Zhou

Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC

Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC
View Profile

ACM Transactions on Architecture and Code Optimization Volume 13 Issue 4Article No.: 33pp 1–25https://doi.org/10.1145/2994148

Published:25 October 2016Publication History

ACM Transactions on Architecture and Code Optimization

Abstract

Sparse Matrix-Vector multiplication (SpMV) is a key operation in engineering and scientific computing. Although the previous work has shown impressive progress in optimizing SpMV on many-core architectures, load imbalance and high memory bandwidth remain the critical performance bottlenecks. We present our novel solutions to these problems, for both GPUs and Intel MIC many-core architectures. First, we devise a new SpMV format, called Blocked Compressed Common Coordinate (BCCOO). BCCOO extends the blocked Common Coordinate (COO) by using bit flags to store the row indices to alleviate the bandwidth problem. We further improve this format by partitioning the matrix into vertical slices for better data locality. Then, to address the load imbalance problem, we propose a highly efficient matrix-based segmented sum/scan algorithm for SpMV, which eliminates global synchronization. At last, we introduce an autotuning framework to choose optimization parameters. Experimental results show that our proposed framework has a significant advantage over the existing SpMV libraries. In single precision, our proposed scheme outperforms clSpMV COCKTAIL format by 255% on average on AMD FirePro W8000, and outperforms CUSPARSE V7.0 by 73.7% on average and outperforms CSR5 by 53.6% on average on GeForce Titan X; in double precision, our proposed scheme outperforms CUSPARSE V7.0 by 34.0% on average and outperforms CSR5 by 16.2% on average on Tesla K20, and has equivalent performance compared with CSR5 on Intel MIC.

References

Muthu Manikandan Baskaran and Rajesh Bordawekar. 2008. Optimizing sparse matrix-vector multiplication on GPUs using compile-time and run-time strategies. IBM Reserach Report, RC24704 (W0812-047) (2008).Google Scholar
Nathan Bell and Michael Garland. 2009. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. ACM, 18. Google ScholarDigital Library
Guy E. Blelloch. 1989. Scans as primitive parallel operations. IEEE Transactions on Computers 38, 11 (1989), 1526--1538. Google ScholarDigital Library
Guy E. Blelloch, Michael A. Heroux, and Marco Zagha. 1993. Segmented Operations for Sparse Matrix Computation on Vector Multiprocessors. Technical Report. DTIC Document. Google Scholar
Jeff Bolz, Ian Farmer, Eitan Grinspun, and Peter Schröoder. 2003. Sparse matrix solvers on the GPU: Conjugate gradients and multigrid. In ACM Transactions on Graphics (TOG), Vol. 22. ACM, 917--924. Google ScholarDigital Library
Aydın Buluç, Samuel Williams, Leonid Oliker, and James Demmel. 2011. Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication. In Proceedings of the 2011 IEEE International Parallel 8 Distributed Processing Symposium (IPDPS). IEEE, 721--733. Google ScholarDigital Library
Jee W. Choi, Amik Singh, and Richard W. Vuduc. 2010. Model-driven autotuning of sparse matrix-vector multiply on GPUs. In ACM Sigplan Notices, Vol. 45. ACM, 115--126. Google ScholarDigital Library
Mayank Daga and Joseph L. Greathouse. 2015. Structural agnostic SpMV: Adapting CSR-adaptive for irregular matrices. In Proceedings of the 2015 IEEE 22nd International Conference on High Performance Computing (HiPC). IEEE, 64--74. Google ScholarDigital Library
Timothy A. Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS) 38, 1 (2011), 1. Google ScholarDigital Library
Yuri Dotsenko, Naga K. Govindaraju, Peter-Pike Sloan, Charles Boyd, and John Manferdelli. 2008. Fast scan algorithms on graphics processors. In Proceedings of the 22nd Annual International Conference on Supercomputing. ACM, 205--213. Google ScholarDigital Library
Joseph L. Greathouse and Mayank Daga. 2014. Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format. In SC14 Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 769--780. Google ScholarDigital Library
Mark Harris, John Owens, Shubho Sengupta, Yao Zhang, and Andrew Davidson. 2007. CUDPP: CUDA data parallel primitives library. (2007).Google Scholar
Zbigniew Koza, Maciej Matyka, Sebastian Szkoda, and Łukasz Mirosław. 2012. Compressed Multiple-row Storage Format. Technical Report.Google Scholar
Moritz Kreutzer, Georg Hager, Gerhard Wellein, Holger Fehske, and Alan R. Bishop. 2014. A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units. SIAM Journal on Scientific Computing 36, 5 (2014), C401--C423. Google ScholarDigital Library
ShiGang Li, ChangJun Hu, JunChao Zhang, and YunQuan Zhang. 2015. Automatic tuning of sparse matrix-vector multiplication on multicore clusters. Science China Information Sciences 58, 9 (2015), 1--14. Google ScholarCross Ref
Weifeng Liu and Brian Vinter. 2015. Csr5: An efficient storage format for cross-platform sparse matrix-vector multiplication. In Proceedings of the 29th ACM on International Conference on Supercomputing. ACM, 339--350. Google ScholarDigital Library
Xing Liu, Mikhail Smelyanskiy, Edmond Chow, and Pradeep Dubey. 2013. Efficient sparse matrix-vector multiplication on x86-based many-core processors. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing. ACM, 273--282. Google ScholarDigital Library
Yongchao Liu and Bertil Schmidt. 2015. LightSpMV: Faster CSR-based sparse matrix-vector multiplication on CUDA-enabled GPUs. In Proceedings of the 2015 IEEE 26th International Conference on Application-Specific Systems, Architectures and Processors (ASAP). IEEE, 82--89.Google ScholarCross Ref
Alexander Monakov, Anton Lokhmotov, and Arutyun Avetisyan. 2010. Automatically tuning sparse matrix-vector multiplication for GPU architectures. In High Performance Embedded Architectures and Compilers. Springer, 111--125. Google ScholarDigital Library
CUDA NVIDIA. 2014. CUSPARSE library. (2014). NVIDIA Corporation, Santa Clara, California.Google Scholar
Juan C. Pichel, Francisco F. Rivera, Marcos Fernández, and Aurelio Rodríguez. 2012. Optimization of sparse matrix--vector multiplication using reordering techniques on GPUs. Microprocessors and Microsystems 36, 2 (2012), 65--77. Google ScholarDigital Library
Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens. 2007. Scan primitives for GPU computing. In Graphics Hardware, Vol. 2007. 97--106. Google ScholarDigital Library
John E. Stone, David Gohara, and Guochun Shi. 2010. OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in Science 8 Engineering 12, 1--3 (2010), 66--73. Google ScholarDigital Library
Bor-Yiing Su and Kurt Keutzer. 2012. clSpMV: A cross-platform OpenCL SpMV framework on GPUs. In Proceedings of the 26th ACM International Conference on Supercomputing. ACM, 353--364. Google ScholarDigital Library
Wai Teng Tang, Wen Jun Tan, Rajarshi Ray, Yi Wen Wong, Weiguang Chen, Shyh-hao Kuo, Rick Siow Mong Goh, Stephen John Turner, and Weng-Fai Wong. 2013. Accelerating sparse matrix-vector multiplication on GPUs using bit-representation-optimized schemes. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 26. Google ScholarDigital Library
Sain-Zee Ueng, Melvin Lathara, Sara S. Baghsorkhi, and W. Hwu Wen-mei. 2008. CUDA-lite: Reducing GPU programming complexity. In Languages and Compilers for Parallel Computing. Springer, 1--15. Google ScholarDigital Library
Francisco Vázquez, José-Jesús Fernández, and Ester M. Garzón. 2011. A new approach for sparse matrix vector product on NVIDIA GPUs. Concurrency and Computation: Practice and Experience 23, 8 (2011), 815--826. Google ScholarDigital Library
Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James Demmel. 2009. Optimization of sparse matrix--vector multiplication on emerging multicore platforms. Parallel Comput. 35, 3 (2009), 178--194. Google ScholarDigital Library
Shengen Yan, Guoping Long, and Yunquan Zhang. 2013. StreamScan: Fast scan algorithms for GPUs without global barrier synchronization. In ACM SIGPLAN Notices, Vol. 48. ACM, 229--238. Google ScholarDigital Library

Index Terms

A Cross-Platform SpMV Framework on Many-Core Architectures
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
2. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel algorithms

Recommendations

yaSpMV: yet another SpMV framework on GPUs
PPoPP '14

SpMV is a key linear algebra algorithm and has been widely used in many important application domains. As a result, numerous attempts have been made to optimize SpMV on GPUs to leverage their massive computational throughput. Although the previous work ...
Read More
yaSpMV: yet another SpMV framework on GPUs
PPoPP '14: Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming

SpMV is a key linear algebra algorithm and has been widely used in many important application domains. As a result, numerous attempts have been made to optimize SpMV on GPUs to leverage their massive computational throughput. Although the previous work ...
Read More
Performance analysis of the OP2 framework on many-core architectures
Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)

We present a performance analysis and benchmarking study of the OP2 "active" library, which provides an abstraction framework for the solution of parallel unstructured mesh applications. OP2 aims to decouple the scientific specification of the ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM Transactions on Architecture and Code Optimization Volume 13, Issue 4
December 2016
648 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3012405
Editor:
Koen De Bosschere
Ghent University
Issue’s Table of Contents
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 25 October 2016
- Revised: 1 August 2016
- Accepted: 1 August 2016
- Received: 1 February 2016
Published in taco Volume 13, Issue 4

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
BCCOO
CUDA
GPU
Intel MIC
OpenCL
SpMV
parallel algorithms
segmented scan
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 42
  Total Citations
  View Citations
- 692
  Total Downloads
- Downloads (Last 12 months)64
- Downloads (Last 6 weeks)12
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.