A Cross-Platform SpMV Framework on Many-Core Architectures

Published: 25 October 2016

Abstract

Sparse matrix-vector multiplication (SpMV) is a key operation in engineering and scientific computing. Although previous work has shown impressive progress in optimizing SpMV on many-core architectures, load imbalance and high memory-bandwidth demand remain the critical performance bottlenecks. We present novel solutions to these problems for both GPU and Intel MIC many-core architectures. First, we devise a new SpMV format, called Blocked Compressed Common Coordinate (BCCOO). BCCOO extends the blocked Common Coordinate (COO) format by using bit flags to store the row indices, which alleviates the bandwidth problem. We further improve this format by partitioning the matrix into vertical slices for better data locality. Second, to address the load imbalance problem, we propose a highly efficient matrix-based segmented sum/scan algorithm for SpMV that eliminates global synchronization. Finally, we introduce an autotuning framework to choose the optimization parameters. Experimental results show that our proposed framework has a significant advantage over existing SpMV libraries. In single precision, our scheme outperforms the clSpMV COCKTAIL format by 255% on average on an AMD FirePro W8000, and outperforms CUSPARSE V7.0 by 73.7% and CSR5 by 53.6% on average on a GeForce Titan X. In double precision, it outperforms CUSPARSE V7.0 by 34.0% and CSR5 by 16.2% on average on a Tesla K20, and matches the performance of CSR5 on Intel MIC.
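The two core ideas of the abstract, replacing explicit COO row indices with bit flags and computing the result as a segmented sum over those flags, can be illustrated with a small serial sketch. This is a minimal NumPy illustration under assumed conventions: the function names (`row_bit_flags`, `spmv_segmented`) and the exact flag encoding are hypothetical, and the paper's additional blocking and vertical slicing are omitted.

```python
import numpy as np

def row_bit_flags(rows):
    """Replace explicit COO row indices with one bit per nonzero.

    flag[i] = 1 while element i stays inside its row, 0 at the last
    nonzero of each row (one plausible encoding; the paper packs these
    bits far more compactly than a uint8 array).
    """
    rows = np.asarray(rows)
    flags = np.empty(len(rows), dtype=np.uint8)
    flags[:-1] = (rows[:-1] == rows[1:]).astype(np.uint8)
    flags[-1] = 0  # the final nonzero always terminates its row
    return flags

def spmv_segmented(rows, cols, vals, flags, x, n_rows):
    """Serial SpMV expressed as a segmented sum over the flag stream.

    Assumes nonzeros are sorted by row, as in COO-derived formats.
    """
    x = np.asarray(x)
    y = np.zeros(n_rows)
    acc = 0.0
    for r, c, v, f in zip(rows, cols, vals, flags):
        acc += v * x[c]
        if f == 0:      # row boundary reached: flush the partial sum
            y[r] = acc
            acc = 0.0
    return y
```

On a GPU, each thread would own a contiguous chunk of nonzeros and partial sums would be combined across chunk boundaries, which is where a synchronization-free segmented scan (rather than this serial loop) becomes essential.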
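The autotuning step can likewise be pictured as an exhaustive benchmark over candidate configurations. This is only a sketch of the general pattern: the `run_spmv` callable and the shape of the parameter dictionaries are hypothetical stand-ins for the paper's actual tuning knobs (e.g., block size and slice width).

```python
import time

def autotune(candidates, run_spmv, repeats=3):
    """Return the candidate configuration with the lowest measured time.

    candidates: iterable of parameter dicts (hypothetical knobs).
    run_spmv:   callable(params) executing one SpMV with those params.
    """
    best_params, best_time = None, float("inf")
    for params in candidates:
        t0 = time.perf_counter()
        for _ in range(repeats):
            run_spmv(params)
        elapsed = (time.perf_counter() - t0) / repeats
        if elapsed < best_time:
            best_params, best_time = params, elapsed
    return best_params
```

Real autotuners typically prune this search with a performance model rather than timing every configuration, but the measure-and-keep-the-best loop is the essence.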

References

  1. Muthu Manikandan Baskaran and Rajesh Bordawekar. 2008. Optimizing sparse matrix-vector multiplication on GPUs using compile-time and run-time strategies. IBM Research Report RC24704 (W0812-047) (2008).
  2. Nathan Bell and Michael Garland. 2009. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. ACM, 18.
  3. Guy E. Blelloch. 1989. Scans as primitive parallel operations. IEEE Transactions on Computers 38, 11 (1989), 1526--1538.
  4. Guy E. Blelloch, Michael A. Heroux, and Marco Zagha. 1993. Segmented Operations for Sparse Matrix Computation on Vector Multiprocessors. Technical Report. DTIC Document.
  5. Jeff Bolz, Ian Farmer, Eitan Grinspun, and Peter Schröder. 2003. Sparse matrix solvers on the GPU: Conjugate gradients and multigrid. In ACM Transactions on Graphics (TOG), Vol. 22. ACM, 917--924.
  6. Aydın Buluç, Samuel Williams, Leonid Oliker, and James Demmel. 2011. Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication. In Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS). IEEE, 721--733.
  7. Jee W. Choi, Amik Singh, and Richard W. Vuduc. 2010. Model-driven autotuning of sparse matrix-vector multiply on GPUs. In ACM SIGPLAN Notices, Vol. 45. ACM, 115--126.
  8. Mayank Daga and Joseph L. Greathouse. 2015. Structural agnostic SpMV: Adapting CSR-adaptive for irregular matrices. In Proceedings of the 2015 IEEE 22nd International Conference on High Performance Computing (HiPC). IEEE, 64--74.
  9. Timothy A. Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS) 38, 1 (2011), 1.
  10. Yuri Dotsenko, Naga K. Govindaraju, Peter-Pike Sloan, Charles Boyd, and John Manferdelli. 2008. Fast scan algorithms on graphics processors. In Proceedings of the 22nd Annual International Conference on Supercomputing. ACM, 205--213.
  11. Joseph L. Greathouse and Mayank Daga. 2014. Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format. In Proceedings of SC14: The International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 769--780.
  12. Mark Harris, John Owens, Shubho Sengupta, Yao Zhang, and Andrew Davidson. 2007. CUDPP: CUDA data parallel primitives library. (2007).
  13. Zbigniew Koza, Maciej Matyka, Sebastian Szkoda, and Łukasz Mirosław. 2012. Compressed Multiple-row Storage Format. Technical Report.
  14. Moritz Kreutzer, Georg Hager, Gerhard Wellein, Holger Fehske, and Alan R. Bishop. 2014. A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units. SIAM Journal on Scientific Computing 36, 5 (2014), C401--C423.
  15. ShiGang Li, ChangJun Hu, JunChao Zhang, and YunQuan Zhang. 2015. Automatic tuning of sparse matrix-vector multiplication on multicore clusters. Science China Information Sciences 58, 9 (2015), 1--14.
  16. Weifeng Liu and Brian Vinter. 2015. CSR5: An efficient storage format for cross-platform sparse matrix-vector multiplication. In Proceedings of the 29th ACM International Conference on Supercomputing. ACM, 339--350.
  17. Xing Liu, Mikhail Smelyanskiy, Edmond Chow, and Pradeep Dubey. 2013. Efficient sparse matrix-vector multiplication on x86-based many-core processors. In Proceedings of the 27th International ACM Conference on Supercomputing. ACM, 273--282.
  18. Yongchao Liu and Bertil Schmidt. 2015. LightSpMV: Faster CSR-based sparse matrix-vector multiplication on CUDA-enabled GPUs. In Proceedings of the 2015 IEEE 26th International Conference on Application-Specific Systems, Architectures and Processors (ASAP). IEEE, 82--89.
  19. Alexander Monakov, Anton Lokhmotov, and Arutyun Avetisyan. 2010. Automatically tuning sparse matrix-vector multiplication for GPU architectures. In High Performance Embedded Architectures and Compilers. Springer, 111--125.
  20. NVIDIA. 2014. CUSPARSE library. (2014). NVIDIA Corporation, Santa Clara, California.
  21. Juan C. Pichel, Francisco F. Rivera, Marcos Fernández, and Aurelio Rodríguez. 2012. Optimization of sparse matrix-vector multiplication using reordering techniques on GPUs. Microprocessors and Microsystems 36, 2 (2012), 65--77.
  22. Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens. 2007. Scan primitives for GPU computing. In Graphics Hardware, Vol. 2007. 97--106.
  23. John E. Stone, David Gohara, and Guochun Shi. 2010. OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering 12, 3 (2010), 66--73.
  24. Bor-Yiing Su and Kurt Keutzer. 2012. clSpMV: A cross-platform OpenCL SpMV framework on GPUs. In Proceedings of the 26th ACM International Conference on Supercomputing. ACM, 353--364.
  25. Wai Teng Tang, Wen Jun Tan, Rajarshi Ray, Yi Wen Wong, Weiguang Chen, Shyh-hao Kuo, Rick Siow Mong Goh, Stephen John Turner, and Weng-Fai Wong. 2013. Accelerating sparse matrix-vector multiplication on GPUs using bit-representation-optimized schemes. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 26.
  26. Sain-Zee Ueng, Melvin Lathara, Sara S. Baghsorkhi, and Wen-mei W. Hwu. 2008. CUDA-lite: Reducing GPU programming complexity. In Languages and Compilers for Parallel Computing. Springer, 1--15.
  27. Francisco Vázquez, José-Jesús Fernández, and Ester M. Garzón. 2011. A new approach for sparse matrix vector product on NVIDIA GPUs. Concurrency and Computation: Practice and Experience 23, 8 (2011), 815--826.
  28. Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James Demmel. 2009. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Computing 35, 3 (2009), 178--194.
  29. Shengen Yan, Guoping Long, and Yunquan Zhang. 2013. StreamScan: Fast scan algorithms for GPUs without global barrier synchronization. In ACM SIGPLAN Notices, Vol. 48. ACM, 229--238.


• Published in

  ACM Transactions on Architecture and Code Optimization, Volume 13, Issue 4 (December 2016), 648 pages.
  ISSN: 1544-3566; EISSN: 1544-3973; DOI: 10.1145/3012405

  Copyright © 2016 ACM

  Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

  Publisher: Association for Computing Machinery, New York, NY, United States

  Publication History: Received 1 February 2016; Revised 1 August 2016; Accepted 1 August 2016; Published 25 October 2016.

  Qualifiers: refereed research-article
