ABSTRACT
Important workloads, such as machine learning and graph analytics applications, heavily involve sparse linear algebra operations. These operations use sparse matrix compression as an effective means to avoid storing zeros and performing unnecessary computation on zero elements. However, compression techniques like Compressed Sparse Row (CSR) that are widely used today introduce significant instruction overhead and expensive pointer-chasing operations to discover the positions of the non-zero elements. In this paper, we identify the discovery of the positions (i.e., indexing) of non-zero elements as a key bottleneck in sparse matrix-based workloads, which greatly reduces the benefits of compression.
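To make the indexing overhead concrete, the following sketch (illustrative, not code from the paper) shows a textbook CSR SpMV kernel. The `row_ptr` and `col_idx` arrays must be traversed just to discover where each non-zero lives, which is the per-element overhead the paper identifies.

```python
# Illustrative CSR SpMV: CSR stores three arrays -- values, column
# indices, and row pointers. Every non-zero costs extra index loads
# (row_ptr and col_idx lookups) before the actual multiply-add.

def csr_spmv(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(row_ptr) - 1):
        # The row_ptr reads and col_idx loads below are pure indexing
        # work: they locate non-zeros rather than compute on them.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

# 3x3 example: A = [[5,0,0],[0,0,8],[0,3,0]]
row_ptr = [0, 1, 2, 3]
col_idx = [0, 2, 1]
vals    = [5.0, 8.0, 3.0]
print(csr_spmv(row_ptr, col_idx, vals, [1.0, 2.0, 3.0]))  # [5.0, 24.0, 6.0]
```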
We propose SMASH, a hardware-software cooperative mechanism that enables highly efficient indexing and storage of sparse matrices. The key idea of SMASH is to explicitly enable the hardware to recognize and exploit sparsity in data. To this end, we devise a novel software encoding based on a hierarchy of bitmaps. This encoding can be used to efficiently compress any sparse matrix, regardless of the extent and structure of sparsity. At the same time, the bitmap encoding can be directly interpreted by the hardware. We design a lightweight hardware unit, the Bitmap Management Unit (BMU), that buffers and scans the bitmap hierarchy to perform highly efficient indexing of sparse matrices. SMASH exposes an expressive and rich ISA to communicate with the BMU, which enables its use in accelerating any sparse matrix computation.
We demonstrate the benefits of SMASH on four use cases that include sparse matrix kernels and graph analytics applications. Our evaluations show that SMASH provides average performance improvements of 38% for Sparse Matrix Vector Multiplication and 44% for Sparse Matrix Matrix Multiplication, over a state-of-the-art CSR implementation, on a wide variety of matrices with different characteristics. SMASH incurs a very modest hardware area overhead of up to 0.076% of an out-of-order CPU core.