research-article

swSpTRSV: a fast sparse triangular solve with sparse level tile layout on sunway architectures

Authors:
Xinliang Wang

Tsinghua University, China

Tsinghua University, China
View Profile

,
Weifeng Liu

Norwegian University of Science and Technology, Norway

Norwegian University of Science and Technology, Norway
View Profile

,
Wei Xue

Tsinghua University, China

Tsinghua University, China
View Profile

,
Li Wu

Tsinghua University, China

Tsinghua University, China
View Profile

Authors Info & Claims

ACM SIGPLAN Notices Volume 53 Issue 1January 2018pp 338–353https://doi.org/10.1145/3200691.3178513

Published:10 February 2018Publication History

ACM SIGPLAN Notices

Abstract

Sparse triangular solve (SpTRSV) is one of the most important kernels in many real-world applications. Currently, much research on parallel SpTRSV focuses on level-set construction for reducing the number of inter-level synchronizations. However, the out-of-control data reuse and high cost for global memory or shared cache access in inter-level synchronization have been largely neglected in existing work.

In this paper, we propose a novel data layout called Sparse Level Tile to make all data reuse under control, and design a Producer-Consumer pairing method to make any inter-level synchronization only happen in very fast register communication. We implement our data layout and algorithms on an SW26010 many-core processor, which is the main building-block of the current world fastest supercomputer Sunway Taihulight. The experimental results of testing all 2057 square matrices from the Florida Matrix Collection show that our method achieves an average speedup of 6.9 and the best speedup of 38.5 over parallel level-set method. Our method also outperforms the latest methods on a KNC many-core processor in 1856 matrices and the latest methods on a K80 GPU in 1672 matrices, respectively.

Supplemental Material

Available for Download

zip

swsptrsv-swsptrsv_1.0.zip (2.7 MB)

Scripts, Source Files

References

2017. https://www.top500.org/. (2017).Google Scholar
Edward Anderson and Youcef Saad. 1989. Solving sparse triangular linear systems on parallel computers. International Journal of High Speed Computing 1, 01 (1989), 73--95. Google ScholarDigital Library
Yulong Ao, Chao Yang, Xinliang Wang, Wei Xue, Haohuan Fu, Fang-fang Liu, Lin Gan, Ping Xu, and Wenjing Ma. 2017. 26 PFLOPS Stencil Computations for Atmospheric Modeling on Sunway TaihuLight. In Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International. IEEE, 535--544.Google Scholar
Martin Bättig and Thomas R. Gross. 2017. Synchronized-by-Default Concurrency for Shared-Memory Systems. In Proceedings of the 22Nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17). 299--312. Google ScholarDigital Library
Andrew M. Bradley. 2016. A Hybrid Multithreaded Direct Sparse Triangular Solver. In 2016 Proceedings of the Seventh SIAM Workshop on Combinatorial Scientific Computing. 13--22.Google ScholarCross Ref
Nathan G. Bronson, Jared Casper, Hassan Chafi, and Kunle Olukotun. 2010. A Practical Concurrent Binary Search Tree. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '10). 257--268. Google ScholarDigital Library
R. Das, M. Uysal, J. Saltz, and Y.S. Hwang. 1994. Communication Optimizations for Irregular Scientific Computations on Distributed Memory Architectures. J. Parallel and Distrib. Comput. 22, 3 (1994), 462 -- 478. Google ScholarDigital Library
Timothy A. Davis and Yifan Hu. 2011. The University of Florida Sparse Matrix Collection. ACM Trans. Math. Softw. 38, 1 (2011), 1:1--1:25. Google ScholarDigital Library
Jack Dongarra. 2016. Report on the sunway taihulight system. www.netlib.org. Retrieved June 20 (2016).Google Scholar
Alexandre X. Duchateau, David Padua, and Denis Barthou. 2013. Hydra: Automatic Algorithm Exploration from Linear Algebra Equations. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) (CGO '13). 1--10. Google ScholarDigital Library
Iain S. Duff, Albert M. Erisman, and John K. Reid. 2017. Direct Methods for Sparse Matrices (2nd ed.). Oxford University Press, Inc.Google Scholar
Jiarui Fang, Haohuan Fu, Wenlai Zhao, Bingwei Chen, Weijie Zheng, and Guangwen Yang. 2017. swDNN: A Library for Accelerating Deep Learning Applications on Sunway TaihuLight. In Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International. IEEE, 615--624.Google ScholarCross Ref
Haohuan Fu, Junfeng Liao, Jinzhe Yang, Lanning Wang, Zhenya Song, Xiaomeng Huang, Chao Yang, Wei Xue, Fangfang Liu, Fangli Qiao, et al. 2016. The Sunway TaihuLight supercomputer: system and applications. Science China Information Sciences 59, 7 (2016), 072001.Google ScholarCross Ref
Elad Gidron, Idit Keidar, Dmitri Perelman, and Yonathan Perez. 2012. SALSA: Scalable and Low Synchronization NUMA-aware Algorithm for Producer-consumer Pools. In Proceedings of the Twenty-fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '12). 151--160. Google ScholarDigital Library
Hwansoo Han and Chau-Wen Tseng. 2006. Exploiting locality for irregular scientific codes. IEEE Transactions on Parallel and Distributed Systems 17, 7 (2006), 606--618. Google ScholarDigital Library
Kaixi Hou, Weifeng Liu, Hao Wang, and Wu-chun Feng. 2017. Fast Segmented Sort on GPUs. In Proceedings of the International Conference on Supercomputing (ICS '17). Article 12, 12:1--12:10 pages. Google ScholarDigital Library
Eun-Jin Im and Katherine Yelick. 1998. Model-based memory hierarchy optimizations for sparse matrices. In Workshop on Profile and Feedback-Directed Compilation, Vol. 139.Google Scholar
Lijuang Jiang, Chao Yang, Yulong Ao, Wanwang Yin, Wenjing Ma, Qiao Sun, Fangfang Liu, Rongfen Lin, and Peng Zhang. 2017. Towards highly efficient DGEMM on the emerging SW26010 many-core processor. In International Conference on Parallel Processing (ICPP), 2017 IEEE International. IEEE.Google ScholarCross Ref
Humayun Kabir, Joshua Dennis Booth, Guillaume Aupy, Anne Benoit, Yves Robert, and Padma Raghavan. 2015. STS-k: a multilevel sparse triangular solution scheme for NUMA multicores. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15). ACM, 55. Google ScholarDigital Library
R. Kaleem, A. Venkat, S. Pai, M. Hall, and K. Pingali. 2016. Synchronization Trade-Offs in GPU Implementations of Graph Algorithms. In 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 514--523.Google Scholar
Saurabh Kalikar and Rupesh Nasre. 2016. DomLock: A New Multi-granularity Locking Technique for Hierarchies. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '16). 23:1--23:12. Google ScholarDigital Library
Alex Kogan and Maurice Herlihy. 2014. The Future(s) of Shared Data Structures. In Proceedings of the 2014 ACM Symposium on Principles of Distributed Computing (PODC '14). 30--39. Google ScholarDigital Library
Ang Li, Weifeng Liu, Mads R. B. Kristensen, Brian Vinter, Hao Wang, Kaixi Hou, Andres Marquez, and Shuaiwen Leon Song. 2017. Exploring and Analyzing the Real Impact of Modern On-package Memory on HPC Scientific Kernels. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '17). 26:1--26:14. Google ScholarDigital Library
Jiajia Li, Guangming Tan, Mingyu Chen, and Ninghui Sun. 2013. SMAT: An Input Adaptive Auto-tuner for Sparse Matrix-vector Multiplication. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '13). 117--126. Google ScholarDigital Library
Ruipeng Li and Yousef Saad. 2013. GPU-accelerated preconditioned iterative linear solvers. The Journal of Supercomputing 63, 2 (2013), 443--466. Google ScholarDigital Library
Heng Lin, Xiongchao Tang, Bowen Yu, Youwei Zhuo, Wenguang Chen, Jidong Zhai, Wanwang Yin, and Weimin Zheng. 2017. Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores. In Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International. IEEE, 635--645.Google Scholar
James Lin, Zhigeng Xu, Akira Nukada, Naoya Maruyama, and Satoshi Matsuoka. 2017. Optimizations of Two Compute-bound Scientific Kernels on the SW26010 Many-core Processor. In International Conference on Parallel Processing (ICPP), 2017 IEEE International. IEEE.Google ScholarCross Ref
Weifeng Liu. 2015. Parallel and Scalable Sparse Basic Linear Algebra Subprograms. Ph.D. Dissertation. University of Copenhagen.Google Scholar
Weifeng Liu, Ang Li, Jonathan Hogg, Iain S. Duff, and Brian Vinter. 2016. A Synchronization-Free Algorithm for Parallel Sparse Triangular Solves. In European Conference on Parallel Processing. 617--630. Google ScholarDigital Library
Weifeng Liu, Ang Li, Jonathan D. Hogg, Iain S. Duff, and Brian Vinter. 2017. Fast Synchronization-Free Algorithms for Parallel Sparse Triangular Solves with Multiple Right-Hand Sides. Concurrency and Computation: Practice and Experience 29, 21 (2017), e4244-n/a.Google ScholarCross Ref
Weifeng Liu and Brian Vinter. 2015. CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15). 339--350. Google ScholarDigital Library
Weifeng Liu and Brian Vinter. 2015. A Framework for General Sparse Matrix-Matrix Multiplication on GPUs and Heterogeneous Processors. J. Parallel and Distrib. Comput. 85, C (Nov. 2015), 47--61. Google ScholarDigital Library
Weifeng Liu and Brian Vinter. 2015. Speculative Segmented Sum for Sparse Matrix-vector Multiplication on Heterogeneous Processors. Parallel Comput. 49, C (Nov. 2015), 179--193. Google ScholarDigital Library
Zhiyu Liu, Irina Calciu, Maurice Herlihy, and Onur Mutlu. 2017. Concurrent Data Structures for Near-Memory Computing. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '17). 235--245. Google ScholarDigital Library
Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. 2010. GraphLab: A New Parallel Framework for Machine Learning. In Conference on Uncertainty in Artificial Intelligence (UAI). Google ScholarDigital Library
Zoltan Majo and Thomas R. Gross. 2017. A Library for Portable and Composable Data Locality Optimizations for NUMA Systems. ACM Trans. Parallel Comput. 3, 4 (March 2017), 20:1--20:32. Google ScholarDigital Library
Jan Mayer. 2009. Parallel algorithms for solving linear systems with sparse triangular matrices. Computing 86, 4 (2009), 291--312. Google ScholarDigital Library
Adam Morrison. 2016. Scaling Synchronization in Multicore Programs. Commun. ACM 59, 11 (2016), 44--51. Google ScholarDigital Library
Maxim Naumov. 2011. Parallel solution of sparse triangular linear systems in the preconditioned iterative methods on the GPU. NVIDIA Corp., Westford, MA, USA, Tech. Rep. NVR-2011 1 (2011).Google Scholar
Jongsoo Park, Mikhail Smelyanskiy, Narayanan Sundaram, and Pradeep Dubey. 2014. Sparsifying synchronization for high-performance shared-memory sparse triangular solver. In International Supercomputing Conference. 124--140. Google ScholarDigital Library
A. Picciau, G. E. Inggs, J. Wickerson, E. C. Kerrigan, and G. A. Constantinides. 2016. Balancing Locality and Concurrency: Solving Sparse Triangular Systems on GPUs. In 2016 IEEE 23rd International Conference on High Performance Computing (HiPC). 183--192.Google Scholar
I. Z. Reguly, G. R. Mudalige, C. Bertolli, M. B. Giles, A. Betts, P. H. J. Kelly, and D. Radford. 2016. Acceleration of a Full-Scale Industrial CFD Application with OP2. IEEE Transactions on Parallel and Distributed Systems 27, 5 (May 2016), 1265--1278. Google ScholarDigital Library
B. Ren, G. Agrawal, J. R. Larus, T. Mytkowicz, T. Poutanen, and W. Schulte. 2013. SIMD parallelization of applications that traverse irregular data structures. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 1--10. Google ScholarDigital Library
Yousef Saad. 2003. Iterative methods for sparse linear systems. Siam. Google ScholarDigital Library
Joel H Saltz. 1990. Aggregation methods for solving sparse triangular systems on multiprocessors. SIAM journal on scientific and statistical computing 11, 1 (1990), 123--144.Google Scholar
Robert Schreiber and Wei-Pei Tang. 1982. Vectorizing the conjugate gradient method. Unpublished manuscript, Department of Computer Science, Stanford University (1982).Google Scholar
Michelle Mills Strout, Larry Carter, and Jeanne Ferrante. 2003. Compile-time Composition of Run-time Data and Iteration Reorderings. In Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation (PLDI '03). 91--102. Google ScholarDigital Library
Michelle Mills Strout, Larry Carter, Jeanne Ferrante, Jonathan Freeman, and Barbara Kreaseck. 2002. Combining Performance Aspects of Irregular Gauss-Seidel Via Sparse Tiling. In Languages and Compilers for Parallel Computing: 15th Workshop, LCPC 2002, College Park, MD, USA, July 25-27, 2002. Revised Papers, Bill Pugh and Chau-Wen Tseng (Eds.). 90--110. Google ScholarDigital Library
Michelle Mills Strout, Larry Carter, Jeanne Ferrante, and Barbara Kreaseck. 2004. Sparse Tiling for Stationary Iterative Methods. The International Journal of High Performance Computing Applications 18, 1 (2004), 95--113. Google ScholarDigital Library
Michelle Mills Strout, Alan LaMielle, Larry Carter, Jeanne Ferrante, Barbara Kreaseck, and Catherine Olschanowsky. 2016. An Approach for Code Generation in the Sparse Polyhedral Framework. Parallel Comput. 53, C (April 2016), 32--57. Google ScholarDigital Library
Brad Suchoski, Caleb Severn, Manu Shantharam, and Padma Raghavan. 2012. Adapting sparse triangular solution to GPUs. In 2012 41st International Conference on Parallel Processing Workshops. IEEE, 140--148. Google ScholarDigital Library
Guangming Tan, Junhong Liu, and Jiajia Li. 2018. Design and Implementation of Adaptive SpMV Library for Multicore and Manycore Architecture. ACM Trans. Math. Softw. (2018).Google Scholar
D. Unat, A. Dubey, T. Hoefler, J. Shalf, M. Abraham, M. Bianco, B. L. Chamberlain, R. Cledat, H. C. Edwards, H. Finkel, K. Fuerlinger, F. Hannig, E. Jeannot, A. Kamil, J. Keasler, P. H. J. Kelly, V. Leung, H. Ltaief, N. Maruyama, C. J. Newburn, and M. Pericas. 2017. Trends in Data Locality Abstractions for HPC Systems. IEEE Transactions on Parallel and Distributed Systems PP, 99 (2017), 1--1.Google Scholar
Anand Venkat, Mary Hall, and Michelle Strout. 2015. Loop and Data Transformations for Sparse Matrix Code. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '15). 521--532. Google ScholarDigital Library
Anand Venkat, Mahdi Soltan Mohammadi, Jongsoo Park, Hongbo Rong, Rajkishore Barik, Michelle Mills Strout, and Mary Hall. 2016. Automating Wavefront Parallelization for Sparse Matrix Computations. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '16). 41:1--41:12. Google ScholarDigital Library
Richard Vuduc. 2003. Automatic Performance Tuning of Sparse Matrix Kernels. Ph.D. Dissertation. University of California, Berkeley. Google ScholarDigital Library
Richard Vuduc, Aparna Chandramowlishwaran, Jee Choi, Murat Guney, and Aashay Shringarpure. 2010. On the Limits of GPU Acceleration. In Proceedings of the 2Nd USENIX Conference on Hot Topics in Parallelism (HotPar'10). 13--13. Google ScholarDigital Library
Richard Vuduc, James W. Demmel, Katherine A. Yelick, Shoaib Kamil, Rajesh Nishtala, and Benjamin Lee. 2002. Performance Optimizations and Bounds for Sparse Matrix-vector Multiply. In Proceedings of the 2002 ACM/IEEE Conference on Supercomputing (SC '02). 1--35. Google ScholarDigital Library
Richard Vuduc, Shoaib Kamil, Jen Hsu, Rajesh Nishtala, James W Demmel, and Katherine A Yelick. 2002. Automatic Performance Tuning and Analysis of Sparse Triangular Solve. In ICS 2002: Workshop on Performance Optimization via High-Level Languages and Libraries.Google Scholar
Hao Wang, Weifeng Liu, Kaixi Hou, and Wu-chun Feng. 2016. Parallel Transposition of Sparse Data Structures. In Proceedings of the 2016 International Conference on Supercomputing (ICS '16). 33:1--33:13. Google ScholarDigital Library
Xin Wang, Weihua Zhang, Zhaoguo Wang, Ziyun Wei, Haibo Chen, and Wenyun Zhao. 2017. Eunomia: Scaling Concurrent Search Trees under Contention Using HTM. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 385--399. Google ScholarDigital Library
Michael M Wolf, Michael A Heroux, and Erik G Boman. 2010. Factors impacting performance of multithreaded sparse triangular solve. In International Conference on High Performance Computing for Computational Science. Springer, 32--44. Google ScholarDigital Library
Zhigeng Xu, James Lin, and Satoshi Matsuoka. 2017. Benchmarking SW26010 Many-Core Processor. In Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International. IEEE, 743--752.Google ScholarCross Ref
Shengen Yan, Chao Li, Yunquan Zhang, and Huiyang Zhou. 2014. yaSpMV: Yet Another SpMV Framework on GPUs. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '14). 107--118. Google ScholarDigital Library
Shengen Yan, Guoping Long, and Yunquan Zhang. 2013. StreamScan: Fast Scan Algorithms for GPUs Without Global Barrier Synchronization. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '13). 229--238. Google ScholarDigital Library
Chao Yang, Wei Xue, Haohuan Fu, Hongtao You, Xinliang Wang, Yu-long Ao, Fangfang Liu, Lin Gan, Ping Xu, Lanning Wang, et al. 2016. 10M-core scalable fully-implicit solver for nonhydrostatic atmospheric dynamics. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '16). 6:1--6:12. Google ScholarDigital Library
Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, Kai Tian, and Xipeng Shen. 2011. On-the-fly Elimination of Dynamic Irregularities for GPU Computing. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XVI). 369--380. Google ScholarDigital Library
Weihua Zhang, Xin Wang, Shiyu Ji, Ziyun Wei, Zhaoguo Wang, and Haibo Chen. 2017. Scaling Concurrent Index Structures under Contention Using HTM. IEEE Transactions on Parallel and Distributed Systems (2017).Google Scholar
Yuanrui Zhang, Wei Ding, Jun Liu, and Mahmut Kandemir. 2011. Optimizing Data Layouts for Parallel Computation on Multicores. In Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques (PACT '11). 143--154. Google ScholarDigital Library
Yue Zhao, Jiajia Li, Chunhua Liao, and Xipeng Shen. 2018. Bridging the Gap between Deep Learning and Sparse Matrix Format Selection. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '18). Google ScholarDigital Library

Index Terms

swSpTRSV: a fast sparse triangular solve with sparse level tile layout on sunway architectures
1. Theory of computation
  1. Design and analysis of algorithms
    1. Parallel algorithms

Recommendations

Efficient Block Algorithms for Parallel Sparse Triangular Solve
ICPP '20: Proceedings of the 49th International Conference on Parallel Processing

The sparse triangular solve (SpTRSV) kernel is an important building block for a number of linear algebra routines such as sparse direct and iterative solvers. The major challenge of accelerating SpTRSV lies in the difficulties of finding higher ...
Read More
swSpTRSV: a fast sparse triangular solve with sparse level tile layout on sunway architectures
PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Sparse triangular solve (SpTRSV) is one of the most important kernels in many real-world applications. Currently, much research on parallel SpTRSV focuses on level-set construction for reducing the number of inter-level synchronizations. However, the ...
Read More
Towards Efficient SpMV on Sunway Manycore Architectures
ICS '18: Proceedings of the 2018 International Conference on Supercomputing

Sparse Matrix-Vector Multiplication (SpMV) is an essential computation kernel for many data-analytic workloads running in both supercomputers and data centers. The intrinsic irregularity in SpMV is challenging to achieve high performance, especially ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM SIGPLAN Notices Volume 53, Issue 1
PPoPP '18
January 2018
426 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/3200691
Editor:
Matthew Fluet
Rodchester Institude of Technology
Issue’s Table of Contents
PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2018
442 pages
ISBN:9781450349826
DOI:10.1145/3178487
General Chair:
Andreas Krall
Vienna University of Technology, Austria
,
Program Chair:
Thomas R. Gross
ETH Zürich, Switzerland
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 10 February 2018
Check for updates
Badges
Author Tags
sparse level tile
sparse matrix
sparse triangular solve
sunway architecture
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 58
  Total Citations
  View Citations
- 629
  Total Downloads
- Downloads (Last 12 months)58
- Downloads (Last 6 weeks)7
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

swSpTRSV: a fast sparse triangular solve with sparse level tile layout on sunway architectures

ACM SIGPLAN Notices

Abstract

Supplemental Material

Available for Download

References

Cited By

Index Terms

Recommendations

Efficient Block Algorithms for Parallel Sparse Triangular Solve

swSpTRSV: a fast sparse triangular solve with sparse level tile layout on sunway architectures

Towards Efficient SpMV on Sunway Manycore Architectures