Abstract
Sparse triangular solve (SpTRSV) is one of the most important kernels in many real-world applications. Currently, much research on parallel SpTRSV focuses on level-set construction for reducing the number of inter-level synchronizations. However, the out-of-control data reuse and high cost for global memory or shared cache access in inter-level synchronization have been largely neglected in existing work.
In this paper, we propose a novel data layout called Sparse Level Tile to make all data reuse under control, and design a Producer-Consumer pairing method to make any inter-level synchronization only happen in very fast register communication. We implement our data layout and algorithms on an SW26010 many-core processor, which is the main building-block of the current world fastest supercomputer Sunway Taihulight. The experimental results of testing all 2057 square matrices from the Florida Matrix Collection show that our method achieves an average speedup of 6.9 and the best speedup of 38.5 over parallel level-set method. Our method also outperforms the latest methods on a KNC many-core processor in 1856 matrices and the latest methods on a K80 GPU in 1672 matrices, respectively.
Supplemental Material
Available for Download
Scripts, Source Files
- 2017. https://www.top500.org/. (2017).Google Scholar
- Edward Anderson and Youcef Saad. 1989. Solving sparse triangular linear systems on parallel computers. International Journal of High Speed Computing 1, 01 (1989), 73--95. Google ScholarDigital Library
- Yulong Ao, Chao Yang, Xinliang Wang, Wei Xue, Haohuan Fu, Fang-fang Liu, Lin Gan, Ping Xu, and Wenjing Ma. 2017. 26 PFLOPS Stencil Computations for Atmospheric Modeling on Sunway TaihuLight. In Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International. IEEE, 535--544.Google Scholar
- Martin Bättig and Thomas R. Gross. 2017. Synchronized-by-Default Concurrency for Shared-Memory Systems. In Proceedings of the 22Nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17). 299--312. Google ScholarDigital Library
- Andrew M. Bradley. 2016. A Hybrid Multithreaded Direct Sparse Triangular Solver. In 2016 Proceedings of the Seventh SIAM Workshop on Combinatorial Scientific Computing. 13--22.Google ScholarCross Ref
- Nathan G. Bronson, Jared Casper, Hassan Chafi, and Kunle Olukotun. 2010. A Practical Concurrent Binary Search Tree. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '10). 257--268. Google ScholarDigital Library
- R. Das, M. Uysal, J. Saltz, and Y.S. Hwang. 1994. Communication Optimizations for Irregular Scientific Computations on Distributed Memory Architectures. J. Parallel and Distrib. Comput. 22, 3 (1994), 462 -- 478. Google ScholarDigital Library
- Timothy A. Davis and Yifan Hu. 2011. The University of Florida Sparse Matrix Collection. ACM Trans. Math. Softw. 38, 1 (2011), 1:1--1:25. Google ScholarDigital Library
- Jack Dongarra. 2016. Report on the sunway taihulight system. www.netlib.org. Retrieved June 20 (2016).Google Scholar
- Alexandre X. Duchateau, David Padua, and Denis Barthou. 2013. Hydra: Automatic Algorithm Exploration from Linear Algebra Equations. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) (CGO '13). 1--10. Google ScholarDigital Library
- Iain S. Duff, Albert M. Erisman, and John K. Reid. 2017. Direct Methods for Sparse Matrices (2nd ed.). Oxford University Press, Inc.Google Scholar
- Jiarui Fang, Haohuan Fu, Wenlai Zhao, Bingwei Chen, Weijie Zheng, and Guangwen Yang. 2017. swDNN: A Library for Accelerating Deep Learning Applications on Sunway TaihuLight. In Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International. IEEE, 615--624.Google ScholarCross Ref
- Haohuan Fu, Junfeng Liao, Jinzhe Yang, Lanning Wang, Zhenya Song, Xiaomeng Huang, Chao Yang, Wei Xue, Fangfang Liu, Fangli Qiao, et al. 2016. The Sunway TaihuLight supercomputer: system and applications. Science China Information Sciences 59, 7 (2016), 072001.Google ScholarCross Ref
- Elad Gidron, Idit Keidar, Dmitri Perelman, and Yonathan Perez. 2012. SALSA: Scalable and Low Synchronization NUMA-aware Algorithm for Producer-consumer Pools. In Proceedings of the Twenty-fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '12). 151--160. Google ScholarDigital Library
- Hwansoo Han and Chau-Wen Tseng. 2006. Exploiting locality for irregular scientific codes. IEEE Transactions on Parallel and Distributed Systems 17, 7 (2006), 606--618. Google ScholarDigital Library
- Kaixi Hou, Weifeng Liu, Hao Wang, and Wu-chun Feng. 2017. Fast Segmented Sort on GPUs. In Proceedings of the International Conference on Supercomputing (ICS '17). Article 12, 12:1--12:10 pages. Google ScholarDigital Library
- Eun-Jin Im and Katherine Yelick. 1998. Model-based memory hierarchy optimizations for sparse matrices. In Workshop on Profile and Feedback-Directed Compilation, Vol. 139.Google Scholar
- Lijuang Jiang, Chao Yang, Yulong Ao, Wanwang Yin, Wenjing Ma, Qiao Sun, Fangfang Liu, Rongfen Lin, and Peng Zhang. 2017. Towards highly efficient DGEMM on the emerging SW26010 many-core processor. In International Conference on Parallel Processing (ICPP), 2017 IEEE International. IEEE.Google ScholarCross Ref
- Humayun Kabir, Joshua Dennis Booth, Guillaume Aupy, Anne Benoit, Yves Robert, and Padma Raghavan. 2015. STS-k: a multilevel sparse triangular solution scheme for NUMA multicores. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15). ACM, 55. Google ScholarDigital Library
- R. Kaleem, A. Venkat, S. Pai, M. Hall, and K. Pingali. 2016. Synchronization Trade-Offs in GPU Implementations of Graph Algorithms. In 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 514--523.Google Scholar
- Saurabh Kalikar and Rupesh Nasre. 2016. DomLock: A New Multi-granularity Locking Technique for Hierarchies. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '16). 23:1--23:12. Google ScholarDigital Library
- Alex Kogan and Maurice Herlihy. 2014. The Future(s) of Shared Data Structures. In Proceedings of the 2014 ACM Symposium on Principles of Distributed Computing (PODC '14). 30--39. Google ScholarDigital Library
- Ang Li, Weifeng Liu, Mads R. B. Kristensen, Brian Vinter, Hao Wang, Kaixi Hou, Andres Marquez, and Shuaiwen Leon Song. 2017. Exploring and Analyzing the Real Impact of Modern On-package Memory on HPC Scientific Kernels. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '17). 26:1--26:14. Google ScholarDigital Library
- Jiajia Li, Guangming Tan, Mingyu Chen, and Ninghui Sun. 2013. SMAT: An Input Adaptive Auto-tuner for Sparse Matrix-vector Multiplication. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '13). 117--126. Google ScholarDigital Library
- Ruipeng Li and Yousef Saad. 2013. GPU-accelerated preconditioned iterative linear solvers. The Journal of Supercomputing 63, 2 (2013), 443--466. Google ScholarDigital Library
- Heng Lin, Xiongchao Tang, Bowen Yu, Youwei Zhuo, Wenguang Chen, Jidong Zhai, Wanwang Yin, and Weimin Zheng. 2017. Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores. In Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International. IEEE, 635--645.Google Scholar
- James Lin, Zhigeng Xu, Akira Nukada, Naoya Maruyama, and Satoshi Matsuoka. 2017. Optimizations of Two Compute-bound Scientific Kernels on the SW26010 Many-core Processor. In International Conference on Parallel Processing (ICPP), 2017 IEEE International. IEEE.Google ScholarCross Ref
- Weifeng Liu. 2015. Parallel and Scalable Sparse Basic Linear Algebra Subprograms. Ph.D. Dissertation. University of Copenhagen.Google Scholar
- Weifeng Liu, Ang Li, Jonathan Hogg, Iain S. Duff, and Brian Vinter. 2016. A Synchronization-Free Algorithm for Parallel Sparse Triangular Solves. In European Conference on Parallel Processing. 617--630. Google ScholarDigital Library
- Weifeng Liu, Ang Li, Jonathan D. Hogg, Iain S. Duff, and Brian Vinter. 2017. Fast Synchronization-Free Algorithms for Parallel Sparse Triangular Solves with Multiple Right-Hand Sides. Concurrency and Computation: Practice and Experience 29, 21 (2017), e4244-n/a.Google ScholarCross Ref
- Weifeng Liu and Brian Vinter. 2015. CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15). 339--350. Google ScholarDigital Library
- Weifeng Liu and Brian Vinter. 2015. A Framework for General Sparse Matrix-Matrix Multiplication on GPUs and Heterogeneous Processors. J. Parallel and Distrib. Comput. 85, C (Nov. 2015), 47--61. Google ScholarDigital Library
- Weifeng Liu and Brian Vinter. 2015. Speculative Segmented Sum for Sparse Matrix-vector Multiplication on Heterogeneous Processors. Parallel Comput. 49, C (Nov. 2015), 179--193. Google ScholarDigital Library
- Zhiyu Liu, Irina Calciu, Maurice Herlihy, and Onur Mutlu. 2017. Concurrent Data Structures for Near-Memory Computing. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '17). 235--245. Google ScholarDigital Library
- Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. 2010. GraphLab: A New Parallel Framework for Machine Learning. In Conference on Uncertainty in Artificial Intelligence (UAI). Google ScholarDigital Library
- Zoltan Majo and Thomas R. Gross. 2017. A Library for Portable and Composable Data Locality Optimizations for NUMA Systems. ACM Trans. Parallel Comput. 3, 4 (March 2017), 20:1--20:32. Google ScholarDigital Library
- Jan Mayer. 2009. Parallel algorithms for solving linear systems with sparse triangular matrices. Computing 86, 4 (2009), 291--312. Google ScholarDigital Library
- Adam Morrison. 2016. Scaling Synchronization in Multicore Programs. Commun. ACM 59, 11 (2016), 44--51. Google ScholarDigital Library
- Maxim Naumov. 2011. Parallel solution of sparse triangular linear systems in the preconditioned iterative methods on the GPU. NVIDIA Corp., Westford, MA, USA, Tech. Rep. NVR-2011 1 (2011).Google Scholar
- Jongsoo Park, Mikhail Smelyanskiy, Narayanan Sundaram, and Pradeep Dubey. 2014. Sparsifying synchronization for high-performance shared-memory sparse triangular solver. In International Supercomputing Conference. 124--140. Google ScholarDigital Library
- A. Picciau, G. E. Inggs, J. Wickerson, E. C. Kerrigan, and G. A. Constantinides. 2016. Balancing Locality and Concurrency: Solving Sparse Triangular Systems on GPUs. In 2016 IEEE 23rd International Conference on High Performance Computing (HiPC). 183--192.Google Scholar
- I. Z. Reguly, G. R. Mudalige, C. Bertolli, M. B. Giles, A. Betts, P. H. J. Kelly, and D. Radford. 2016. Acceleration of a Full-Scale Industrial CFD Application with OP2. IEEE Transactions on Parallel and Distributed Systems 27, 5 (May 2016), 1265--1278. Google ScholarDigital Library
- B. Ren, G. Agrawal, J. R. Larus, T. Mytkowicz, T. Poutanen, and W. Schulte. 2013. SIMD parallelization of applications that traverse irregular data structures. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 1--10. Google ScholarDigital Library
- Yousef Saad. 2003. Iterative methods for sparse linear systems. Siam. Google ScholarDigital Library
- Joel H Saltz. 1990. Aggregation methods for solving sparse triangular systems on multiprocessors. SIAM journal on scientific and statistical computing 11, 1 (1990), 123--144.Google Scholar
- Robert Schreiber and Wei-Pei Tang. 1982. Vectorizing the conjugate gradient method. Unpublished manuscript, Department of Computer Science, Stanford University (1982).Google Scholar
- Michelle Mills Strout, Larry Carter, and Jeanne Ferrante. 2003. Compile-time Composition of Run-time Data and Iteration Reorderings. In Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation (PLDI '03). 91--102. Google ScholarDigital Library
- Michelle Mills Strout, Larry Carter, Jeanne Ferrante, Jonathan Freeman, and Barbara Kreaseck. 2002. Combining Performance Aspects of Irregular Gauss-Seidel Via Sparse Tiling. In Languages and Compilers for Parallel Computing: 15th Workshop, LCPC 2002, College Park, MD, USA, July 25-27, 2002. Revised Papers, Bill Pugh and Chau-Wen Tseng (Eds.). 90--110. Google ScholarDigital Library
- Michelle Mills Strout, Larry Carter, Jeanne Ferrante, and Barbara Kreaseck. 2004. Sparse Tiling for Stationary Iterative Methods. The International Journal of High Performance Computing Applications 18, 1 (2004), 95--113. Google ScholarDigital Library
- Michelle Mills Strout, Alan LaMielle, Larry Carter, Jeanne Ferrante, Barbara Kreaseck, and Catherine Olschanowsky. 2016. An Approach for Code Generation in the Sparse Polyhedral Framework. Parallel Comput. 53, C (April 2016), 32--57. Google ScholarDigital Library
- Brad Suchoski, Caleb Severn, Manu Shantharam, and Padma Raghavan. 2012. Adapting sparse triangular solution to GPUs. In 2012 41st International Conference on Parallel Processing Workshops. IEEE, 140--148. Google ScholarDigital Library
- Guangming Tan, Junhong Liu, and Jiajia Li. 2018. Design and Implementation of Adaptive SpMV Library for Multicore and Manycore Architecture. ACM Trans. Math. Softw. (2018).Google Scholar
- D. Unat, A. Dubey, T. Hoefler, J. Shalf, M. Abraham, M. Bianco, B. L. Chamberlain, R. Cledat, H. C. Edwards, H. Finkel, K. Fuerlinger, F. Hannig, E. Jeannot, A. Kamil, J. Keasler, P. H. J. Kelly, V. Leung, H. Ltaief, N. Maruyama, C. J. Newburn, and M. Pericas. 2017. Trends in Data Locality Abstractions for HPC Systems. IEEE Transactions on Parallel and Distributed Systems PP, 99 (2017), 1--1.Google Scholar
- Anand Venkat, Mary Hall, and Michelle Strout. 2015. Loop and Data Transformations for Sparse Matrix Code. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '15). 521--532. Google ScholarDigital Library
- Anand Venkat, Mahdi Soltan Mohammadi, Jongsoo Park, Hongbo Rong, Rajkishore Barik, Michelle Mills Strout, and Mary Hall. 2016. Automating Wavefront Parallelization for Sparse Matrix Computations. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '16). 41:1--41:12. Google ScholarDigital Library
- Richard Vuduc. 2003. Automatic Performance Tuning of Sparse Matrix Kernels. Ph.D. Dissertation. University of California, Berkeley. Google ScholarDigital Library
- Richard Vuduc, Aparna Chandramowlishwaran, Jee Choi, Murat Guney, and Aashay Shringarpure. 2010. On the Limits of GPU Acceleration. In Proceedings of the 2Nd USENIX Conference on Hot Topics in Parallelism (HotPar'10). 13--13. Google ScholarDigital Library
- Richard Vuduc, James W. Demmel, Katherine A. Yelick, Shoaib Kamil, Rajesh Nishtala, and Benjamin Lee. 2002. Performance Optimizations and Bounds for Sparse Matrix-vector Multiply. In Proceedings of the 2002 ACM/IEEE Conference on Supercomputing (SC '02). 1--35. Google ScholarDigital Library
- Richard Vuduc, Shoaib Kamil, Jen Hsu, Rajesh Nishtala, James W Demmel, and Katherine A Yelick. 2002. Automatic Performance Tuning and Analysis of Sparse Triangular Solve. In ICS 2002: Workshop on Performance Optimization via High-Level Languages and Libraries.Google Scholar
- Hao Wang, Weifeng Liu, Kaixi Hou, and Wu-chun Feng. 2016. Parallel Transposition of Sparse Data Structures. In Proceedings of the 2016 International Conference on Supercomputing (ICS '16). 33:1--33:13. Google ScholarDigital Library
- Xin Wang, Weihua Zhang, Zhaoguo Wang, Ziyun Wei, Haibo Chen, and Wenyun Zhao. 2017. Eunomia: Scaling Concurrent Search Trees under Contention Using HTM. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 385--399. Google ScholarDigital Library
- Michael M Wolf, Michael A Heroux, and Erik G Boman. 2010. Factors impacting performance of multithreaded sparse triangular solve. In International Conference on High Performance Computing for Computational Science. Springer, 32--44. Google ScholarDigital Library
- Zhigeng Xu, James Lin, and Satoshi Matsuoka. 2017. Benchmarking SW26010 Many-Core Processor. In Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International. IEEE, 743--752.Google ScholarCross Ref
- Shengen Yan, Chao Li, Yunquan Zhang, and Huiyang Zhou. 2014. yaSpMV: Yet Another SpMV Framework on GPUs. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '14). 107--118. Google ScholarDigital Library
- Shengen Yan, Guoping Long, and Yunquan Zhang. 2013. StreamScan: Fast Scan Algorithms for GPUs Without Global Barrier Synchronization. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '13). 229--238. Google ScholarDigital Library
- Chao Yang, Wei Xue, Haohuan Fu, Hongtao You, Xinliang Wang, Yu-long Ao, Fangfang Liu, Lin Gan, Ping Xu, Lanning Wang, et al. 2016. 10M-core scalable fully-implicit solver for nonhydrostatic atmospheric dynamics. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '16). 6:1--6:12. Google ScholarDigital Library
- Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, Kai Tian, and Xipeng Shen. 2011. On-the-fly Elimination of Dynamic Irregularities for GPU Computing. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XVI). 369--380. Google ScholarDigital Library
- Weihua Zhang, Xin Wang, Shiyu Ji, Ziyun Wei, Zhaoguo Wang, and Haibo Chen. 2017. Scaling Concurrent Index Structures under Contention Using HTM. IEEE Transactions on Parallel and Distributed Systems (2017).Google Scholar
- Yuanrui Zhang, Wei Ding, Jun Liu, and Mahmut Kandemir. 2011. Optimizing Data Layouts for Parallel Computation on Multicores. In Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques (PACT '11). 143--154. Google ScholarDigital Library
- Yue Zhao, Jiajia Li, Chunhua Liao, and Xipeng Shen. 2018. Bridging the Gap between Deep Learning and Sparse Matrix Format Selection. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '18). Google ScholarDigital Library
Index Terms
- swSpTRSV: a fast sparse triangular solve with sparse level tile layout on sunway architectures
Recommendations
Efficient Block Algorithms for Parallel Sparse Triangular Solve
ICPP '20: Proceedings of the 49th International Conference on Parallel ProcessingThe sparse triangular solve (SpTRSV) kernel is an important building block for a number of linear algebra routines such as sparse direct and iterative solvers. The major challenge of accelerating SpTRSV lies in the difficulties of finding higher ...
swSpTRSV: a fast sparse triangular solve with sparse level tile layout on sunway architectures
PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel ProgrammingSparse triangular solve (SpTRSV) is one of the most important kernels in many real-world applications. Currently, much research on parallel SpTRSV focuses on level-set construction for reducing the number of inter-level synchronizations. However, the ...
Towards Efficient SpMV on Sunway Manycore Architectures
ICS '18: Proceedings of the 2018 International Conference on SupercomputingSparse Matrix-Vector Multiplication (SpMV) is an essential computation kernel for many data-analytic workloads running in both supercomputers and data centers. The intrinsic irregularity in SpMV is challenging to achieve high performance, especially ...
Comments