research-article

Public Access

Parallel Transposition of Sparse Data Structures

Authors:

Wu-chun FengAuthors Info & Claims

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

Article No.: 33, Pages 1 - 13

https://doi.org/10.1145/2925426.2926291

Published: 01 June 2016 Publication History

Abstract

Many applications in computational sciences and social sciences exploit sparsity and connectivity of acquired data. Even though many parallel sparse primitives such as sparse matrix-vector (SpMV) multiplication have been extensively studied, some other important building blocks, e.g., parallel transposition for sparse matrices and graphs, have not received the attention they deserve.

In this paper, we first identify that the transposition operation can be a bottleneck of some fundamental sparse matrix and graph algorithms. Then, we revisit the performance and scalability of parallel transposition approaches on x86-based multi-core and many-core processors. Based on the insights obtained, we propose two new parallel transposition algorithms: ScanTrans and MergeTrans. The experimental results show that our ScanTrans method achieves an average of 2.8-fold (up to 6.2-fold) speedup over the parallel transposition in the latest vendor-supplied library on an Intel multi-core CPU platform, and the MergeTrans approach achieves on average of 3.4-fold (up to 11.7-fold) speedup on an Intel Xeon Phi many-core processor.

References

[1]

A. V. Aho, J. E. Hopcroft, and J. Ullman. Data Structures and Algorithms. Addison-Wesley Longman Publishing Co., Inc., 1983.

Digital Library

[2]

P. R. Amestoy, T. A. Davis, and I. S. Duff. An Approximate Minimum Degree Ordering Algorithm. SIAM Journal on Matrix Analysis and Applications, 17(4):886--905, 1996.

Digital Library

[3]

A. Ashari, N. Sedaghati, J. Eisenlohr, S. Parthasarathy, and P. Sadayappan. Fast Sparse Matrix-Vector Multiplication on GPUs for Graph Applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '14, pages 781--792. IEEE, 2014.

Digital Library

[4]

A. Ashari, N. Sedaghati, J. Eisenlohr, and P. Sadayappan. An Efficient Two-Dimensional Blocking Strategy for Sparse Matrix-vector Multiplication on GPUs. In Proceedings of the 28th ACM International Conference on Supercomputing, ICS '14, pages 273--282. ACM, 2014.

Digital Library

[5]

S. Ashkiani, A. Davidson, U. Meyer, and J. D. Owens. GPU Multisplit. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '16. ACM, 2016.

Digital Library

[6]

D. A. Bader, H. Meyerhenke, P. Sanders, and D. Wagner. Graph Partitioning and Graph Clustering. American Mathematical Soc., 2013.

[7]

Å. Björck. Numerical Methods for Least Squares Problems. Society for Industrial and Applied Mathematics, 1996.

[8]

A. Buluç, J. T. Fineman, M. Frigo, J. R. Gilbert, and C. E. Leiserson. Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication Using Compressed Sparse Blocks. In Proceedings of the Twenty-first Annual Symposium on Parallelism in Algorithms and Architectures, SPAA '09, pages 233--244. ACM, 2009.

Digital Library

[9]

A. Buluç and J. R. Gilbert. Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication. In Proceedings of the 37th International Conference on Parallel Processing, ICPP '08, pages 503--510. IEEE, 2008.

Digital Library

[10]

A. Buluç and J. R. Gilbert. On the Representation and Multiplication of Hypersparse Matrices. In Proceedings of the 2008 IEEE International Symposium on Parallel and Distributed Processing, IPDPS '08, pages 1--11. IEEE, 2008.

[11]

A. Buluç and J. R. Gilbert. The Combinatorial BLAS: Design, Implementation, and Applications. International Journal of High Performance Computing Applications, 25(4):496--509, 2011.

Digital Library

[12]

A. Buluç, S. Williams, L. Oliker, and J. Demmel. Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication. In Proceedings of the 2011 IEEE International Parallel and Distributed Processing Symposium, IPDPS '11, pages 721--733. IEEE, 2011.

Digital Library

[13]

B. Catanzaro, A. Keller, and M. Garland. A Decomposition for In-place Matrix Transposition. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '14, pages 193--206. ACM, 2014.

Digital Library

[14]

D. Church, V. Schneider, K. Steinberg, M. Schatz, A. Quinlan, C.-S. Chin, P. Kitts, B. Aken, G. Marth, M. Hoffman, J. Herrero, M. L. Mendoza, R. Durbin, and P. Flicek. Extending Reference Assembly Models. Genome Biology, 16(1):13, 2015.

[15]

T. A. Davis. Direct Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, 2006.

Digital Library

[16]

T. A. Davis and Y. Hu. The University of Florida Sparse Matrix Collection. ACM Trans. Math. Softw., 38(1):1:1--1:25, dec 2011.

Digital Library

[17]

F. Dellaert and M. Kaess. Square Root SAM: Simultaneous Localization and Mapping via Square Root Information Smoothing. The International Journal of Robotics Research, 25(12):1181--1203, 2006.

Digital Library

[18]

Y. Dotsenko, S. S. Baghsorkhi, B. Lloyd, and N. K. Govindaraju. Auto-tuning of Fast Fourier Transform on Graphics Processors. In Proceedings of the 16th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '11, pages 257--266. ACM, 2011.

Digital Library

[19]

I. S. Duff, A. M. Erisman, and J. K. Reid. Direct Methods for Sparse Matrices. Oxford University Press, Inc., 1986.

Digital Library

[20]

R. Fletcher. Conjugate Gradient Methods for Indefinite Systems. In Numerical Analysis, Lecture Notes in Mathematics, pages 73--89. Springer Berlin Heidelberg, 1976.

[21]

R. W. Freund. A Transpose-Free Quasi-Minimal Residual Algorithm for Non-Hermitian Linear Systems. SIAM Journal on Scientific Computing, 14(2):470--482, 1993.

Digital Library

[22]

R. W. Freund and N. M. Nachtigal. QMR: a Quasi-Minimal Residual Method for Non-Hermitian Linear Systems. Numerische Mathematik, 60(1):315--339, 1991.

Digital Library

[23]

E. Georganas, A. Buluç, J. Chapman, S. Hofmeyr, C. Aluru, R. Egan, L. Oliker, D. Rokhsar, and K. Yelick. HipMer: An Extreme-scale De Novo Genome Assembler. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '15, pages 14:1--14:11. IEEE, 2015.

Digital Library

[24]

O. Green, R. McColl, and D. A. Bader. GPU Merge Path - A GPU Merging Algorithm. In Proceedings of the 26th ACM International Conference on Supercomputing, ICS '12, pages 331--340. ACM, 2012.

Digital Library

[25]

F. Gustavson, L. Karlsson, and B. Kågström. Parallel and Cache-Efficient In-Place Matrix Storage Format Conversion. ACM Trans. Math. Softw., 38(3):17:1--17:32, 2012.

Digital Library

[26]

F. G. Gustavson. Two Fast Algorithms for Sparse Matrices: Multiplication and Permuted Transposition. ACM Trans. Math. Softw., 4(3):250--269, 1978.

Digital Library

[27]

S. Hong, N. C. Rodia, and K. Olukotun. On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-world Graphs. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '13, pages 92:1--92:11. IEEE, 2013.

Digital Library

[28]

K. Hou, H. Wang, and W.-c. Feng. ASPaS: A Framework for Automatic SIMDization of Parallel Sorting on x86-based Many-core Processors. In Proceedings of the 29th ACM International Conference on Supercomputing, ICS '15, pages 383--392. ACM, 2015.

Digital Library

[29]

K. Hou, H. Wang, and W.-c. Feng. AAlign: A SIMD Framework for Pairwise Sequence Alignment on x86-based Multi- and Many-core Processors. In Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, IPDPS '16. IEEE, 2016.

[30]

H. Inoue and K. Taura. SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures. Proceedings of the VLDB Endowment, 8(11):1274--1285, 2015.

Digital Library

[31]

H. Kabir, J. D. Booth, G. Aupy, A. Benoit, Y. Robert, and P. Raghavan. STS-k: A Multilevel Sparse Triangular Solution Scheme for NUMA Multicores. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '15, pages 55:1--55:11. IEEE, 2015.

Digital Library

[32]

J. J. Leonard, H. F. Durrant-Whyte, and I. J. Cox. Dynamic Map Building for an Autonomous Mobile Robot. Int. J. Rob. Res., 11(4):286--298, 1992.

Digital Library

[33]

W. Liu and B. Vinter. A Framework for General Sparse Matrix-Matrix Multiplication on GPUs and Heterogeneous Processors. Journal of Parallel and Distributed Computing, 85:47--61, 2015.

Digital Library

[34]

W. Liu and B. Vinter. CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. In Proceedings of the 29th ACM International Conference on Supercomputing, ICS '15, pages 339--350. ACM, 2015.

Digital Library

[35]

W. Liu and B. Vinter. Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors. Parallel Computing, 49:179--193, 2015.

Digital Library

[36]

D. S. McFarlin, V. Arbatov, F. Franchetti, and M. Püschel. Automatic SIMD Vectorization of Fast Fourier Transforms for the Larrabee and AVX Instruction Sets. In Proceedings of the 25th ACM International Conference on Supercomputing, ICS '11, pages 265--274. ACM, 2011.

Digital Library

[37]

W. McLendon III, B. Hendrickson, S. J. Plimpton, and L. Rauchwerger. Finding Strongly Connected Components in Distributed Graphs. J. Parallel Distrib. Comput., 65(8):901--910, 2005.

Digital Library

[38]

M. M. A. Patwary, N. R. Satish, N. Sundaram, J. Park, M. J. Anderson, S. G. Vadlamudi, D. Das, S. G. Pudov, V. O. Pirogov, and P. Dubey. Parallel Efficient Sparse Matrix-Matrix Multiplication on Multicore Platforms. In High Performance Computing, volume 9137, pages 48--57. Springer, 2015.

[39]

S. Ramos and T. Hoefler. Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi. In Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing, HPDC '13, pages 97--108. ACM, 2013.

Digital Library

[40]

Y. Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2nd edition, 2003.

Digital Library

[41]

N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and P. Dubey. Fast Sort on CPUs and GPUs: A Case for Bandwidth Oblivious SIMD Sort. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, SIGMOD '10, pages 351--362. ACM, 2010.

Digital Library

[42]

N. Sedaghati, T. Mu, L.-N. Pouchet, S. Parthasarathy, and P. Sadayappan. Automatic Selection of Sparse Matrix Representation on GPUs. In Proceedings of the 29th ACM International Conference on Supercomputing, ICS '15. ACM, 2015.

Digital Library

[43]

I.-J. Sung, J. Gómez-Luna, J. M. González-Linares, N. Guil, and W.-M. W. Hwu. In-place Transposition of Rectangular Matrices on Accelerators. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '14, pages 207--218. ACM, 2014.

Digital Library

[44]

R. Tarjan. Depth-First Search and Linear Graph Algorithms. SIAM Journal on Computing, 1(2):146--160, 1972.

Digital Library

[45]

S. Yan, G. Long, and Y. Zhang. StreamScan: Fast Scan Algorithms for GPUs without Global Barrier Synchronization. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, pages 229--238. ACM, 2013.

Digital Library

[46]

X. Yu, H. Wang, W.-c. Feng, H. Gong, and G. Cao. cuART: Fine-Grained Algebraic Reconstruction Technique for Computed Tomography Images on GPUs. In Proceedings of the 2016 IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid '16. IEEE, 2016.

Digital Library

Cited By

Hu ZSun JLi ZSun G(2024)AG-SpTRSV: An Automatic Framework to Optimize Sparse Triangular Solve on GPUsACM Transactions on Architecture and Code Optimization10.1145/367491121:4(1-25)Online publication date: 25-Jun-2024
https://dl.acm.org/doi/10.1145/3674911
Zhang GHsu OKjolstad F(2024)Compilation of Modular and General Sparse WorkspacesProceedings of the ACM on Programming Languages10.1145/36564268:PLDI(1213-1238)Online publication date: 20-Jun-2024
https://dl.acm.org/doi/10.1145/3656426
Li JLi LWang QXue WLiang JShi J(2024)Parallel optimization and application of unstructured sparse triangular solver on new generation of Sunway architectureParallel Computing10.1016/j.parco.2024.103080120(103080)Online publication date: Jun-2024
https://doi.org/10.1016/j.parco.2024.103080
Show More Cited By

Parallel Transposition of Sparse Data Structures
1. Mathematics of computing
  1. Mathematical analysis
    1. Numerical analysis

Recommendations

Explicit Fourth-Order Runge---Kutta Method on Intel Xeon Phi Coprocessor

This paper concerns an Intel Xeon Phi implementation of the explicit fourth-order Runge---Kutta method (RK4) for very sparse matrices with very short rows. Such matrices arise during Markovian modeling of computer and telecommunication networks. In this ...
Efficient sparse matrix-vector multiplication on x86-based many-core processors
ICS '13: Proceedings of the 27th international ACM conference on International conference on supercomputing

Sparse matrix-vector multiplication (SpMV) is an important kernel in many scientific applications and is known to be memory bandwidth limited. On modern processors with wide SIMD and large numbers of cores, we identify and address several bottlenecks ...
CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication
ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing

Sparse matrix-vector multiplication (SpMV) is a fundamental building block for numerous applications. In this paper, we propose CSR5 (Compressed Sparse Row 5), a new storage format, which offers high-throughput SpMV on various platforms including CPUs, ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

June 2016

547 pages

ISBN:9781450343619

DOI:10.1145/2925426

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 2016

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

National Science Foundation

Conference

ICS '16

Sponsor:

SIGARCH

ICS '16: 2016 International Conference on Supercomputing

June 1 - 3, 2016

Istanbul, Turkey

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

35
Total Citations
View Citations
1,932
Total Downloads

Downloads (Last 12 months)448
Downloads (Last 6 weeks)33

Reflects downloads up to 09 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Hu ZSun JLi ZSun G(2024)AG-SpTRSV: An Automatic Framework to Optimize Sparse Triangular Solve on GPUsACM Transactions on Architecture and Code Optimization10.1145/367491121:4(1-25)Online publication date: 25-Jun-2024
https://dl.acm.org/doi/10.1145/3674911
Zhang GHsu OKjolstad F(2024)Compilation of Modular and General Sparse WorkspacesProceedings of the ACM on Programming Languages10.1145/36564268:PLDI(1213-1238)Online publication date: 20-Jun-2024
https://dl.acm.org/doi/10.1145/3656426
Li JLi LWang QXue WLiang JShi J(2024)Parallel optimization and application of unstructured sparse triangular solver on new generation of Sunway architectureParallel Computing10.1016/j.parco.2024.103080120(103080)Online publication date: Jun-2024
https://doi.org/10.1016/j.parco.2024.103080
Guo LKestor G(2023)On Higher-performance Sparse Tensor Transposition2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)10.1109/IPDPSW59300.2023.00118(697-701)Online publication date: May-2023
https://doi.org/10.1109/IPDPSW59300.2023.00118
Feng SHe XChen KKe LZhang XBlaauw DMudge TDreslinski RSalapura VZahran MChong FTang L(2022)MeNDAProceedings of the 49th Annual International Symposium on Computer Architecture10.1145/3470496.3527432(245-258)Online publication date: 18-Jun-2022
https://dl.acm.org/doi/10.1145/3470496.3527432
Hidayetoğlu MBiçer Tde Gonzalo SRen BGürsoy DKettimuthu RFoster IHwu W(2022)MemXCT: Design, Optimization, Scaling, and Reproducibility of X-Ray Tomography ImagingIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.312803233:9(2014-2031)Online publication date: 1-Sep-2022
https://doi.org/10.1109/TPDS.2021.3128032
Chen ZWang DYu QZheng FGuo FChen Z(2022)AMT: asynchronous in-place matrix transpose mechanism for sunway many-core processorThe Journal of Supercomputing10.1007/s11227-021-04282-678:7(9456-9474)Online publication date: 17-Jan-2022
https://doi.org/10.1007/s11227-021-04282-6
Helal ALaukemann JChecconi FTithi JRanadive TPetrini FChoi JZhou HMoreira JMueller FEtsion Y(2021)ALTOProceedings of the 35th ACM International Conference on Supercomputing10.1145/3447818.3461703(404-416)Online publication date: 3-Jun-2021
https://dl.acm.org/doi/10.1145/3447818.3461703
Ahmad NYilmaz BUnat D(2021)A Split Execution Model for SpTRSVIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.307450132:11(2809-2822)Online publication date: 1-Nov-2021
https://doi.org/10.1109/TPDS.2021.3074501
Zhang FSu JLiu WHe BWu RDu XWang R(2021)YuenyeungSpTRSV: A Thread-Level and Warp-Level Fusion Synchronization-Free Sparse Triangular SolveIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.306663532:9(2321-2337)Online publication date: 1-Sep-2021
https://doi.org/10.1109/TPDS.2021.3066635
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Figures

Tables

Media

View Table of Conten