ABSTRACT
This paper formalizes the problem of reordering a sparse tensor to improve the spatial and temporal locality of operations on it, and proposes two reordering algorithms for this problem, which we call BFS-MCS and Lexi-Order. The BFS-MCS method is a Breadth-First Search (BFS)-like heuristic approach based on the maximum cardinality search family; Lexi-Order is an extension of the doubly lexical ordering of matrices to tensors. We show the effects of these schemes within the context of a widely used tensor computation, the CANDECOMP/PARAFAC decomposition (CPD), when storing the tensor in three previously proposed sparse tensor formats: coordinate (COO), compressed sparse fiber (CSF), and hierarchical coordinate (HiCOO). A new partition-based superblock scheduling is also proposed for the HiCOO format to improve load balance. On modern multicore CPUs, we show that Lexi-Order obtains up to a 4.14× speedup on sequential HiCOO-Mttkrp and an 11.88× speedup on its parallel counterpart. The performance of COO- and CSF-based Mttkrps also improves. Our two reordering methods are more effective than state-of-the-art approaches. The code is released as part of the Parallel Tensor Infrastructure (ParTI!): https://github.com/hpcgarage/ParTI.
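To make concrete the kernel whose locality the proposed reorderings target, the following is a minimal sketch (not the paper's implementation) of a mode-1 Mttkrp over a third-order sparse tensor stored in COO format. The function name `mttkrp_coo` and its parameters are illustrative; the iteration pattern shows why nonzero ordering matters, since consecutive nonzeros with nearby `i`, `j`, and `k` indices reuse rows of `M`, `B`, and `C` that are already in cache.

```python
import numpy as np

def mttkrp_coo(inds, vals, B, C, num_rows):
    """Mode-1 MTTKRP for a 3rd-order sparse tensor in COO format.

    inds     : (nnz, 3) integer array of (i, j, k) coordinates
    vals     : (nnz,) array of nonzero values
    B, C     : dense factor matrices for modes 2 and 3, each with R columns
    num_rows : size of the tensor's first mode

    Returns M where M[i, :] = sum over nonzeros of X(i,j,k) * (B[j,:] * C[k,:]).
    """
    R = B.shape[1]
    M = np.zeros((num_rows, R))
    for (i, j, k), v in zip(inds, vals):
        # Each nonzero touches one row of M, B, and C; a good ordering of
        # the nonzeros keeps these row accesses close together in memory.
        M[i, :] += v * B[j, :] * C[k, :]
    return M
```

A reordering such as Lexi-Order permutes the indices along each mode so that nonzeros cluster into dense-ish blocks, which benefits this access pattern regardless of whether the tensor is then stored in COO, CSF, or HiCOO.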