ABSTRACT
This paper formalizes the problem of reordering a sparse tensor to improve the spatial and temporal locality of operations on it, and proposes two reordering algorithms for this problem, which we call BFS-MCS and Lexi-Order. The BFS-MCS method is a Breadth-First Search (BFS)-like heuristic approach based on the maximum cardinality search family; Lexi-Order is an extension of the doubly lexical ordering of matrices to tensors. We show the effects of these schemes within the context of a widely used tensor computation, the CANDECOMP/PARAFAC decomposition (CPD), when storing the tensor in three previously proposed sparse tensor formats: coordinate (COO), compressed sparse fiber (CSF), and hierarchical coordinate (HiCOO). A new partition-based superblock scheduling is also proposed for the HiCOO format to improve load balance. On modern multicore CPUs, we show that Lexi-Order obtains up to a 4.14× speedup on sequential HiCOO-Mttkrp and an 11.88× speedup on its parallel counterpart. The performance of COO- and CSF-based Mttkrps also improves. Our two reordering methods are more effective than state-of-the-art approaches. The code is released as part of the Parallel Tensor Infrastructure (ParTI!): https://github.com/hpcgarage/ParTI.
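To make concrete the kernel whose locality the proposed reorderings target, the following is a minimal sketch (not the paper's implementation) of a mode-1 Mttkrp over a third-order sparse tensor stored in COO format. The function name `mttkrp_coo` and its parameters are illustrative; the iteration pattern shows why nonzero ordering matters, since consecutive nonzeros with nearby `i`, `j`, and `k` indices reuse rows of `M`, `B`, and `C` that are already in cache.

```python
import numpy as np

def mttkrp_coo(inds, vals, B, C, num_rows):
    """Mode-1 MTTKRP for a 3rd-order sparse tensor in COO format.

    inds     : (nnz, 3) integer array of (i, j, k) coordinates
    vals     : (nnz,) array of nonzero values
    B, C     : dense factor matrices for modes 2 and 3, each with R columns
    num_rows : size of the tensor's first mode

    Returns M where M[i, :] = sum over nonzeros of X(i,j,k) * (B[j,:] * C[k,:]).
    """
    R = B.shape[1]
    M = np.zeros((num_rows, R))
    for (i, j, k), v in zip(inds, vals):
        # Each nonzero touches one row of M, B, and C; a good ordering of
        # the nonzeros keeps these row accesses close together in memory.
        M[i, :] += v * B[j, :] * C[k, :]
    return M
```

A reordering such as Lexi-Order permutes the indices along each mode so that nonzeros cluster into dense-ish blocks, which benefits this access pattern regardless of whether the tensor is then stored in COO, CSF, or HiCOO.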