DOI: 10.1145/3330345.3330366
Research article · Public Access

Efficient and effective sparse tensor reordering

Published: 26 June 2019

ABSTRACT

This paper formalizes the problem of reordering a sparse tensor to improve the spatial and temporal locality of operations on it, and proposes two reordering algorithms for this problem, called BFS-MCS and Lexi-Order. BFS-MCS is a breadth-first search (BFS)-like heuristic based on the maximum cardinality search family; Lexi-Order is an extension of the doubly lexical ordering of matrices to tensors. We show the effects of these schemes within the context of a widely used tensor computation, the CANDECOMP/PARAFAC decomposition (CPD), when the tensor is stored in three previously proposed sparse formats: coordinate (COO), compressed sparse fiber (CSF), and hierarchical coordinate (HiCOO). A new partition-based superblock scheduling is also proposed for the HiCOO format to improve load balance. On modern multicore CPUs, Lexi-Order obtains up to a 4.14× speedup on the sequential HiCOO-based matricized tensor-times-Khatri-Rao product (MTTKRP), the performance bottleneck of CPD, and up to an 11.88× speedup on its parallel counterpart; the performance of COO- and CSF-based MTTKRPs also improves. Our two reordering methods are more effective than state-of-the-art approaches. The code is released as part of the Parallel Tensor Infrastructure (ParTI!): https://github.com/hpcgarage/ParTI.
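To make the reordering idea concrete, here is a minimal sketch, in Python/NumPy rather than ParTI!'s C API, of the COO-format mode-1 MTTKRP and of applying a per-mode index relabeling to the nonzeros. The function names and the permutation convention (perms[m][old] = new) are illustrative assumptions, not the paper's implementation; in practice the permutations would come from a scheme such as BFS-MCS or Lexi-Order, and the tensor would be re-sorted (COO) or re-compressed (CSF/HiCOO) afterwards.

```python
import numpy as np

# Illustrative sketch, not ParTI!'s API: mode-1 MTTKRP on a COO-format
# 3rd-order tensor, and index relabeling under a given reordering.

def mttkrp_coo(inds, vals, B, C, nrows, rank):
    """For each nonzero X(i,j,k) = v, accumulate
    Y(i,:) += v * (B(j,:) * C(k,:)), where * is the elementwise
    (Hadamard) product of factor-matrix rows."""
    Y = np.zeros((nrows, rank))
    for (i, j, k), v in zip(inds, vals):
        Y[i, :] += v * B[j, :] * C[k, :]
    return Y

def relabel(inds, perms):
    """Apply per-mode permutations, perms[m][old] = new, to every
    nonzero coordinate. A locality-improving reordering clusters the
    nonzeros so that consecutive iterations of the loop above touch
    nearby rows of Y, B, and C, raising cache reuse; the factor
    matrices must be permuted consistently with the indices."""
    return [(perms[0][i], perms[1][j], perms[2][k])
            for (i, j, k) in inds]
```

Intuitively, HiCOO benefits the most because clustered nonzeros fall into fewer and denser blocks after relabeling.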


Published in

ICS '19: Proceedings of the ACM International Conference on Supercomputing
June 2019, 533 pages
ISBN: 978-1-4503-6079-1
DOI: 10.1145/3330345

Publisher: Association for Computing Machinery, New York, NY, United States

Copyright © 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.


Acceptance Rates

Overall Acceptance Rate: 584 of 2,055 submissions, 28%
