DOI: 10.1145/3091966.3091968
research-article

HPTT: a high-performance tensor transposition C++ library

Published: 18 June 2017

ABSTRACT

Recently we presented TTC, a domain-specific compiler for tensor transpositions. Although the generated code performs nearly optimally, TTC works offline and therefore cannot be used in application codes where the tensor sizes and the required permutations are determined only at runtime. To overcome this limitation, we introduce the open-source C++ library High-Performance Tensor Transposition (HPTT). Similar to TTC, HPTT incorporates optimizations such as blocking, multi-threading, and explicit vectorization; furthermore, it decomposes any transposition into multiple loops around a so-called micro-kernel. This modular design, inspired by BLIS, makes HPTT easy to port to different architectures: only the hand-vectorized micro-kernel (e.g., a 4 × 4 transpose) has to be replaced. HPTT also offers an optional autotuning framework, guided by performance heuristics, that explores a vast search space of implementations at runtime (similar to FFTW). Across a wide range of tensor transpositions and architectures (e.g., Intel Ivy Bridge, ARMv7, IBM Power7), HPTT attains a bandwidth comparable to that of SAXPY and yields remarkable speedups over Eigen's tensor transposition implementation. Most importantly, the integration of HPTT into the Cyclops Tensor Framework (CTF) improves the overall performance of tensor contractions by up to 3.1×.
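To make the idea of "loops around a micro-kernel" concrete, the following is a minimal sketch for the simplest case, an out-of-place transpose of a two-dimensional row-major matrix. It is not HPTT's code or API: the names `microKernel4x4`, `blockedTranspose`, and `transposeCopy` are hypothetical, the micro-kernel is plain scalar code standing in for a hand-vectorized one, and multi-threading and autotuning are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tile edge length handled by the (hypothetical) micro-kernel.
constexpr std::size_t kTile = 4;

// Hypothetical micro-kernel: transposes one 4 x 4 tile of floats.
// `in` points at the tile inside a row-major matrix with leading dimension ldIn,
// `out` points at the destination tile with leading dimension ldOut.
// In a design like the one described, this is the only routine that would be
// rewritten per architecture (for example with SIMD shuffles); it is scalar here.
inline void microKernel4x4(const float* in, std::size_t ldIn,
                           float* out, std::size_t ldOut) {
    for (std::size_t a = 0; a < kTile; ++a)
        for (std::size_t b = 0; b < kTile; ++b)
            out[b * ldOut + a] = in[a * ldIn + b];
}

// Out-of-place transpose of a row-major rows x cols matrix.
// Result is a row-major cols x rows matrix with out[j][i] == in[i][j].
void blockedTranspose(const float* in, float* out,
                      std::size_t rows, std::size_t cols) {
    assert(in != out);  // this sketch does not support in-place operation
    const std::size_t fullRows = rows - rows % kTile;
    const std::size_t fullCols = cols - cols % kTile;

    // Loops around the micro-kernel: every full 4 x 4 tile.
    for (std::size_t i = 0; i < fullRows; i += kTile)
        for (std::size_t j = 0; j < fullCols; j += kTile)
            microKernel4x4(in + i * cols + j, cols, out + j * rows + i, rows);

    // Right edge: columns that do not fill a complete tile, for all rows.
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = fullCols; j < cols; ++j)
            out[j * rows + i] = in[i * cols + j];

    // Bottom edge: leftover rows, restricted to columns covered by tiles above.
    for (std::size_t i = fullRows; i < rows; ++i)
        for (std::size_t j = 0; j < fullCols; ++j)
            out[j * rows + i] = in[i * cols + j];
}

// Convenience wrapper that allocates the destination buffer.
std::vector<float> transposeCopy(const std::vector<float>& in,
                                 std::size_t rows, std::size_t cols) {
    assert(in.size() == rows * cols);
    std::vector<float> out(in.size());
    blockedTranspose(in.data(), out.data(), rows, cols);
    return out;
}
```

Calling `transposeCopy(a, 5, 7)` on a 5 × 7 matrix exercises both the tiled path and the two edge loops. Keeping the tile logic in a single small function is what makes the port-by-replacing-the-kernel approach plausible: the surrounding loops only deal with pointers and leading dimensions.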

HPTT is available at www.github.com/springer13/hptt.

Published in: ARRAY 2017: Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, June 2017, 62 pages
ISBN: 9781450350693
DOI: 10.1145/3091966

Copyright © 2017 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher: Association for Computing Machinery, New York, NY, United States


Overall acceptance rate: 17 of 25 submissions, 68%
