ABSTRACT
Recently we presented TTC, a domain-specific compiler for tensor transpositions. Although the generated code performs nearly optimally, TTC's offline nature means it cannot be used in application codes in which the tensor sizes and the required tensor permutations are determined only at runtime. To overcome this limitation, we introduce the open-source C++ library High-Performance Tensor Transposition (HPTT). Similar to TTC, HPTT incorporates optimizations such as blocking, multi-threading, and explicit vectorization; furthermore, it decomposes any transposition into multiple loops around a so-called micro-kernel. This modular design, inspired by BLIS, makes HPTT easy to port to different architectures: only the hand-vectorized micro-kernel (e.g., a 4 × 4 transpose) has to be replaced. HPTT also offers an optional autotuning framework, guided by performance heuristics, that explores a vast search space of implementations at runtime (similar to FFTW). Across a wide range of tensor transpositions and architectures (e.g., Intel Ivy Bridge, ARMv7, IBM Power7), HPTT attains a bandwidth comparable to that of SAXPY and yields remarkable speedups over Eigen's tensor transposition implementation. Most importantly, integrating HPTT into the Cyclops Tensor Framework (CTF) improves the overall performance of tensor contractions by up to 3.1x.
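To make the "loops around a micro-kernel" idea concrete, the following is a minimal sketch restricted to the two-dimensional (matrix) case. It is not HPTT's code or API: the names `micro_kernel_4x4`, `blocked_transpose`, and `kTile` are hypothetical, the micro-kernel is written in plain scalar C++ rather than hand-vectorized, and multi-threading and autotuning are omitted. It only illustrates how the architecture-specific part can be isolated in one small function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed tile size, matching the 4 x 4 example mentioned in the abstract.
constexpr std::size_t kTile = 4;

// Hypothetical micro-kernel: transposes one 4 x 4 tile.
// A port to a new architecture would replace only this scalar body
// with hand-written SIMD code; the surrounding loops stay unchanged.
inline void micro_kernel_4x4(const float* in, std::size_t ld_in,
                             float* out, std::size_t ld_out) {
    for (std::size_t i = 0; i < kTile; ++i)
        for (std::size_t j = 0; j < kTile; ++j)
            out[j * ld_out + i] = in[i * ld_in + j];
}

// Transposes a row-major rows x cols matrix into a row-major
// cols x rows matrix.
void blocked_transpose(const std::vector<float>& in, std::vector<float>& out,
                       std::size_t rows, std::size_t cols) {
    assert(in.size() == rows * cols);
    assert(out.size() == rows * cols);
    const std::size_t full_rows = rows - rows % kTile;
    const std::size_t full_cols = cols - cols % kTile;

    // Loops around the micro-kernel: each iteration handles one full tile.
    for (std::size_t i = 0; i < full_rows; i += kTile)
        for (std::size_t j = 0; j < full_cols; j += kTile)
            micro_kernel_4x4(&in[i * cols + j], cols, &out[j * rows + i], rows);

    // Remainder columns (right edge), for all rows.
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = full_cols; j < cols; ++j)
            out[j * rows + i] = in[i * cols + j];

    // Remainder rows (bottom edge), for columns covered by full tiles.
    for (std::size_t i = full_rows; i < rows; ++i)
        for (std::size_t j = 0; j < full_cols; ++j)
            out[j * rows + i] = in[i * cols + j];
}
```

Calling `blocked_transpose(a, b, 6, 7)` on a 6 × 7 input exercises both the tiled path and the two remainder loops. A general tensor transposition would add further loops over the remaining dimensions around the same micro-kernel.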
- Available at www.github.com/springer13/hptt.
Index Terms
HPTT: a high-performance tensor transposition C++ library