Abstract
BLIS is a new software framework for instantiating high-performance BLAS-like dense linear algebra libraries. We demonstrate how BLIS acts as a productivity multiplier by using it to implement the level-3 BLAS on a variety of current architectures. The systems for which we demonstrate the framework include state-of-the-art general-purpose, low-power, and many-core architectures. We show how, with very little effort, the BLIS framework yields sequential and parallel implementations that are competitive with the performance of ATLAS, OpenBLAS (an effort to maintain and extend the GotoBLAS), and commercial vendor implementations such as AMD’s ACML, IBM’s ESSL, and Intel’s MKL libraries. Although most of this article focuses on single-core implementations, we also provide compelling results that suggest the framework’s leverage extends to the multithreaded domain.