The BLIS Framework: Experiments in Portability

Authors:
Field G. Van Zee (University of Texas at Austin, Austin, TX)
Tyler M. Smith (University of Texas at Austin, Austin, TX)
Bryan Marker (University of Texas at Austin, Austin, TX)
Tze Meng Low (University of Texas at Austin, Austin, TX)
Robert A. Van De Geijn (University of Texas at Austin, Austin, TX)
Francisco D. Igual (Complutense University of Madrid, Madrid, Spain)
Mikhail Smelyanskiy (Intel Corporation, Santa Clara, CA)
Xianyi Zhang (Chinese Academy of Sciences, Beijing, China)
Michael Kistler (IBM Corporation, Austin, TX)
Vernon Austel (IBM Corporation, Yorktown Heights, NY)
John A. Gunnels (IBM Corporation, Yorktown Heights, NY)
Lee Killough (Cray Inc., Seattle, WA)

ACM Transactions on Mathematical Software, Volume 42, Issue 2, Article No. 12, pp. 1–19. https://doi.org/10.1145/2755561

Published: 03 June 2016

Abstract

BLIS is a new software framework for instantiating high-performance BLAS-like dense linear algebra libraries. We demonstrate how BLIS acts as a productivity multiplier by using it to implement the level-3 BLAS on a variety of current architectures. The systems for which we demonstrate the framework include state-of-the-art general-purpose, low-power, and many-core architectures. We show how, with very little effort, the BLIS framework yields sequential and parallel implementations that are competitive with the performance of ATLAS, OpenBLAS (an effort to maintain and extend the GotoBLAS), and commercial vendor implementations such as AMD’s ACML, IBM’s ESSL, and Intel’s MKL libraries. Although most of this article focuses on single-core implementation, we also provide compelling results that suggest the framework’s leverage extends to the multithreaded domain.

Index Terms

The BLIS Framework: Experiments in Portability
1. Mathematics of computing
  1. Mathematical software

Published in
ACM Transactions on Mathematical Software Volume 42, Issue 2
June 2016
156 pages
ISSN:0098-3500
EISSN:1557-7295
DOI:10.1145/2936306
Editor:
Michael A. Heroux
Sandia National Laboratories, USA
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 3 June 2016
- Accepted: 1 April 2015
- Revised: 1 February 2015
- Received: 1 August 2013
Author Tags
BLAS
Linear algebra
high performance
libraries
matrix
multiplication
Qualifiers
- research-article
- Research
- Refereed
Article Metrics
- Total Citations: 46
- Total Downloads: 2,183
- Downloads (Last 12 months): 308
- Downloads (Last 6 weeks): 71