The BLIS Framework: Experiments in Portability

Published: 03 June 2016

Abstract

BLIS is a new software framework for instantiating high-performance BLAS-like dense linear algebra libraries. We demonstrate how BLIS acts as a productivity multiplier by using it to implement the level-3 BLAS on a variety of current architectures. The systems for which we demonstrate the framework include state-of-the-art general-purpose, low-power, and many-core architectures. We show how, with very little effort, the BLIS framework yields sequential and parallel implementations that are competitive with the performance of ATLAS, OpenBLAS (an effort to maintain and extend the GotoBLAS), and commercial vendor implementations such as AMD’s ACML, IBM’s ESSL, and Intel’s MKL libraries. Although most of this article focuses on single-core implementation, we also provide compelling results that suggest the framework’s leverage extends to the multithreaded domain.

Published in

ACM Transactions on Mathematical Software, Volume 42, Issue 2
June 2016
156 pages
ISSN: 0098-3500
EISSN: 1557-7295
DOI: 10.1145/2936306

Copyright © 2016 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 3 June 2016
• Accepted: 1 April 2015
• Revised: 1 February 2015
• Received: 1 August 2013

Qualifiers

• research-article
• Research
• Refereed
