skip to main content
10.1145/1366219.1366223acmconferencesArticle/Chapter ViewAbstractPublication PagescfConference Proceedingsconference-collections
research-article

Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms

Published:05 May 2008Publication History

ABSTRACT

We present a parallel implementation of a cache oblivious algorithm for matrix multiplication on multicore platforms. The algorithm is based on a storage scheme and a block-recursive approach for multiplication, which are both based on a Peano space-filling curve. The recursion is stopped on matrix blocks with a size that needs to perfectly match the size of the L1 cache of the underlying CPU. The respective block multiplications are implemented by multiplication kernels that are hand-optimised for the SIMD units of current x86 CPUs. The Peano storage scheme is used to partition the block multiplications to different cores. Performance tests on various multicore platforms with up to 16 cores and different memory architecture show that the resulting implementation leads to better parallel scalability than achieved by Intel's MKL or GotoBLAS, and can outperform both libraries in terms of absolute performance on eight or more cores.

References

  1. Bader, M. and Zenger, C.: Cache oblivious matrix multiplication using an element ordering based on a Peano curve. Linear Algebra Appl. 417 (2-3), 2006.Google ScholarGoogle Scholar
  2. Bader, M., Franz, R., Günther, S., and Heinecke, A.: Hardware-oriented Implementation of Cache Oblivious Matrix Operations Based on Space-filling Curves, Proceedings of the PPAM 2007, LNCS 4967, 2008 (in print). Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Elmroth, E., Gustavson, F., Jonsson, I., and Kågström, B.: Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review 46 (1), 2004.Google ScholarGoogle ScholarCross RefCross Ref
  4. Goto, K., and van de Geijn, R.A.: Anatomy of a High-Performance Matrix Multiplication. Accepted for publication in ACM Transactions on Mathematical Software 34 (3), 2008, preprint: http://www.cs.utexas.edu/users/flame/pubs/openflame.pdf. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. GotoBLAS, Texas Advanced Computing Center, http://www.tacc.utexas.edu/resources/software/Google ScholarGoogle Scholar
  6. Gunnels, J.A., Gustavson, F.G., Pingali, K., Yotov, K.: Is cache-oblivious DGEMM viable? Proceedings of the PARA 2006, LNCS 4699, p. 919--928, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Gustavson, F. G.: Recursion leads to automatic variable blocking for dense linear-algebra algorithms. IBM Journal of Research and Development 41 (6), 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Kurzak, J. and Dongarra, J.: Implementation of the Mixed-Precision High Performance LINPACK Benchmark on the Cell Processor. LAPACK Working Note 177, 2006.Google ScholarGoogle Scholar
  9. Intel math kernel library, 2007. http://intel.com/cd/software/products/asmo-na/eng/perflib/mkl/Google ScholarGoogle Scholar
  10. TifaMMy (TifaMMy isn?t the fastest Matrix Multiplication, yet), http://tifammy.sourceforge.net, version 1.3.2.Google ScholarGoogle Scholar
  11. Whaley, R.C., Petitet, A., and Dongarra, J.J.: Automated empirical optimization of software and the ATLAS project, Parallel Computing 27 (1--2), 2001.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      MAW '08: Proceedings of the 2008 workshop on Memory access on future processors: a solved problem?
      May 2008
      44 pages
      ISBN:9781605580913
      DOI:10.1145/1366219

      Copyright © 2008 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 5 May 2008

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      MAW '08 Paper Acceptance Rate4of8submissions,50%Overall Acceptance Rate4of8submissions,50%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader