research-article

Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms

Authors:
Alexander Heinecke

TU München, München, Germany

TU München, München, Germany
View Profile

,
Michael Bader

TU München, München, Germany

TU München, München, Germany
View Profile

MAW '08: Proceedings of the 2008 workshop on Memory access on future processors: a solved problem?May 2008Pages 385–392https://doi.org/10.1145/1366219.1366223

Published:05 May 2008Publication History

MAW '08: Proceedings of the 2008 workshop on Memory access on future processors: a solved problem?

Pages 385–392

ABSTRACT

We present a parallel implementation of a cache oblivious algorithm for matrix multiplication on multicore platforms. The algorithm is based on a storage scheme and a block-recursive approach for multiplication, which are both based on a Peano space-filling curve. The recursion is stopped on matrix blocks with a size that needs to perfectly match the size of the L1 cache of the underlying CPU. The respective block multiplications are implemented by multiplication kernels that are hand-optimised for the SIMD units of current x86 CPUs. The Peano storage scheme is used to partition the block multiplications to different cores. Performance tests on various multicore platforms with up to 16 cores and different memory architecture show that the resulting implementation leads to better parallel scalability than achieved by Intel's MKL or GotoBLAS, and can outperform both libraries in terms of absolute performance on eight or more cores.

References

Bader, M. and Zenger, C.: Cache oblivious matrix multiplication using an element ordering based on a Peano curve. Linear Algebra Appl. 417 (2-3), 2006.Google Scholar
Bader, M., Franz, R., Günther, S., and Heinecke, A.: Hardware-oriented Implementation of Cache Oblivious Matrix Operations Based on Space-filling Curves, Proceedings of the PPAM 2007, LNCS 4967, 2008 (in print). Google ScholarDigital Library
Elmroth, E., Gustavson, F., Jonsson, I., and Kågström, B.: Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review 46 (1), 2004.Google ScholarCross Ref
Goto, K., and van de Geijn, R.A.: Anatomy of a High-Performance Matrix Multiplication. Accepted for publication in ACM Transactions on Mathematical Software 34 (3), 2008, preprint: http://www.cs.utexas.edu/users/flame/pubs/openflame.pdf. Google ScholarDigital Library
GotoBLAS, Texas Advanced Computing Center, http://www.tacc.utexas.edu/resources/software/Google Scholar
Gunnels, J.A., Gustavson, F.G., Pingali, K., Yotov, K.: Is cache-oblivious DGEMM viable? Proceedings of the PARA 2006, LNCS 4699, p. 919--928, 2007. Google ScholarDigital Library
Gustavson, F. G.: Recursion leads to automatic variable blocking for dense linear-algebra algorithms. IBM Journal of Research and Development 41 (6), 1997. Google ScholarDigital Library
Kurzak, J. and Dongarra, J.: Implementation of the Mixed-Precision High Performance LINPACK Benchmark on the Cell Processor. LAPACK Working Note 177, 2006.Google Scholar
Intel math kernel library, 2007. http://intel.com/cd/software/products/asmo-na/eng/perflib/mkl/Google Scholar
TifaMMy (TifaMMy isn?t the fastest Matrix Multiplication, yet), http://tifammy.sourceforge.net, version 1.3.2.Google Scholar
Whaley, R.C., Petitet, A., and Dongarra, J.J.: Automated empirical optimization of software and the ATLAS project, Parallel Computing 27 (1--2), 2001.Google ScholarCross Ref

Index Terms

Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms
1. Mathematics of computing
  1. Mathematical software

Recommendations

Optimization of sparse matrix-vector multiplication on emerging multicore platforms

We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems,...
Read More
Performance Tuning of Matrix Multiplication in OpenCL on Different GPUs and CPUs
SCC '12: Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis

OpenCL (Open Computing Language) is a framework for general-purpose parallel programming. Programs written in OpenCL are functionally portable across multiple processors including CPUs, GPUs, and also FPGAs. Using an auto-tuning technique makes ...
Read More
Architecture-aware configuration and scheduling of matrix multiplication on asymmetric multicore processors

Asymmetric multicore processors have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest for ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
MAW '08: Proceedings of the 2008 workshop on Memory access on future processors: a solved problem?
May 2008
44 pages
ISBN:9781605580913
DOI:10.1145/1366219
Program Chairs:
Josef Weidendorfer
Technische Universität München, Germany
,
Jie Tao
Institute for Scientific Computing Karlsruhe, Germany
,
Rainer Buchty
Universität Karlsruhe, Germany
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 5 May 2008
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
cache oblivious algorithms
matrix multiplication
multicore
parallelisation
space-filling curves
Qualifiers
- research-article
Conference

Acceptance Rates
MAW '08 Paper Acceptance Rate4of8submissions,50%Overall Acceptance Rate4of8submissions,50%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 28
  Total Citations
  View Citations
- 638
  Total Downloads
- Downloads (Last 12 months)1
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms

MAW '08: Proceedings of the 2008 workshop on Memory access on future processors: a solved problem?

ABSTRACT

References

Cited By

Index Terms

Recommendations

Optimization of sparse matrix-vector multiplication on emerging multicore platforms

Performance Tuning of Matrix Multiplication in OpenCL on Different GPUs and CPUs

Architecture-aware configuration and scheduling of matrix multiplication on asymmetric multicore processors

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms

MAW '08: Proceedings of the 2008 workshop on Memory access on future processors: a solved problem?

ABSTRACT

References

Cited By

Index Terms

Recommendations

Optimization of sparse matrix-vector multiplication on emerging multicore platforms

Performance Tuning of Matrix Multiplication in OpenCL on Different GPUs and CPUs

Architecture-aware configuration and scheduling of matrix multiplication on asymmetric multicore processors

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media