ABSTRACT
We present a parallel implementation of a cache oblivious algorithm for matrix multiplication on multicore platforms. The algorithm relies on a storage scheme and a block-recursive multiplication approach, both derived from a Peano space-filling curve. The recursion stops at matrix blocks whose size matches the L1 cache of the underlying CPU. The respective block multiplications are implemented by multiplication kernels that are hand-optimised for the SIMD units of current x86 CPUs. The Peano storage scheme is used to partition the block multiplications across cores. Performance tests on various multicore platforms with up to 16 cores and different memory architectures show that the resulting implementation achieves better parallel scalability than Intel's MKL or GotoBLAS, and can outperform both libraries in terms of absolute performance on eight or more cores.
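To illustrate the block-recursive scheme described in the abstract, the following is a minimal sketch of cache-oblivious recursive matrix multiplication. Note the simplifications: the paper recurses over a 3x3 subdivision following a Peano curve and stores matrix elements in Peano order, whereas this sketch uses a standard 2x2 quadrant split over row-major storage; the function names and the `BLOCK` cutoff are illustrative, not from the paper (in the actual implementation the cutoff block is sized to fit the L1 cache and multiplied by a hand-optimised SIMD kernel).

```python
# Illustrative sketch of block-recursive (cache-oblivious) matrix
# multiplication. Simplification: 2x2 quadrant recursion on row-major
# storage, instead of the paper's 3x3 Peano-curve subdivision and
# Peano-order element layout. BLOCK is a placeholder for the L1-sized
# cutoff at which the paper switches to a hand-tuned SIMD kernel.

BLOCK = 2  # recursion cutoff (illustrative; matched to L1 cache in the paper)

def matmul_recursive(A, B, C, n, ai, aj, bi, bj, ci, cj, size):
    """Accumulate the product of the size x size sub-blocks of A and B
    (top-left corners (ai, aj) and (bi, bj)) into the sub-block of C
    with top-left corner (ci, cj). Matrices are flat row-major lists."""
    if size <= BLOCK:
        # Base case: plain triple loop standing in for the SIMD kernel.
        for i in range(size):
            for k in range(size):
                a = A[(ai + i) * n + (aj + k)]
                for j in range(size):
                    C[(ci + i) * n + (cj + j)] += a * B[(bi + k) * n + (bj + j)]
        return
    # Recursive case: split each operand into four quadrants and combine
    # the eight quadrant products (C_ij += A_ik * B_kj).
    h = size // 2
    for i in (0, h):
        for j in (0, h):
            for k in (0, h):
                matmul_recursive(A, B, C, n,
                                 ai + i, aj + k,
                                 bi + k, bj + j,
                                 ci + i, cj + j, h)

def matmul(A, B, n):
    """Multiply two n x n matrices given as flat row-major lists (n a power of 2)."""
    C = [0.0] * (n * n)
    matmul_recursive(A, B, C, n, 0, 0, 0, 0, 0, 0, n)
    return C
```

Because every recursion level works on contiguous sub-problems of shrinking size, some recursion depth always produces blocks that fit any given cache level, which is what makes the approach cache-oblivious; the paper's Peano layout additionally keeps consecutively processed blocks adjacent in memory.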
- Bader, M. and Zenger, C.: Cache oblivious matrix multiplication using an element ordering based on a Peano curve. Linear Algebra Appl. 417 (2--3), 2006.
- Bader, M., Franz, R., Günther, S., and Heinecke, A.: Hardware-oriented Implementation of Cache Oblivious Matrix Operations Based on Space-filling Curves. Proceedings of PPAM 2007, LNCS 4967, 2008 (in print).
- Elmroth, E., Gustavson, F., Jonsson, I., and Kågström, B.: Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review 46 (1), 2004.
- Goto, K. and van de Geijn, R. A.: Anatomy of a High-Performance Matrix Multiplication. Accepted for publication in ACM Transactions on Mathematical Software 34 (3), 2008. Preprint: http://www.cs.utexas.edu/users/flame/pubs/openflame.pdf
- GotoBLAS, Texas Advanced Computing Center, http://www.tacc.utexas.edu/resources/software/
- Gunnels, J. A., Gustavson, F. G., Pingali, K., and Yotov, K.: Is cache-oblivious DGEMM viable? Proceedings of PARA 2006, LNCS 4699, pp. 919--928, 2007.
- Gustavson, F. G.: Recursion leads to automatic variable blocking for dense linear-algebra algorithms. IBM Journal of Research and Development 41 (6), 1997.
- Kurzak, J. and Dongarra, J.: Implementation of the Mixed-Precision High Performance LINPACK Benchmark on the Cell Processor. LAPACK Working Note 177, 2006.
- Intel Math Kernel Library, 2007. http://intel.com/cd/software/products/asmo-na/eng/perflib/mkl/
- TifaMMy (TifaMMy isn't the fastest Matrix Multiplication, yet), http://tifammy.sourceforge.net, version 1.3.2.
- Whaley, R. C., Petitet, A., and Dongarra, J. J.: Automated empirical optimization of software and the ATLAS project. Parallel Computing 27 (1--2), 2001.
Index Terms
- Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms