Abstract
Modern machines present two challenges to algorithm engineers and compiler writers: they have a superscalar, super-pipelined structure, and they have elaborate memory subsystems specifically designed to reduce latency and increase bandwidth. Matrix multiplication is a classical benchmark for experimenting with techniques that exploit machine architecture and overcome the limitations of contemporary memory subsystems.

This research aims to advance the state of the art of algorithm engineering by balancing instruction-level parallelism, two levels of data tiling, copying that provably avoids cache conflicts, and prefetching performed in parallel with computational operations, so as to fully exploit the memory bandwidth. Measurements on IBM's RS/6000 43P workstation show that the resulting matrix multiplication algorithm outperforms IBM's ESSL by 6.8-31.8%, is less sensitive to the size of the input data, and scales better.

In this paper we introduce a cache-aware algorithm for matrix multiplication. We also suggest generic guidelines that may be applied to compute-intensive algorithms to use the data cache efficiently. We believe that some of our concepts may be embodied in compilers.
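To make the tiling-plus-copying idea concrete, here is a minimal sketch of a cache-aware matrix multiply in C. It is not the authors' algorithm: it shows only one level of tiling (the paper uses two) and omits prefetching and instruction scheduling. The names `matmul_tiled`, `N`, and `TB` are illustrative choices, and `TB` stands in for a tile size tuned to the L1 cache; the copy of each B tile into a contiguous buffer is the "copying to avoid cache conflicts" step.

```c
#include <assert.h>

#define N  64   /* matrix dimension (illustrative) */
#define TB 16   /* tile edge; in practice chosen so a tile fits in L1 */

/* Cache-aware tiled multiply: C += A * B. The current B tile is
   copied ("packed") into a small contiguous buffer, so its rows
   cannot conflict-map onto one another in a low-associativity cache. */
static void matmul_tiled(const double A[N][N], const double B[N][N],
                         double C[N][N])
{
    double Bbuf[TB][TB];               /* contiguous copy of one B tile */

    for (int jj = 0; jj < N; jj += TB)
        for (int kk = 0; kk < N; kk += TB) {
            /* copy step: pack the B tile once, reuse it N times */
            for (int k = 0; k < TB; k++)
                for (int j = 0; j < TB; j++)
                    Bbuf[k][j] = B[kk + k][jj + j];

            /* compute step: stream rows of A against the packed tile */
            for (int i = 0; i < N; i++)
                for (int k = 0; k < TB; k++) {
                    double a = A[i][kk + k];
                    for (int j = 0; j < TB; j++)
                        C[i][jj + j] += a * Bbuf[k][j];
                }
        }
}
```

The design point illustrated is that the packing cost is O(N^2) per pass while the reuse it enables is O(N^3), which is why copying pays off for large matrices.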