Abstract
Modern machines present two challenges to algorithm engineers and compiler writers: they have a superscalar, super-pipelined structure, and they have elaborate memory subsystems specifically designed to reduce latency and increase bandwidth. Matrix multiplication is a classical benchmark for experimenting with techniques that exploit machine architecture and overcome the limitations of contemporary memory subsystems.

This research aims to advance the state of the art of algorithm engineering by balancing instruction-level parallelism, two levels of data tiling, copying that provably avoids cache conflicts, and prefetching performed in parallel with computational operations, so as to fully exploit the memory bandwidth. Measurements on IBM's RS/6000 43P workstation show that the resulting matrix multiplication algorithm outperforms IBM's ESSL by 6.8-31.8%, is less sensitive to the size of the input data, and scales better.

In this paper we introduce a cache-aware algorithm for matrix multiplication. We also suggest generic guidelines that may be applied to compute-intensive algorithms to use the data cache efficiently. We believe that some of our concepts may be embodied in compilers.
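To make the tiling-plus-copying idea concrete, here is a minimal sketch of a cache-aware matrix multiply in C. It is not the authors' algorithm: it shows only one level of tiling (the paper uses two) and omits prefetching and instruction scheduling. The names `matmul_tiled`, `N`, and `TB` are illustrative choices, and `TB` stands in for a tile size tuned to the L1 cache; the copy of each B tile into a contiguous buffer is the "copying to avoid cache conflicts" step.

```c
#include <assert.h>

#define N  64   /* matrix dimension (illustrative) */
#define TB 16   /* tile edge; in practice chosen so a tile fits in L1 */

/* Cache-aware tiled multiply: C += A * B. The current B tile is
   copied ("packed") into a small contiguous buffer, so its rows
   cannot conflict-map onto one another in a low-associativity cache. */
static void matmul_tiled(const double A[N][N], const double B[N][N],
                         double C[N][N])
{
    double Bbuf[TB][TB];               /* contiguous copy of one B tile */

    for (int jj = 0; jj < N; jj += TB)
        for (int kk = 0; kk < N; kk += TB) {
            /* copy step: pack the B tile once, reuse it N times */
            for (int k = 0; k < TB; k++)
                for (int j = 0; j < TB; j++)
                    Bbuf[k][j] = B[kk + k][jj + j];

            /* compute step: stream rows of A against the packed tile */
            for (int i = 0; i < N; i++)
                for (int k = 0; k < TB; k++) {
                    double a = A[i][kk + k];
                    for (int j = 0; j < TB; j++)
                        C[i][jj + j] += a * Bbuf[k][j];
                }
        }
}
```

The design point illustrated is that the packing cost is O(N^2) per pass while the reuse it enables is O(N^3), which is why copying pays off for large matrices.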