skip to main content
article
Free Access

Matrix multiplication: a case study of enhanced data cache utilization

Published:31 December 1999Publication History
Skip Abstract Section

Abstract

Modern machines present two challenges to algorithm engineers and compiler writers: They have superscalar, super-pipelined structure, and they have elaborate memory subsystems specifically designed to reduce latency and increase bandwidth. Matrix multiplication is a classical benchmark for experimenting with techniques used to exploit machine architecture and to overcome the limitations of contemporary memory subsystems.This research aims at advancing the state of the art of algorithm engineering by balancing instruction level parallelism, two levels of data tiling, copying to provably avoid any cache conflicts, and prefetching in parallel to computational operations, in order to fully exploit the memory bandwidth. Measurements on IBM's RS/6000 43P workstation show that the resultant matrix multiplication algorithm outperforms IBM's ESSL by 6.8-31.8%, is less sensitive to the size of the input data, and scales better.In this paper we introduce a cache aware algorithm for matrix multiplication. We also suggest generic guidelines that may be applied to compute intensive algorithm to efficiently utilize the data cache. We believe that some of our concepts may be embodied in compilers.

Skip Supplemental Material Section

Supplemental Material

References

  1. AGARWAL, R. C., GUSTAVSON, F. G., AND ZUBAIR, M. 1994. Improving performance of linear algebra algorithms for dense matrices, using algorithmic prefetch. IBM J. of Research and Development 38, 3, 265-275.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. CALLAHAN, D., KENNEDY, K., AND PORTERFIELD, A. 1991a. The cache performance and optimizations of blocked algorithms. In Proceedings of ASPLOS'91 (1991), pp. 63-74.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. CALLAHAN, D., KENNEDY, K., AND PORTERFIELD, A. 1991b. Software prefetching. In Proceedings of ASPLOS'91 (1991), pp. 40-52.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Digital Equipment Corp. 1998. 21164 Alpha Microprocessor Data Sheet. Digital Equipment Corp.]]Google ScholarGoogle Scholar
  5. EIRON, N., RODEH, M., AND STEINWARTS, I. 1998. Matrix multiplication: A case study of algorithm engineering. In Proceedings of WAE'98 (August 1998), pp. 40-52. Max-Plank-Institut für Informatik.]]Google ScholarGoogle Scholar
  6. FARKAS, K. AND JOUPPI, N. 1994. Complexity/performance tradeoffs with nonblocking loads. In Proceedings of ISCA '94 (1994), pp. 211-222.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. GUSTAVSON, F.G. 1998. Personal Communication.]]Google ScholarGoogle Scholar
  8. HENNESSY, J. L. AND PATTERSON, D. A. 1996. Computer Architecture: A Quantitive Approach (Second ed.). Morgan Kaufmann Publishers Inc.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. IBM Corp. 1997. IBM Engineering and Scientific Subroutine Library for AIX, Version 3 - Guide and Reference. IBM Corp.]]Google ScholarGoogle Scholar
  10. IBM Microelectronics and Motorola Inc. 1994. PowerPC 604 RISC Microprocessor User's Manual. IBM Microelectronics and Motorola Inc.]]Google ScholarGoogle Scholar
  11. LAM, M. S. 1988. Software pipelining: An effective technique for vliw machines. In Proceedings of SIGPLAN'88 (1988), pp. 318-328.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. LEE, J. H., LEE, M. Y., CHOI, S. U., AND PARK, M. S. 1994. Reducing cache conflicts in data cache prefetching. Computer Architecture News 22, 4, 71-77.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. MOWRY, T. C. 1994. Tolerating Latency Through Software-Controlled Data Prefetching. Ph. D. thesis, Stanford University.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. MOWRY, T. C., LAM, M. S., AND GUPTA, A. 1992. Design and evaluation of a compiler algorithm for data prefetching. In Proceedings of ASPLOS'92 (1992), pp. 62-73.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. NAVARRO, J. J., GARCÍA-DIEGO, E., AND HERRERO, J. R. 1992. Data prefetching and multilevel blocking for linear algebra operations. In Proceedings of ICS'96 (1992), pp. 109-116.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. STEWART, K.E. 1994. Using the xl compiler options to improve application performance. In PowerPC and POWER2, Technical Aspects of the New IBM RISC System/6000. IBM Corp.]]Google ScholarGoogle Scholar
  17. TEMAM, 0., GRANSTON, E. D., AND JALBY, W. 1993. To copy or not to copy: A compiletime technique for assessing when data copying should be used to eliminate cache conflicts. In Proceedings of SUPERCOMPUTING'93 (1993), pp. 410-419.]] Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Matrix multiplication: a case study of enhanced data cache utilization

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image ACM Journal of Experimental Algorithmics
          ACM Journal of Experimental Algorithmics  Volume 4, Issue
          1999
          165 pages
          ISSN:1084-6654
          EISSN:1084-6654
          DOI:10.1145/347792
          Issue’s Table of Contents

          Copyright © 1999 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 31 December 1999
          Published in jea Volume 4, Issue

          Qualifiers

          • article

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader