skip to main content
article
Free Access

Data prefetch mechanisms

Published:01 June 2000Publication History
Skip Abstract Section

Abstract

The expanding gap between microprocessor and DRAM performance has necessitated the use of increasingly aggressive techniques designed to reduce or hide the latency of main memory access. Although large cache hierarchies have proven to be effective in reducing this latency for the most frequently used data, it is still not uncommon for many programs to spend more than half their run times stalled on memory requests. Data prefetching has been proposed as a technique for hiding the access latency of data referencing patterns that defeat caching strategies. Rather than waiting for a cache miss to initiate a memory fetch, data prefetching anticipates such misses and issues a fetch to the memory system in advance of the actual memory reference. To be effective, prefetching must be implemented in such a way that prefetches are timely, useful, and introduce little overhead. Secondary effects such as cache pollution and increased memory bandwidth requirements must also be taken into consideration. Despite these obstacles, prefetching has the potential to significantly improve overall program execution time by overlapping computation with memory accesses. Prefetching strategies are diverse, and no single strategy has yet been proposed that provides optimal performance. The following survey examines several alternative approaches, and discusses the design tradeoffs involved when implementing a data prefetch strategy.

References

  1. ALEXANDER,T.AND KEDEM, G. 1996. Distributed prefetch-buffer/cache design for high performance memory systems. In Proceedings of 2nd IEEE Symposium on High-Performance Computer Architecture. IEEE Press, Piscataway, NJ, 254-263.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. ANACKER,W.AND WANG, C. P. 1967. Performance evaluation of computing systems with memory hierarchies. IEEE Trans. Comput. 16,6, 764-773.]]Google ScholarGoogle ScholarCross RefCross Ref
  3. BAER, J.-L. AND CHEN, T.-F. 1991. An effective on-chip preloading scheme to reduce data access penalty. In Proceedings of the 1991 Conference on Supercomputing (Albuquerque, NM, Nov. 18-22), J. L. Martin, Chair. ACM Press, New York, NY, 176-186.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. BERNSTEIN, D., COHEN, D., AND FREUND, A. 1995. Compiler techniques for data prefetching on the PowerPC. In Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques (PACT '95, Limassol, Cyprus, June 27-29), L. Bic, P. Evripidou, W. B~hm, and J.-L. Gaudiot, Chairs. IFIP Working Group on Algol, Manchester, UK, 19-26.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. BURGER, D., GOODMAN,J.R.,AND KGI, A. 1997. Limited bandwidth to affect processor design. IEEE Micro 17, 6, 55-62.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. CALLAHAN, D., KENNEDY, K., AND PORTERFIELD,A. 1991. Software prefetching. SIGARCH Comput. Arch. News 19, 2 (Apr.), 40-52.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. CASMIRA,J.P.AND KAELI, D. R. 1995. Modeling cache pollution. In Proceedings of the Second IASTED Conference on Modeling and Simulation. 123-126.]]Google ScholarGoogle Scholar
  8. CHAN, K. K. 1996. Design of the HP PA 7200 CPU. Hewlett-Packard J. 47, 1, 25-33.]]Google ScholarGoogle Scholar
  9. CHANG,P.Y.,KAELI, D., AND LIU, Y. 1994. Branchdirected data cache prefetching. In Proceedings of Second Workshop on Shared-Memory Multiprocessing Systems.]]Google ScholarGoogle Scholar
  10. CHEN, T.-F. 1995. An effective programmable prefetch engine for on-chip caches. In Proceedings of the 28th Annual International Symposium on Microarchitecture (Ann Arbor, MI, Nov. 29 - Dec. 1), T. Mudge and K. Ebciogelu, Chairs. IEEE Computer Society Press, Los Alamitos, CA, 237-242.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. CHEN, T.-F. AND BAER, J. L. 1994. A performance study of software and hardware data prefetching schemes. In Proceedings of 21st International Symposium on Computer Architecture (Chicago, IL, Apr.). 223-232.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. CHEN, T.-F. AND BEAR, J. L. 1995. Effective hardware-based data prefetching for high-performance processors. IEEE Trans. Comput. 44,5 (May), 609-623.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. CHEN,W.Y.,MAHLKE,S.A.,CHANG,P.P.,AND HWU, W.-M. W. 1991. Data access microarchitectures for superscalar processors with compiler- assisted data prefetching. In Proceedings of the 24th Annual International Symposium on Microarchitecture (MICRO 24, Albuquerque, NM, Nov. 18-20), Y. K. Malaiya, Chair. ACM Press, New York, NY, 69-73.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. DAHLGREN,F.AND STENSTROM, P. 1995. Effectiveness of hardware-based stride and sequential prefetching in shared-memory multiprocessors. In Proceedings of 1st IEEE Symposium on High-Performance Computer Architecture (Raleigh, NC, Jan.). IEEE Press, Piscataway, NJ, 68-77.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. DAHLGREN, F., DUBOIS, M., AND STENSTROM,P. 1993. Fixed and adaptive sequential prefetching in shared-memory multiprocessors. In Proceedings of the International Conference on Parallel Processing (St. Charles, IL). 56-63.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. FU, B., SAINI, A., AND GELSINGER, P. P. 1989. Performance and microarchitecture of the i486 processor. In Proceedings of the IEEE International Conference on Computer Design (Cambridge, MA). 182-187.]]Google ScholarGoogle ScholarCross RefCross Ref
  17. FU,J.W.C.AND PATEL, J. H. 1991. Data prefetching in multiprocessor vector cache memories. In Proceedings of 18th International Symposium on Computer Architecture (Toronto, Ont., Canada). 54-63.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. FU,J.W.C.,PATEL,J.H.,AND JANSSENS,B.L. 1992. Stride directed prefetching in scalar processors. In Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO 25, Portland, OR, Dec. 1-4), W.-m. Hwu, Chair. IEEE Computer Society Press, Los Alamitos, CA, 102-110.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. GORNISH,E.H.AND VEIDENBAUM, A. V. 1994. An integrated hardware/software scheme for shared-memory multiprocessors. In Proceedings of International Conference on Parallel Processing (St. Charles, IL). 281-284.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. GORNISH,E.H.,GRANSTON,E.D.,AND GRANSTON, A. V. 1990. Compiler-directed data prefetching in multiprocessors with memory hierarchies. In Proceedings of the 1990 ACM International Conference on Supercomputing (ICS '90, Amsterdam, The Netherlands, June 11-15), A. Sameh and H. van der Vorst, Chairs. ACM Press, New York, NY, 354-368.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. HARRISON,L.AND MEHROTRA, S. 1994. A data prefetch mechanism for accelerating general computation. 1351. University of Illinois at Urbana-Champaign, Champaign, IL.]]Google ScholarGoogle Scholar
  22. JOSEPH,D.AND GRUNWALD, D. 1997. Prefetching using Markov predictors. In Proceedings of the 24th International Symposium on Computer Architecture (ISCA '97, Denver, CO, June 2-4), A. R. Pleszkun and T. Mudge, Chairs. ACM Press, New York, NY, 252-263.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. JOUPPI, N. 1990. Improving direct-mapped cache performance by the addition of a small fullyassociative cache and prefetch buffers. In Proceedings of the 17th International Symposium on Computer Architecture (ISCA '90, Seattle, WA, May). IEEE Press, Piscataway, NJ, 364- 373.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. KLAIBER,A.C.AND LEVY, H. M. 1991. An architecture for software-controlled data prefetching. SIGARCH Comput. Arch. News 19, 3 (May), 43-53.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. KROFT, D. 1981. Lockup-free instruction fetch/ prefetch cache organization. In Proceedings of the 8th International Symposium on Computer Architecture (Minneapolis, MN, June). ACM Press, New York, NY, 81-85.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. LILJA, D. J. 1993. Cache coherence in large-scale shared-memory multiprocessors: Issues and comparisons. ACM Comput. Surv. 25,3 (Sept.), 303-338.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. LIPASTI,M.H.,SCHMIDT,W.J.,KUNKEL,S.R.,AND ROEDIGER, R. R. 1995. SPAID: Software prefetching in pointer- and call-intensive environments. In Proceedings of the 28th Annual International Symposium on Microarchitecture (Ann Arbor, MI, Nov. 29-Dec. 1), T. Mudge and K. Ebciogclu, Chairs. IEEE Com-puter Society Press, Los Alamitos, CA, 231- 236.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. LUI,Y.AND KAELI, D. R. 1996. Branch-directed and stride-based data cache prefetching. In Proceedings of International Conference on Computer Design (ICCD'96, Austin, TX). IEEE Computer Society Press, Los Alamitos, CA, 255-230.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. LUK, C.-K. AND MOWRY, T. C. 1996. Compilerbased prefetching for recursive data structures. ACM SIGOPS Oper. Syst. Rev. 30, 5, 222-233.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. MOWRY,T.AND GUPTA, A. 1991. Tolerating latency through software-controlled prefetching in shared-memory multiprocessors. J. Parallel Distrib. Comput. 12, 2 (June), 87-106.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. MOWRY,T.C.,LAM,M.S.,AND GUPTA, A. 1992. Design and evaluation of a compiler algorithm for prefetching. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-V, Boston, MA, Oct. 12-15), S. Eggers, Chair. ACM Press, New York, NY, 62-73.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. OBERLIN, S., KESSLER, R., SCOTT, S., AND THORSON, G. 1996. Cray T3E architecture overview. Cray Supercomputers, Chippewa Falls, MN.]]Google ScholarGoogle Scholar
  33. PALACHARLA,S.AND KESSLER, R. E. 1994. Evaluating stream buffers as a secondary cache replacement. In Proceedings of 21st International Symposium on Computer Architecture (Chicago, IL, Apr.).]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. PATTERSON,R.H.AND GIBSON, G. A. 1994. Exposing I/O concurrency with informed prefetching. In Proceedings of the Third International Conference on Parallel and Distributed Information Systems (Austin, TX). 7-16.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. PORTERFIELD, A. K. 1989. Software methods for improvement of cache performance on supercomputer applications. Ph.D thesis,. Ph.D. Dissertation. Rice University, Houston, TX.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. PRZYBYLSKI, S. 1990. The performance impact of block sizes and fetch strategies. In Proceedings of the 17th International Symposium on Computer Architecture (Seattle, WA). 160- 169.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. ROTH, A., MOSHOVOS, A., AND SOHI, G. 1998. Dependance based prefetching for linked data structures. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII, San Jose, CA, Oct. 3-7), D. Bhandarkar and A. Agarwal, Chairs. ACM Press, New York, NY.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. SANTHANAM, V., GORNISH,E.H.,AND HSU, W.-C. 1997. Data prefetching on the HP PA-8000. In Proceedings of the 24th International Symposium on Computer Architecture (ISCA '97, Denver, CO, June 2-4), A. R. Pleszkun and T. Mudge, Chairs. ACM Press, New York, NY, 264-273.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. SKLENAR, I. 1992. Prefetch unit for vector operations on scalar computers. In Proceedings of the 19th International Symposium on Computer Architecture (Gold Coast, Qld., Australia). 31-37.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. SMITH, A. J. 1978. Sequential program prefetching in memory hierarchies. IEEE Computer 11, 12, 7-21.]]Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. SMITH, A. J. 1982. Cache memories. ACM Comput. Surv. 14, 3 (Sept.), 473-530.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. SMITH, J. E. 1981. A study of branch prediction techniques. In Proceedings of the 8th International Symposium on Computer Architecture (Minneapolis, MN, June). ACM Press, New York, NY, 135-147.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. VANDERWIEL,S.P.AND LILJA, D. J. 1999. A compiler-assisted data prefetch controller. In Proceedings of International Conference on Computer Design (ICCD '99, Austin TX).]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. VANDERWIEL,S.P.,HSU,W.C.,AND LILJA,D.J. 1997. When caches are not enough: Data prefetching techniques. IEEE Computer 30,7, 23-27.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. YEAGER, K. C. 1996. The MIPS R10000 superscalar microprocessor. IEEE Micro 16, 2 (Apr.), 28-40.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. YOUNG,H.C.AND SHEKITA, E. J. 1993. An intelligent I-cache prefetch mechanism. In Proceedings of the International Conference on Computer Design (ICCD'93, Cambridge, MA). IEEE Computer Society Press, Los Alamitos, CA, 44-49.]]Google ScholarGoogle ScholarCross RefCross Ref
  47. ZHANG,Z.AND TORRELLAS, J. 1995. Speeding up irregular applications in shared-memory multiprocessors. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA '95, Santa Margherita Ligure, Italy, June 22-24), D. A. Patterson, Chair. ACM Press, New York, NY.]] Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Data prefetch mechanisms

    Recommendations

    Reviews

    Andrea Prati

    The growing gap between microprocessor and memory performance has elected cache prefetching as a key research issue in recent years. Vanderwiel and Lilja present in this paper a survey on data prefetch mechanisms both for single processor and for multiprocessor architectures. The first part of this work give the necessary background to who wants to study cache prefetching. For its wholeness and clear explanation, this part is strongly recommended to the researchers that begin their work on cache prefetching. The introduction outlines the ideas underlying prefetching methods as well as the drawbacks of a incorrect prefetching policy (cache pollution, unnecessary prefetches which lead to a more demanding bandwidth, and so on). The second part of the paper describes the existing solutions to an efficient and effective prefetch mechanism. Single processor architecture is first considered and the proposed approaches are classified into software and hardware data prefetching. Software data prefetching is mainly based on the manual or pre-compiler driven inclusion of special instructions that force the cache to fetch a given datum at a given time. This approach has the drawbacks of the correct prefetch scheduling, the increase of register pressure and the degradation of the instruction cache performance (due to inclusion of many new "fetch" instructions). Hardware data prefetching is subdivided in sequential prefetching (well-known as One Block Lookahead) and its variants, and prefetching with arbitrary strides. A clear and understandable overview of this approach is depicted. The solutions proposed in literature that combine the advantages of both these approaches are reported too. Finally, the additional problems implied in multiprocessor architecture are listed and discussed. In conclusion, this paper is very useful, well-structured and clear. It covers most of the topics on cache prefetching and the conclusions give a helpful suggestion of "when, where and what" must be prefetched into the cache. Nevertheless, some recent researches are not included in this survey, mainly because of their later publication. The advantage of using different prefetching method depending of the type of locality exhibited [1,2] or the improvement to strong data-demanding applications (as the multimedia applications) using specialized prefetching methods [3,4] are not mentioned and probably they were not in the aim of this work. Finally, a discussion on the metrics suitable for cache performance evaluation both in terms of cache misses reduction and in terms of time saved lacks. [1] Milutinovic, V. and Tomasevic, M. and Markovi, B. and Tremblay, M. "A new cache architecture concept: the split temporal/spatial cache", in Proceedings of 8th Mediterranean Electrotechnical Conference, 1996. MELECON '96. Vol. 2, pp. 1108-1111. [2] Gonzales, A. and Aliagas, C. and Valero, M. "A data cache with multiple caching strategies tuned to different types of locality", in Proceedings of the 9th ACM International Conference on Supercomputing, pp. 338-347, 1995. [3] Zucker, D.F. and Lee, R.B. and Flynn, M.J. "Hardware and software cache prefetching techniques for MPEG benchmarks". IEEE Transaction on Circuits and Systems for Video Technology, vol. 10, No. 5 (Aug. 2000), pp. 782-796. [4] Cucchiara, R. and Piccardi, M. and Prati, A. "Temporal analysis of cache prefetching strategies for multimedia applications", in Proceedings of 20th IEEE International Performance, Computing e Communications Conference (IPCCC 2001), Aprile 2001, pp. 311-318.

    Access critical reviews of Computing literature here

    Become a reviewer for Computing Reviews.

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image ACM Computing Surveys
      ACM Computing Surveys  Volume 32, Issue 2
      June 2000
      91 pages
      ISSN:0360-0300
      EISSN:1557-7341
      DOI:10.1145/358923
      Issue’s Table of Contents

      Copyright © 2000 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 1 June 2000
      Published in csur Volume 32, Issue 2

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • article

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader