skip to main content
10.1145/859618.859663acmconferencesArticle/Chapter ViewAbstractPublication PagesiscaConference Proceedingsconference-collections
Article

Guided region prefetching: a cooperative hardware/software approach

Published:01 May 2003Publication History

ABSTRACT

Despite large caches, main-memory access latencies still cause significant performance losses in many applications. Numerous hardware and software prefetching schemes have been proposed to tolerate these latencies. Software prefetching typically provides better prefetch accuracy than hardware, but is limited by prefetch instruction overheads and the compiler's limited ability to schedule prefetches sufficiently far in advance to cover level-two cache miss latencies. Hardware prefetching can be effective at hiding these large latencies, but generates many useless prefetches and consumes considerable memory bandwidth. In this paper, we propose a cooperative hardware-software prefetching scheme called Guided Region Prefetching (GRP), which uses compiler-generated hints encoded in load instructions to regulate an aggressive hardware prefetching engine. We compare GRP against a sophisticated pure hardware stride prefetcher and a scheduled region prefetching (SRP) engine. SRP and GRP show the best performance, with respective 22% and 21% gains over no prefetching, but SRP incurs 180% extra memory traffic---nearly tripling bandwidth requirements. GRP achieves performance close to SRP, but with a mere eighth of the extra prefetching traffic, a 23% increase over no prefetching. The GRP hardware-software collaboration thus combines the accuracy of compilerbased program analysis with the performance potential of aggressive hardware prefetching, bringing the performance gap versus a perfect L2 cache under 20%.

References

  1. T. Alexander and G. Kedem. Distributed predictive cache design for high performance memory system. In Second International Symposium on High Performance Computer Architecture, Feb. 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. M. Annavaram, J. M. Patel, and E. S. Davidson. Data prefetching by dependence graph precomputation. In Proceedings of the 28th Annual International Symposium on Computer Architecture, pages 52--61, 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Architecture and Language Implementation Group, University of Massachusetts, Amherst. Scale compiler infrastructure. In http://aliwww. cs.umass.edu/Scale.]]Google ScholarGoogle Scholar
  4. D. Burger and T. M. Austin. The simplescalar tool set version 2.0. Technical Report 1342, Computer Sciences Department, University of Wisconsin, June 1997.]]Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. B. Cahoon and K. S. McKinley. Data flow analysis for software prefetching linked data structures in Java. In Proceedings of the 2001 International Conference on Parallel Architectures and Compiler Techniques, pages 280--291, Barcelona, Spain, Sept. 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. B. Cahoon and K. S. McKinley. Simple and effective array prefetching for Java. In ACM Java Grande, pages 86--95, Seattle, WA, Nov. 2002.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. D. Callahan, K. Kennedy, and A. Porterfield. Software prefetching. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 40--52, Santa Clara, CA, Apr. 1991.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. S. Carr, K. S. McKinley, and C. Tseng. Compiler optimizations for improving data locality. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 252--262, San Jose, CA, Oct. 1994.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. M. Charney and A. Reeves. Generalized correlation-based hardware prefetching. Technical Report EE CEG 95-1, Cornell University, Feb. 1995.]]Google ScholarGoogle Scholar
  10. T. Chen and J. Baer. Reducing memory latency via non-blocking and prefetching caches. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 51--61, Boston, MA, Oct. 1992.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. T. Chen and J. Baer. Effective hardware based data prefetching. IEEE Transactions on Computers, 44(5):609--623, May 1995.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. T.-F. Chen. An effective programmable prefetch engine for high-performance processors. In Proceedings of the 29th International Symposium on Microarchitecture, Ann Arbor, Michigan, Nov. 1995.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. J. D. Collins, H. Wang, D. M. Tullsen, C. Hughes, Y.-F. Lee, D. Lavery, and J. P. Shen. Speculative precomputation: Long-range prefetching of delinquent loads. In Proceedings of the 28th International Symposium on Computer Architecture, pages 14--25, June 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. R. Cooksey, S. Jordan, and D. Grunwald. A stateless, content-directed data prefetching mechanism. In Proceedings of the Tenth Annual International Conference on Architectural Support for Programming Languages and Operating Systems, pages 279--290, San Jose, CA, October 2002.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. F. Dahlgren, M. Dubois, and P. Stenstrom. Fixed and adaptive sequential prefetching in shared-memory multiprocessors. In Proceedings of the 1993 International Conference on Parallel Processing, pages 56--63, St Charles, IL, 1993.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. F. Dahlgren and P. Stenstrom. Effectiveness of hardware-based stride and sequential prefetching in shared-memory multiprocessors. In First International Symposium on High Performance Computer Architecture, pages 68--77, Raleigh, NC, Jan. 1995.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. S. Ghosh, M. Martonosi, and S. Malik. Precise miss analysis for program transformations with caches of arbitrary associativity. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 228--239, San Jose, CA, Oct. 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. E. H. Gornish and A. V. Veidenbaum. An integrated hardware/software scheme for shared-memory multiprocessors. In Proceedings of the 1994 International Conference on Parallel Processing, pages 281--284, St Charles, IL, 1994.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. C. J. Hughes and S. Adve. Memory-side prefetching for linked data structures. Technical Report UIUCDCS-R-2001-2221, University of Illinios, Urbana Champagne, May 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. J. Huh, D. Burger, and S. W. Keckler. Exploring the design space of future CMPs. In Proceedings of the 2001 International Conference on Parallel A rchitectures and Compilation Techniques, pages 199--210, Sep 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. D. Joseph and D. Grunwald. Prefetching using Markov predictors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 252--263, 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the 17th International Symposium on Computer Architecture, pages 364--373, Seattle, WA, June 1990.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. M. Karlsson, F. Dahlgren, and P. Sternstrom. A prefetching technique for irregular accesses to linked data structures. In Sixth International Symposium on High Performance Computer Architecture, page 206, Toulouse, France, Jan. 2000.]]Google ScholarGoogle Scholar
  24. D. Kim and D. Yeung. Design and evaluation of compiler algorithms for pre-execution. In Proceedings of the Tenth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 159--170, San Jose, CA, Oct. 2002.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. A. C. Klaiber and H. M. Levy. An architecture for software-controlled data prefetching. In Proceedings of the 18th International Symposium on Computer Architecture, pages 43--53, Toronto, Canada, May 1991.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. A.-C. Lai, C. Fide, and B. Falsafi. Dead-block prediction and deadblock correlating prefetchers. In Proceedings of the 28th International Symposium on Computer Architecture, July 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. K.-F. Lee, H.-W. Hon, and R. Reddy. An overview of the SPHINX speech recoginition system. In IEEE Transactions on Acoustics, Speech and Signal Precessing, volume 38(1), pages 35--44, 1990.]]Google ScholarGoogle ScholarCross RefCross Ref
  28. W.-F. Lin, S. K. Reinhardt, and D. Burger. Reducing DRAM latencies with an integrated memory hierarchy design. In Proceedings of the 7th International Symposium on High Performance Computer Architecture, pages 301--312, Jan 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. M. H. Lipasti, W. J. Schmidt, S. R. Kunkel, and R. R. Roediger. SPAID: Software prefetching in pointer- and call-intensive environments. In Proceedings of the 28th Annual IEEE/ACM International Symposium on Microachitecture, pages 231--236, Nov. 1995.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. C. Luk and T. C. Mowry. Compiler-based prefetching for recursive data structures. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 222--233, Cambridge, MA, Oct. 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. C.-K. Luk. Tolerating memory latency through software-controlled pre-execution on simultaneous multithreading processors. In Proceedings of the 28th International Symposium on Computer Architecture, pages 40--51, June 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. K. S. McKinley, S. Carr, and C. Tseng. Improving data locality with loop transformations. ACM Transactions on Programming Languages and Systems, 18(4):424--453, July 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. T. Mowry, M. S. Lam, and A. Gupta. Design and evaluation of a compiler algorithm for prefetching. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 62--73, Boston, MA, Oct. 1992.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. S. Palacharla and R. E. Kessler. Evaluating stream buffers as a secondary cache replacement. In Proceedings of the 21th International Symposium on Computer Architecture, pages 24--33, Chicago, IL, Apr. 1994.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. A. Roth, A. Moshovos, and G. S. Sohi. Dependence based prefetching for linked data structures. In Proceeding of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 115--126, 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. A. Roth and G. Sohi. Effective jump-pointer prefetching for linked data structures. In Proceedings of the 26th International Symposium on Computer Architecture, pages 111--121, Atlanta, GA, May 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. T. Sherwood, E. Perelman, and B. Calder. Basic block distribution analysis to find periodic behavior and simulation points in applications. In Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques, pages 3--14, Barcelona, Spain, Sept. 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. T. Sherwood, S. Sair, and B. Calder. Predictor-directed stream buffers. In Proceedings of the 33rd International Symposium on Microarchitecture, pages 42--53, Monterey, California, Dec. 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. J. Skeppstedt and M. Dubois. Hybrid compiler/hardware prefetching for multiprocessors using low-overhead cache miss traps. In Proceedings of the 1997 International Conference on Parallel Processing, pages 298--307, Bloomington, IL, Aug. 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. A. J. Smith. Cache memories. Computing Surveys, 14(3):473--530, Sept. 1982.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Y. Solihin, J. Lee, and J. Torrellas. Using a user-level memory thread for correlation prefetching. In Proceedings of the 29th International Symposium on Computer Architecture, pages 171--182, May 2002.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. S. T. Srinivasan and A. R. Lebeck. Load latency tolerance in dynamically scheduled processors. Journal of Instruction Level Parallelism, 1:1--24, 1999.]]Google ScholarGoogle Scholar
  43. S. P. Vanderwiel and D. J. Lijia. A compiler-assisted data prefetch controller. In Proceedings of International Conference on Computer Design, Austin, TX, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. M. E. Wolf and M. Lam. A data locality optimizing algorithm. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation, pages 30--44, Toronto, Canada, June 1991.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Y. Wu. Efficient discovery of regular stride patterns in irregular porgrams and its use in compiler prefetching. In Proceedings of the SIGPLAN 2002 Conference on Programming Language Design and Implementation, pages 210--221, Berlin, Germany, June 2002.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. C.-L. Yang and A. R. Lebeck. Push vs. pull: Data movement for linked data structures. In Proceedings of the 1997 ACM International Conference on Supercomputing, pages 176--186, May 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Z. Zhang and T. Torrellas. Speeding up irregular applications in shared memory multiprocessors: Memory binding and group prefetching. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 1--19, Santa Margherita Ligure, Italy, June 1995.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  1. Guided region prefetching: a cooperative hardware/software approach

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        ISCA '03: Proceedings of the 30th annual international symposium on Computer architecture
        June 2003
        432 pages
        ISBN:0769519458
        DOI:10.1145/859618
        • Conference Chair:
        • Allan Gottlieb,
        • Program Chair:
        • Kai Li
        • cover image ACM SIGARCH Computer Architecture News
          ACM SIGARCH Computer Architecture News  Volume 31, Issue 2
          ISCA 2003
          May 2003
          422 pages
          ISSN:0163-5964
          DOI:10.1145/871656
          Issue’s Table of Contents

        Copyright © 2003 Authors

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 1 May 2003

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • Article

        Acceptance Rates

        ISCA '03 Paper Acceptance Rate36of184submissions,20%Overall Acceptance Rate543of3,203submissions,17%

        Upcoming Conference

        ISCA '24

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader