ABSTRACT
Despite large caches, main-memory access latencies still cause significant performance losses in many applications. Numerous hardware and software prefetching schemes have been proposed to tolerate these latencies. Software prefetching typically provides better prefetch accuracy than hardware, but is limited by prefetch instruction overheads and the compiler's limited ability to schedule prefetches sufficiently far in advance to cover level-two cache miss latencies. Hardware prefetching can be effective at hiding these large latencies, but generates many useless prefetches and consumes considerable memory bandwidth. In this paper, we propose a cooperative hardware-software prefetching scheme called Guided Region Prefetching (GRP), which uses compiler-generated hints encoded in load instructions to regulate an aggressive hardware prefetching engine. We compare GRP against a sophisticated pure hardware stride prefetcher and a scheduled region prefetching (SRP) engine. SRP and GRP show the best performance, with respective 22% and 21% gains over no prefetching, but SRP incurs 180% extra memory traffic---nearly tripling bandwidth requirements. GRP achieves performance close to SRP, but with a mere eighth of the extra prefetching traffic, a 23% increase over no prefetching. The GRP hardware-software collaboration thus combines the accuracy of compilerbased program analysis with the performance potential of aggressive hardware prefetching, bringing the performance gap versus a perfect L2 cache under 20%.
- T. Alexander and G. Kedem. Distributed predictive cache design for high performance memory system. In Second International Symposium on High Performance Computer Architecture, Feb. 1996.]] Google ScholarDigital Library
- M. Annavaram, J. M. Patel, and E. S. Davidson. Data prefetching by dependence graph precomputation. In Proceedings of the 28th Annual International Symposium on Computer Architecture, pages 52--61, 2001.]] Google ScholarDigital Library
- Architecture and Language Implementation Group, University of Massachusetts, Amherst. Scale compiler infrastructure. In http://aliwww. cs.umass.edu/Scale.]]Google Scholar
- D. Burger and T. M. Austin. The simplescalar tool set version 2.0. Technical Report 1342, Computer Sciences Department, University of Wisconsin, June 1997.]]Google ScholarDigital Library
- B. Cahoon and K. S. McKinley. Data flow analysis for software prefetching linked data structures in Java. In Proceedings of the 2001 International Conference on Parallel Architectures and Compiler Techniques, pages 280--291, Barcelona, Spain, Sept. 2001.]] Google ScholarDigital Library
- B. Cahoon and K. S. McKinley. Simple and effective array prefetching for Java. In ACM Java Grande, pages 86--95, Seattle, WA, Nov. 2002.]] Google ScholarDigital Library
- D. Callahan, K. Kennedy, and A. Porterfield. Software prefetching. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 40--52, Santa Clara, CA, Apr. 1991.]] Google ScholarDigital Library
- S. Carr, K. S. McKinley, and C. Tseng. Compiler optimizations for improving data locality. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 252--262, San Jose, CA, Oct. 1994.]] Google ScholarDigital Library
- M. Charney and A. Reeves. Generalized correlation-based hardware prefetching. Technical Report EE CEG 95-1, Cornell University, Feb. 1995.]]Google Scholar
- T. Chen and J. Baer. Reducing memory latency via non-blocking and prefetching caches. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 51--61, Boston, MA, Oct. 1992.]] Google ScholarDigital Library
- T. Chen and J. Baer. Effective hardware based data prefetching. IEEE Transactions on Computers, 44(5):609--623, May 1995.]] Google ScholarDigital Library
- T.-F. Chen. An effective programmable prefetch engine for high-performance processors. In Proceedings of the 29th International Symposium on Microarchitecture, Ann Arbor, Michigan, Nov. 1995.]] Google ScholarDigital Library
- J. D. Collins, H. Wang, D. M. Tullsen, C. Hughes, Y.-F. Lee, D. Lavery, and J. P. Shen. Speculative precomputation: Long-range prefetching of delinquent loads. In Proceedings of the 28th International Symposium on Computer Architecture, pages 14--25, June 2001.]] Google ScholarDigital Library
- R. Cooksey, S. Jordan, and D. Grunwald. A stateless, content-directed data prefetching mechanism. In Proceedings of the Tenth Annual International Conference on Architectural Support for Programming Languages and Operating Systems, pages 279--290, San Jose, CA, October 2002.]] Google ScholarDigital Library
- F. Dahlgren, M. Dubois, and P. Stenstrom. Fixed and adaptive sequential prefetching in shared-memory multiprocessors. In Proceedings of the 1993 International Conference on Parallel Processing, pages 56--63, St Charles, IL, 1993.]] Google ScholarDigital Library
- F. Dahlgren and P. Stenstrom. Effectiveness of hardware-based stride and sequential prefetching in shared-memory multiprocessors. In First International Symposium on High Performance Computer Architecture, pages 68--77, Raleigh, NC, Jan. 1995.]] Google ScholarDigital Library
- S. Ghosh, M. Martonosi, and S. Malik. Precise miss analysis for program transformations with caches of arbitrary associativity. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 228--239, San Jose, CA, Oct. 1998.]] Google ScholarDigital Library
- E. H. Gornish and A. V. Veidenbaum. An integrated hardware/software scheme for shared-memory multiprocessors. In Proceedings of the 1994 International Conference on Parallel Processing, pages 281--284, St Charles, IL, 1994.]] Google ScholarDigital Library
- C. J. Hughes and S. Adve. Memory-side prefetching for linked data structures. Technical Report UIUCDCS-R-2001-2221, University of Illinios, Urbana Champagne, May 2001.]] Google ScholarDigital Library
- J. Huh, D. Burger, and S. W. Keckler. Exploring the design space of future CMPs. In Proceedings of the 2001 International Conference on Parallel A rchitectures and Compilation Techniques, pages 199--210, Sep 2001.]] Google ScholarDigital Library
- D. Joseph and D. Grunwald. Prefetching using Markov predictors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 252--263, 1997.]] Google ScholarDigital Library
- N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the 17th International Symposium on Computer Architecture, pages 364--373, Seattle, WA, June 1990.]] Google ScholarDigital Library
- M. Karlsson, F. Dahlgren, and P. Sternstrom. A prefetching technique for irregular accesses to linked data structures. In Sixth International Symposium on High Performance Computer Architecture, page 206, Toulouse, France, Jan. 2000.]]Google Scholar
- D. Kim and D. Yeung. Design and evaluation of compiler algorithms for pre-execution. In Proceedings of the Tenth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 159--170, San Jose, CA, Oct. 2002.]] Google ScholarDigital Library
- A. C. Klaiber and H. M. Levy. An architecture for software-controlled data prefetching. In Proceedings of the 18th International Symposium on Computer Architecture, pages 43--53, Toronto, Canada, May 1991.]] Google ScholarDigital Library
- A.-C. Lai, C. Fide, and B. Falsafi. Dead-block prediction and deadblock correlating prefetchers. In Proceedings of the 28th International Symposium on Computer Architecture, July 2001.]] Google ScholarDigital Library
- K.-F. Lee, H.-W. Hon, and R. Reddy. An overview of the SPHINX speech recoginition system. In IEEE Transactions on Acoustics, Speech and Signal Precessing, volume 38(1), pages 35--44, 1990.]]Google ScholarCross Ref
- W.-F. Lin, S. K. Reinhardt, and D. Burger. Reducing DRAM latencies with an integrated memory hierarchy design. In Proceedings of the 7th International Symposium on High Performance Computer Architecture, pages 301--312, Jan 2001.]] Google ScholarDigital Library
- M. H. Lipasti, W. J. Schmidt, S. R. Kunkel, and R. R. Roediger. SPAID: Software prefetching in pointer- and call-intensive environments. In Proceedings of the 28th Annual IEEE/ACM International Symposium on Microachitecture, pages 231--236, Nov. 1995.]] Google ScholarDigital Library
- C. Luk and T. C. Mowry. Compiler-based prefetching for recursive data structures. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 222--233, Cambridge, MA, Oct. 1996.]] Google ScholarDigital Library
- C.-K. Luk. Tolerating memory latency through software-controlled pre-execution on simultaneous multithreading processors. In Proceedings of the 28th International Symposium on Computer Architecture, pages 40--51, June 2001.]] Google ScholarDigital Library
- K. S. McKinley, S. Carr, and C. Tseng. Improving data locality with loop transformations. ACM Transactions on Programming Languages and Systems, 18(4):424--453, July 1996.]] Google ScholarDigital Library
- T. Mowry, M. S. Lam, and A. Gupta. Design and evaluation of a compiler algorithm for prefetching. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 62--73, Boston, MA, Oct. 1992.]] Google ScholarDigital Library
- S. Palacharla and R. E. Kessler. Evaluating stream buffers as a secondary cache replacement. In Proceedings of the 21th International Symposium on Computer Architecture, pages 24--33, Chicago, IL, Apr. 1994.]] Google ScholarDigital Library
- A. Roth, A. Moshovos, and G. S. Sohi. Dependence based prefetching for linked data structures. In Proceeding of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 115--126, 1998.]] Google ScholarDigital Library
- A. Roth and G. Sohi. Effective jump-pointer prefetching for linked data structures. In Proceedings of the 26th International Symposium on Computer Architecture, pages 111--121, Atlanta, GA, May 1999.]] Google ScholarDigital Library
- T. Sherwood, E. Perelman, and B. Calder. Basic block distribution analysis to find periodic behavior and simulation points in applications. In Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques, pages 3--14, Barcelona, Spain, Sept. 2001.]] Google ScholarDigital Library
- T. Sherwood, S. Sair, and B. Calder. Predictor-directed stream buffers. In Proceedings of the 33rd International Symposium on Microarchitecture, pages 42--53, Monterey, California, Dec. 2000.]] Google ScholarDigital Library
- J. Skeppstedt and M. Dubois. Hybrid compiler/hardware prefetching for multiprocessors using low-overhead cache miss traps. In Proceedings of the 1997 International Conference on Parallel Processing, pages 298--307, Bloomington, IL, Aug. 1997.]] Google ScholarDigital Library
- A. J. Smith. Cache memories. Computing Surveys, 14(3):473--530, Sept. 1982.]] Google ScholarDigital Library
- Y. Solihin, J. Lee, and J. Torrellas. Using a user-level memory thread for correlation prefetching. In Proceedings of the 29th International Symposium on Computer Architecture, pages 171--182, May 2002.]] Google ScholarDigital Library
- S. T. Srinivasan and A. R. Lebeck. Load latency tolerance in dynamically scheduled processors. Journal of Instruction Level Parallelism, 1:1--24, 1999.]]Google Scholar
- S. P. Vanderwiel and D. J. Lijia. A compiler-assisted data prefetch controller. In Proceedings of International Conference on Computer Design, Austin, TX, 1999.]] Google ScholarDigital Library
- M. E. Wolf and M. Lam. A data locality optimizing algorithm. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation, pages 30--44, Toronto, Canada, June 1991.]] Google ScholarDigital Library
- Y. Wu. Efficient discovery of regular stride patterns in irregular porgrams and its use in compiler prefetching. In Proceedings of the SIGPLAN 2002 Conference on Programming Language Design and Implementation, pages 210--221, Berlin, Germany, June 2002.]] Google ScholarDigital Library
- C.-L. Yang and A. R. Lebeck. Push vs. pull: Data movement for linked data structures. In Proceedings of the 1997 ACM International Conference on Supercomputing, pages 176--186, May 2000.]] Google ScholarDigital Library
- Z. Zhang and T. Torrellas. Speeding up irregular applications in shared memory multiprocessors: Memory binding and group prefetching. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 1--19, Santa Margherita Ligure, Italy, June 1995.]] Google ScholarDigital Library
- Guided region prefetching: a cooperative hardware/software approach
Recommendations
Guided region prefetching: a cooperative hardware/software approach
ISCA 2003Despite large caches, main-memory access latencies still cause significant performance losses in many applications. Numerous hardware and software prefetching schemes have been proposed to tolerate these latencies. Software prefetching typically ...
Execution History Guided Instruction Prefetching
The increasing gap in performance between processors and main memory has made effective instructions prefetching techniques more important than ever. A major deficiency of existing prefetching methods is that most of them require an extra port to I-...
Stealth prefetching
Proceedings of the 2006 ASPLOS ConferencePrefetching in shared-memory multiprocessor systems is an increasingly difficult problem. As system designs grow to incorporate larger numbers of faster processors, memory latency and interconnect traffic increase. While aggressive prefetching ...
Comments