ABSTRACT
Classic cache replacement policies assume that miss costs are uniform. However, the correlation between miss rate and cache performance is not as straightforward as it used to be. Ultimately, the true cost measure of a miss should be the penalty, i.e., the actual processing bandwidth lost because of the miss. It is known that, unlike the penalty of loads, the penalty of stores is mostly hidden in modern processors. To take advantage of this observation, we propose simple schemes that replace load misses with store misses. We extend classic replacement algorithms such as LRU (Least Recently Used) and PLRU (Partial LRU) to reduce the aggregate miss penalty instead of the miss count. One key issue is to predict the type of the next access to a block, so that higher replacement priority is given to blocks that will be accessed next with a store. We introduce and evaluate various instruction-based prediction schemes, broadly inspired by branch predictors. To guide the design, we run extensive trace-driven simulations on eight SPEC95 benchmarks with a wide range of cache configurations and observe that our simple penalty-sensitive policies yield positive load miss improvements over classic algorithms across most of the benchmarks and cache configurations. In some cases the improvements are very large.
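The idea of giving higher replacement priority to blocks whose next access is predicted to be a store can be illustrated with a minimal sketch of a single cache set. This is not the policy evaluated in the paper: the class name, the `window` parameter, and the caller-supplied `predicted_next` value (standing in for an instruction-based predictor) are all assumptions made for illustration.

```python
from collections import OrderedDict

LOAD, STORE = "load", "store"


class PenaltySensitiveLRUSet:
    """Hypothetical single cache set, ordered from LRU (first) to MRU (last)."""

    def __init__(self, ways, window=None):
        self.ways = ways
        # Assumed knob: how many LRU-most blocks may be considered as victims.
        # window=1 degenerates to plain LRU.
        self.window = ways if window is None else window
        self.blocks = OrderedDict()  # tag -> predicted type of the next access
        self.load_misses = 0
        self.store_misses = 0

    def _pick_victim(self):
        candidates = list(self.blocks.items())[: self.window]
        # Prefer the least recently used block predicted to be stored to next,
        # since a store miss is assumed to be cheaper than a load miss.
        for tag, predicted in candidates:
            if predicted == STORE:
                return tag
        return candidates[0][0]  # no such block: fall back to plain LRU

    def access(self, tag, kind, predicted_next):
        """Access `tag` with a load or store; return True on a hit."""
        hit = tag in self.blocks
        if hit:
            self.blocks.move_to_end(tag)  # promote to MRU
        else:
            if kind == LOAD:
                self.load_misses += 1
            else:
                self.store_misses += 1
            if len(self.blocks) >= self.ways:
                del self.blocks[self._pick_victim()]
        # The prediction is supplied by the caller (a stand-in predictor).
        self.blocks[tag] = predicted_next
        return hit


# Usage: block B is predicted to be stored to next, so it is evicted
# ahead of the older block A, which plain LRU would have chosen.
s = PenaltySensitiveLRUSet(ways=2)
s.access("A", LOAD, predicted_next=LOAD)
s.access("B", LOAD, predicted_next=STORE)
s.access("C", LOAD, predicted_next=LOAD)   # evicts B, not A
print(s.access("A", LOAD, predicted_next=LOAD))
```

In this sketch a wrong prediction only changes which block is evicted; how such mispredictions affect the aggregate penalty depends on the predictor and is not modeled here.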