ABSTRACT
This paper describes Cache Equalizer (CE), a novel distributed cache management scheme for large-scale chip multiprocessors (CMPs). Our work is motivated by the large asymmetry in how heavily individual cache sets are used. CE decouples the physical locations of cache blocks from their addresses in order to reduce misses caused by destructive interference. Temporal pressure at the on-chip last-level cache is continuously collected at the granularity of a group (composed of cache sets) and periodically recorded at the memory controller to guide the placement process. An incoming block is then placed in the cache group that exhibits the minimum pressure. Simulation results using a full-system simulator demonstrate that CE reduces the L2 miss rate by 13.6% on average, and by as much as 46.7%, relative to a shared NUCA scheme for the benchmark programs we examined. Furthermore, our evaluation shows that CE outperforms related cache designs.
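The placement idea in the abstract can be illustrated with a minimal sketch. It is not the paper's hardware design: the class name `PressureTable`, its methods, the use of a simple event count as the pressure metric, and the epoch mechanism are all assumptions made here for illustration. The sketch only shows pressure being accumulated per group, snapshotted periodically, and then used to pick the least-pressured group for an incoming block.

```python
class PressureTable:
    """Hypothetical per-group pressure record, as might be kept at a memory controller."""

    def __init__(self, num_groups):
        self.num_groups = num_groups
        # Snapshot from the last completed epoch; placement decisions read this.
        self.recorded = [0] * num_groups
        # Counters accumulated during the current epoch.
        self.current = [0] * num_groups
        # Block address -> group; the location is decoupled from the address.
        self.location = {}

    def note_pressure(self, group, amount=1):
        """Accumulate pressure for a group (assumed metric: a plain event count)."""
        self.current[group] += amount

    def end_epoch(self):
        """Periodically publish the collected counters and start a new epoch."""
        self.recorded = self.current
        self.current = [0] * self.num_groups

    def pick_group(self):
        """Return the group with minimum recorded pressure (lowest index on ties)."""
        return min(range(self.num_groups), key=lambda g: self.recorded[g])

    def place(self, block_addr):
        """Place an incoming block in the least-pressured group and remember where."""
        group = self.pick_group()
        self.location[block_addr] = group
        return group


table = PressureTable(num_groups=4)
for group, events in [(0, 5), (1, 2), (2, 7), (3, 2)]:
    table.note_pressure(group, events)
table.end_epoch()
chosen = table.place(block_addr=0x1F40)
print(chosen)
```

Because the block's group is looked up in `location` rather than derived from its address, a later access would need that mapping to find the block; how the real scheme tracks locations is not stated in the abstract.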
Index Terms
- Cache equalizer: a placement mechanism for chip multiprocessor distributed shared caches