Abstract
This paper presents a simple but effective method to reduce on-chip access latency and improve core isolation in CMP Non-Uniform Cache Architectures (NUCA). The paper introduces a feasible way to allocate cache blocks according to their access pattern. Each L2 bank is dynamically partitioned at set level into private and shared content. Simply by adjusting the replacement algorithm, we can place private data closer to its owner processor. In contrast, shared data is always placed in the same position, independently of the accessing processor. This approach is capable of reducing on-chip latency without significantly sacrificing hit rates or increasing the implementation cost over that of a conventional static NUCA. Additionally, most of the unnecessary interference between cores in private accesses is removed.
To support the architectural decisions adopted and provide a comparative study, a comprehensive evaluation framework is employed. The workbench is composed of a full-system simulator and a representative set of multithreaded and multiprogrammed workloads. With this infrastructure, different alternatives for the coherence protocol, replacement policies, and cache utilization are analyzed to find the optimal proposal. We conclude that the cost of a feasible implementation should be closer to that of a conventional static NUCA, and significantly less than that of a dynamic NUCA.
Finally, a comparison with static and dynamic NUCA is presented. The simulation results suggest that, on average, the proposed mechanism could improve system performance over a static NUCA and an idealized dynamic NUCA by 16% and 6%, respectively.
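The placement idea summarized in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the bank count, set count, associativity, block size, and every name below (`shared_bank`, `private_bank`, `L2`, `split`) are hypothetical, and plain LRU across both kinds of blocks is an assumption standing in for the adjusted replacement algorithm. Coherence and the transition of a block from private to shared are left out. The sketch only shows that a private block lands in the bank assumed closest to its owner, that a shared block lands in a bank fixed by its address, and that the private/shared split of a set shifts with demand.

```python
from collections import OrderedDict

BLOCK_BITS = 6       # 64-byte blocks (assumed)
NUM_BANKS = 8        # one L2 bank next to each of 8 cores (assumed)
SETS_PER_BANK = 4    # kept tiny for illustration
WAYS = 4             # associativity of each set (assumed)


def shared_bank(addr):
    """Shared data: bank fixed by address bits, whoever accesses it."""
    return (addr >> BLOCK_BITS) % NUM_BANKS


def private_bank(core):
    """Private data: the bank assumed to sit closest to the owner core."""
    return core % NUM_BANKS


def set_index(addr):
    """Set within a bank, taken from the block-number bits above the bank bits."""
    return ((addr >> BLOCK_BITS) // NUM_BANKS) % SETS_PER_BANK


class L2:
    def __init__(self):
        # banks[b][s] maps block number -> "private" or "shared", oldest first
        self.banks = [[OrderedDict() for _ in range(SETS_PER_BANK)]
                      for _ in range(NUM_BANKS)]

    def insert(self, addr, core, is_shared):
        """Place a block; return (bank, evicted block number or None)."""
        bank = shared_bank(addr) if is_shared else private_bank(core)
        ways = self.banks[bank][set_index(addr)]
        block = addr >> BLOCK_BITS
        ways.pop(block, None)                      # re-insert refreshes recency
        evicted = None
        if len(ways) >= WAYS:
            evicted, _ = ways.popitem(last=False)  # evict the LRU block (assumed policy)
        ways[block] = "shared" if is_shared else "private"
        return bank, evicted

    def split(self, bank, index):
        """Count (private, shared) blocks currently held in one set."""
        kinds = list(self.banks[bank][index].values())
        return kinds.count("private"), kinds.count("shared")


l2 = L2()
print(l2.insert(0x1000, core=3, is_shared=False))  # private block goes to bank 3
print(l2.insert(0x2040, core=5, is_shared=True))   # shared block goes to bank 1, from its address
```

Because private and shared blocks compete for the same ways in this sketch, no fixed quota is reserved for either kind; a set filled with private blocks gives up a way as soon as a shared block maps to it.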
Index Terms
- SP-NUCA: a cost effective dynamic non-uniform cache architecture