ABSTRACT
Cache partitioning has a wide range of uses in CMPs, from guaranteeing quality of service and controlled sharing to security-related techniques. However, existing cache partitioning schemes (such as way-partitioning) are limited to coarse-grain allocations, can only support a few partitions, and reduce cache associativity, hurting performance. Hence, these techniques apply only to CMPs with 2-4 cores, and fail to scale to tens of cores.
We present Vantage, a novel cache partitioning technique that overcomes the limitations of existing schemes: caches can have tens of partitions with sizes specified at cache line granularity, while maintaining high associativity and strong isolation among partitions. Vantage leverages cache arrays with good hashing and associativity, which enable soft-pinning a large portion of cache lines. It enforces capacity allocations by controlling the replacement process. Unlike prior schemes, Vantage provides strict isolation guarantees by partitioning most (e.g. 90%) of the cache instead of all of it. Vantage is derived from analytical models, which allow us to provide strong guarantees and bounds on associativity and sizing independent of the number of partitions and their behaviors. It is simple to implement, requiring around 1.5% state overhead and simple changes to the cache controller.
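The core idea above (a mostly partitioned "managed" region whose sizes are enforced by steering the replacement process, with the remaining fraction acting as a buffer) can be illustrated with a toy model. This is a simplified sketch of our own, not the paper's exact mechanism: class and variable names are invented, and the greedy demotion rule stands in for Vantage's aperture-based control.

```python
class VantageSketch:
    """Toy illustration of Vantage-style partitioning. The cache holds
    `total_lines` lines; a managed fraction (e.g., 90%) is divided among
    partitions with per-partition target sizes, while the unmanaged
    remainder absorbs demoted lines so evictions never need to come
    from a specific partition."""

    def __init__(self, total_lines, managed_frac=0.9):
        self.total_lines = total_lines
        self.managed_target = int(total_lines * managed_frac)
        self.targets = {}   # partition id -> target size (lines)
        self.sizes = {}     # partition id -> actual size (lines)
        self.unmanaged = 0  # lines currently in the unmanaged region

    def set_partition(self, pid, target_lines):
        self.targets[pid] = target_lines
        self.sizes.setdefault(pid, 0)

    def insert(self, pid):
        """Insert a line into partition pid. Instead of forcibly evicting
        from pid (which would hurt associativity), an over-target
        partition demotes a line into the unmanaged region, and actual
        evictions are taken from the unmanaged region when full."""
        self.sizes[pid] += 1
        # Greedy stand-in for aperture control: demote from the most
        # over-target partition, if any.
        over = [(self.sizes[p] - self.targets[p], p)
                for p in self.targets if self.sizes[p] > self.targets[p]]
        if over:
            _, victim = max(over)
            self.sizes[victim] -= 1
            self.unmanaged += 1
        # Evict from the unmanaged region once total occupancy exceeds
        # the cache capacity.
        if self.unmanaged + sum(self.sizes.values()) > self.total_lines:
            self.unmanaged -= 1
```

In this model, a partition that streams in lines converges to its target size while its overflow fills (and is then evicted from) the unmanaged buffer, which is how partition sizing is decoupled from the eviction order of any single partition.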
We evaluate Vantage using extensive simulations. On a 32-core system, using 350 multiprogrammed workloads and one partition per core, partitioning the last-level cache with conventional techniques degrades throughput for 71% of the workloads versus an unpartitioned cache (by 7% average, 25% maximum degradation), even when using 64-way caches. In contrast, Vantage improves throughput for 98% of the workloads, by 8% on average (up to 20%), using a 4-way cache.