ABSTRACT
Cache partitioning has a wide range of uses in CMPs, from guaranteeing quality of service and controlled sharing to security-related techniques. However, existing cache partitioning schemes (such as way-partitioning) are limited to coarse-grain allocations, can only support a few partitions, and reduce cache associativity, hurting performance. Hence, these techniques apply only to CMPs with 2-4 cores, and fail to scale to tens of cores.
We present Vantage, a novel cache partitioning technique that overcomes the limitations of existing schemes: caches can have tens of partitions with sizes specified at cache line granularity, while maintaining high associativity and strong isolation among partitions. Vantage leverages cache arrays with good hashing and associativity, which enable soft-pinning a large portion of cache lines. It enforces capacity allocations by controlling the replacement process. Unlike prior schemes, Vantage provides strict isolation guarantees by partitioning most (e.g. 90%) of the cache instead of all of it. Vantage is derived from analytical models, which allow us to provide strong guarantees and bounds on associativity and sizing independent of the number of partitions and their behaviors. It is simple to implement, requiring around 1.5% state overhead and simple changes to the cache controller.
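The core idea above (a mostly partitioned "managed" region whose sizes are enforced by steering the replacement process, with the remaining fraction acting as a buffer) can be illustrated with a toy model. This is a simplified sketch of our own, not the paper's exact mechanism: class and variable names are invented, and the greedy demotion rule stands in for Vantage's aperture-based control.

```python
class VantageSketch:
    """Toy illustration of Vantage-style partitioning. The cache holds
    `total_lines` lines; a managed fraction (e.g., 90%) is divided among
    partitions with per-partition target sizes, while the unmanaged
    remainder absorbs demoted lines so evictions never need to come
    from a specific partition."""

    def __init__(self, total_lines, managed_frac=0.9):
        self.total_lines = total_lines
        self.managed_target = int(total_lines * managed_frac)
        self.targets = {}   # partition id -> target size (lines)
        self.sizes = {}     # partition id -> actual size (lines)
        self.unmanaged = 0  # lines currently in the unmanaged region

    def set_partition(self, pid, target_lines):
        self.targets[pid] = target_lines
        self.sizes.setdefault(pid, 0)

    def insert(self, pid):
        """Insert a line into partition pid. Instead of forcibly evicting
        from pid (which would hurt associativity), an over-target
        partition demotes a line into the unmanaged region, and actual
        evictions are taken from the unmanaged region when full."""
        self.sizes[pid] += 1
        # Greedy stand-in for aperture control: demote from the most
        # over-target partition, if any.
        over = [(self.sizes[p] - self.targets[p], p)
                for p in self.targets if self.sizes[p] > self.targets[p]]
        if over:
            _, victim = max(over)
            self.sizes[victim] -= 1
            self.unmanaged += 1
        # Evict from the unmanaged region once total occupancy exceeds
        # the cache capacity.
        if self.unmanaged + sum(self.sizes.values()) > self.total_lines:
            self.unmanaged -= 1
```

In this model, a partition that streams in lines converges to its target size while its overflow fills (and is then evicted from) the unmanaged buffer, which is how partition sizing is decoupled from the eviction order of any single partition.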
We evaluate Vantage using extensive simulations. On a 32-core system, using 350 multiprogrammed workloads and one partition per core, partitioning the last-level cache with conventional techniques degrades throughput for 71% of the workloads versus an unpartitioned cache (by 7% average, 25% maximum degradation), even when using 64-way caches. In contrast, Vantage improves throughput for 98% of the workloads, by 8% on average (up to 20%), using a 4-way cache.