research-article

Cache equalizer: a placement mechanism for chip multiprocessor distributed shared caches

Authors:
Mohammad Hammoud, University of Pittsburgh, Pittsburgh, PA
Sangyeun Cho, University of Pittsburgh, Pittsburgh, PA
Rami G. Melhem, University of Pittsburgh, Pittsburgh, PA

HiPEAC '11: Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers, January 2011, Pages 177–186
https://doi.org/10.1145/1944862.1944889

Published: 24 January 2011

ABSTRACT

This paper describes Cache Equalizer (CE), a novel distributed cache management scheme for large-scale chip multiprocessors (CMPs). Our work is motivated by the large asymmetry in the usage of cache sets. CE decouples the physical locations of cache blocks from their addresses in order to reduce misses caused by destructive interference. Temporal pressure at the on-chip last-level cache is continuously collected at the granularity of a group (a collection of cache sets) and periodically recorded at the memory controller to guide the placement process. An incoming block is then placed in the cache group that exhibits the minimum pressure. Simulation results from a full-system simulator show that, for the benchmark programs we examined, CE reduces the L2 miss rate by 13.6% on average, and by as much as 46.7%, relative to a shared NUCA scheme. Furthermore, our evaluation shows that CE outperforms related cache designs.
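The placement loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `PressureTable`, its methods, and the choice to count pressure as one unit per observed event (for example, a miss in that group) are all assumptions made for illustration, since the abstract does not define how pressure is measured or how ties are broken.

```python
class PressureTable:
    """Hypothetical model of the per-group pressure record (names are assumptions)."""

    def __init__(self, num_groups):
        self.num_groups = num_groups
        # Counters updated continuously as the cache runs.
        self.live = [0] * num_groups
        # Snapshot periodically recorded at the memory controller.
        self.recorded = [0] * num_groups

    def note_pressure(self, group, amount=1):
        """Accumulate pressure for one group (assumed: one unit per event)."""
        self.live[group] += amount

    def end_epoch(self):
        """Record the current counters and start a fresh collection period."""
        self.recorded = list(self.live)
        self.live = [0] * self.num_groups

    def choose_group(self):
        """Return the group with minimum recorded pressure.

        Ties go to the lowest group index; this tie-break is an assumption.
        """
        return min(range(self.num_groups), key=lambda g: self.recorded[g])


table = PressureTable(num_groups=4)
for group in (0, 0, 0, 1, 3, 3):
    table.note_pressure(group)
table.end_epoch()
print(table.choose_group())  # prints 2: group 2 saw no pressure this epoch
```

Placement in this sketch consults only the recorded snapshot, not the live counters, which mirrors the abstract's split between continuous collection and periodic recording. Because a block's group no longer follows from its address, a real design would also need a way to find the block later; the abstract does not describe that mechanism, so it is left out here.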

Index Terms

1. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory


Published in
HiPEAC '11: Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
January 2011
226 pages
ISBN: 9781450302418
DOI: 10.1145/1944862
General Chairs: Manolis Katevenis (FORTH-ICS and U.Crete, Greece), Margaret Martonosi (Princeton University)
Program Chairs: Christos Kozyrakis (Stanford University), Olivier Temam (INRIA, France)
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
chip multiprocessors
group-based placement
pressure-aware placement
private cache
shared cache
Qualifiers
- research-article