ABSTRACT
Off-chip main memory has long been a bottleneck for system performance. With increasing memory pressure due to multiple on-chip cores, effective cache utilization is important. In a system with limited cache space, we would ideally like to prevent 1) cache pollution, i.e., blocks with low reuse evicting blocks with high reuse from the cache, and 2) cache thrashing, i.e., blocks with high reuse evicting each other from the cache.
In this paper, we propose a new, simple mechanism to predict the reuse behavior of missed cache blocks in a manner that mitigates both pollution and thrashing. Our mechanism tracks the addresses of recently evicted blocks in a structure called the Evicted-Address Filter (EAF). Missed blocks whose addresses are present in the EAF are predicted to have high reuse, and all other blocks are predicted to have low reuse. The key observation behind this prediction scheme is that if a block with high reuse is prematurely evicted from the cache, it will be accessed soon after eviction. We show that an EAF implementation using a Bloom filter, which is cleared periodically, naturally mitigates the thrashing problem by ensuring that only a portion of a thrashing working set is retained in the cache, while incurring low storage cost and implementation complexity.
We compare our EAF-based mechanism to five state-of-the-art mechanisms that address cache pollution or thrashing, and show that it provides significant performance improvements for a wide variety of workloads and system configurations.
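The EAF mechanism described above can be sketched in a few lines: a Bloom filter records the addresses of recently evicted blocks, a missed block found in the filter is predicted to have high reuse, and periodic clearing of the filter retains only part of a thrashing working set. The sizes, hash construction, and clearing interval below are illustrative assumptions, not the paper's exact parameters.

```python
# Sketch of the Evicted-Address Filter (EAF) idea: a periodically
# cleared Bloom filter over recently evicted block addresses.
# All parameters here (num_bits, num_hashes, clear_after) are
# hypothetical choices for illustration.
import hashlib

class EvictedAddressFilter:
    def __init__(self, num_bits=1024, num_hashes=3, clear_after=256):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.clear_after = clear_after
        self.bits = [False] * num_bits
        self.inserts = 0

    def _indices(self, addr):
        # Derive several bit positions from the block address.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(h[:4], "little") % self.num_bits

    def on_eviction(self, addr):
        # Record a recently evicted block address in the filter.
        for idx in self._indices(addr):
            self.bits[idx] = True
        self.inserts += 1
        if self.inserts >= self.clear_after:
            # Periodic clearing ensures only a portion of a thrashing
            # working set is treated as high-reuse.
            self.bits = [False] * self.num_bits
            self.inserts = 0

    def predict_high_reuse(self, addr):
        # A missed block present in the filter was evicted recently,
        # so it is predicted to have high reuse. Bloom filters admit
        # false positives but no false negatives.
        return all(self.bits[idx] for idx in self._indices(addr))
```

On a cache miss, the insertion policy would consult `predict_high_reuse(addr)` to decide whether to insert the block with a high- or low-reuse priority; clearing the filter after a fixed number of insertions bounds how much history it retains.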