ABSTRACT
Off-chip main memory has long been a bottleneck for system performance. With increasing memory pressure due to multiple on-chip cores, effective cache utilization is important. In a system with limited cache space, we would ideally like to prevent 1) cache pollution, i.e., blocks with low reuse evicting blocks with high reuse from the cache, and 2) cache thrashing, i.e., blocks with high reuse evicting each other from the cache.
In this paper, we propose a new, simple mechanism to predict the reuse behavior of missed cache blocks in a manner that mitigates both pollution and thrashing. Our mechanism tracks the addresses of recently evicted blocks in a structure called the Evicted-Address Filter (EAF). Missed blocks whose addresses are present in the EAF are predicted to have high reuse, and all other blocks are predicted to have low reuse. The key observation behind this prediction scheme is that if a block with high reuse is prematurely evicted from the cache, it will be accessed soon after eviction. We show that an EAF implementation using a Bloom filter, which is cleared periodically, naturally mitigates the thrashing problem by ensuring that only a portion of a thrashing working set is retained in the cache, while incurring low storage cost and implementation complexity.
We compare our EAF-based mechanism to five state-of-the-art mechanisms that address cache pollution or thrashing, and show that it provides significant performance improvements for a wide variety of workloads and system configurations.
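The EAF mechanism described above can be sketched in a few lines: a Bloom filter records the addresses of recently evicted blocks, a missed block found in the filter is predicted to have high reuse, and periodic clearing of the filter retains only part of a thrashing working set. The sizes, hash construction, and clearing interval below are illustrative assumptions, not the paper's exact parameters.

```python
# Sketch of the Evicted-Address Filter (EAF) idea: a periodically
# cleared Bloom filter over recently evicted block addresses.
# All parameters here (num_bits, num_hashes, clear_after) are
# hypothetical choices for illustration.
import hashlib

class EvictedAddressFilter:
    def __init__(self, num_bits=1024, num_hashes=3, clear_after=256):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.clear_after = clear_after
        self.bits = [False] * num_bits
        self.inserts = 0

    def _indices(self, addr):
        # Derive several bit positions from the block address.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(h[:4], "little") % self.num_bits

    def on_eviction(self, addr):
        # Record a recently evicted block address in the filter.
        for idx in self._indices(addr):
            self.bits[idx] = True
        self.inserts += 1
        if self.inserts >= self.clear_after:
            # Periodic clearing ensures only a portion of a thrashing
            # working set is treated as high-reuse.
            self.bits = [False] * self.num_bits
            self.inserts = 0

    def predict_high_reuse(self, addr):
        # A missed block present in the filter was evicted recently,
        # so it is predicted to have high reuse. Bloom filters admit
        # false positives but no false negatives.
        return all(self.bits[idx] for idx in self._indices(addr))
```

On a cache miss, the insertion policy would consult `predict_high_reuse(addr)` to decide whether to insert the block with a high- or low-reuse priority; clearing the filter after a fixed number of insertions bounds how much history it retains.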