Abstract
Inclusive caches have been widely used in Chip Multiprocessors (CMPs) to simplify cache coherence. However, they have poor performance compared with noninclusive caches not only because of the limited capacity of the entire cache hierarchy but also due to ignorance of temporal locality of the Last-Level Cache (LLC). Blocks that are highly referenced (referred to as hot blocks) are always hit in higher-level caches (e.g., L1 cache) and are rarely referenced in the LLC. Therefore, they become replacement victims in the LLC. Due to the inclusion property, blocks evicted from the LLC have to also be invalidated from higher-level caches. Invalidation of hot blocks from the entire cache hierarchy introduces costly off-chip misses that makes the inclusive cache perform poorly.
Neither blocks that are highly referenced in the LLC nor blocks that are highly referenced in higher-level caches should be the LLC replacement victims. We propose temporal-based multilevel correlating cache replacement for inclusive caches to evict blocks in the LLC that are also not hot in higher-level caches using correlated temporal information acquired from all levels of a cache hierarchy with minimal overhead. Invalidation of these blocks does not hurt the performance. By contrast, replacing them as early as possible with useful blocks helps improve cache performance. Based on our experiments, in a dual-core CMP, an inclusive cache with temporal-based multilevel correlating cache replacement significantly outperforms an inclusive cache with traditional LRU replacement by yielding an average speedup of 12.7%, which is comparable to an enhanced noninclusive cache, while requiring less than 1% of storage overhead.
- AMD. 2012. AMD FX processors. (2012). Retrieved November 13, 2013, from http://www.amd.com/us/products/desktop/processors/amdfx/.Google Scholar
- Baer, J.-L. and Wang, W.-H. 1988. On the Inclusion Properties for Multi-Level Cache Hierarchies. Vol. 16. IEEE Computer Society Press. Google ScholarDigital Library
- Casazza, J. 2009. Intel Core i7-800 processor series and the Intel Core i5-700 processor series based on Intel microarchitecture (Nehalem). White paper, Intel Corp.Google Scholar
- Chen, X., Yang, Y., Gopalakrishnan, G., and Chou, C. T. 2006. Reducing verification complexity of a multicore coherence protocol using assume/guarantee. In Formal Methods in Computer Aided Design, 2006 (FMCAD'06). IEEE, Los Alamitos, CA, 81--88. Google ScholarDigital Library
- Lai, A.-C. and Falsafi, B. 2000. Selective, accurate, and timely self-invalidation using last-touch prediction. In Proceedings of the 27th Annual International Symposium on Computer Architecture. 139--148. Google ScholarDigital Library
- Lai, A.-C., Fide, C., and Falsafi, B. 2001. Dead-block prediction and dead-block correlating prefetchers. In Proceedings of the 28th International Symposium on Computer Architecture. 144--154. Google ScholarDigital Library
- Garde, R. V., Subramaniam, S., and Loh, G. H. 2008. Deconstructing the inefficacy of global cache replacement policies. In Proceedings of the 7th Workshop on Duplicating, Deconstructing, and Debunking (WDDD'08).Google Scholar
- Gaur, J., Chaudhuri, M., and Subramoney, S. 2011. Bypass and insertion algorithms for exclusive last-level caches. In Proceedings of the 38th Annual International Symposium on Computer Architecture. ACM, New York, NY, 81--92. Google ScholarDigital Library
- Henning, J. L. 2006. SPEC CPU2006 benchmark descriptions. ACM SIGARCH Comput. Architecture News 34, 4, 1--17. Google ScholarDigital Library
- Hu, Z., Kaxiras, S., and Martonosi, M. 2002. Timekeeping in the memory system: Predicting and optimizing memory behavior. In Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA'02). IEEE, Los Alamitos, CA, 209--220. http://dl.acm.org/citation.cfm?id=545215.545239. Google ScholarDigital Library
- Jaleel, A., Borch, E., Bhandaru, M., Steely Jr., S. C., and Emer, J. 2010. Achieving non-inclusive cache performance with inclusive caches: Temporal Locality Aware (TLA) cache management policies. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'43). IEEE, Los Alamitos, CA, 151--162. http://dx.doi.org/10.1109/MICRO.2010.52. Google ScholarDigital Library
- Jouppi, N. P. and Wilton, S. J. E. 1994. Tradeoffs in two-level on-chip caching. ACM SIGARCH Comput. Architecture News 22, 2, 34--45. Google ScholarDigital Library
- Khan, S. M., Jiménez, D. A., Burger, D., and Falsafi, B. 2010a. Using dead blocks as a virtual victim cache. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT'10). ACM, New York, NY, 489--500. http://dx.doi.org/10.1145/1854273.1854333. Google ScholarDigital Library
- Khan, S. M., Tian, Y., and Jimenez, D. A. 2010b. Sampling dead block prediction for last-level caches. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'43). IEEE, Los Alamitos, CA, 175--186. http://dx.doi.org/10.1109/MICRO.2010.24. Google ScholarDigital Library
- Kharbutli, M. and Solihin, Y. 2008. Counter-based cache replacement and bypassing algorithms. IEEE Trans. Comput. 57, 4, 433--447. http://dx.doi.org/10.1109/TC.2007.70816. Google ScholarDigital Library
- Kurd, N. A., Bhamidipati, S., Mozak, C., Miller, J. L., Wilson, T. M., Nemani, M., and Chowdhury, M. 2010. Westmere: A family of 32nm IA processors. In Proceedings of the 2010 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC'10). IEEE, Los Alamitos, CA, 96--97.Google Scholar
- Lebeck, A. R. and Wood, D. A. 1995. Dynamic self-invalidation: Reducing coherence overhead in shared-memory multiprocessors. In Proceedings of the 22nd Annual International Symposium on Computer Architecture. 48--59. Google ScholarDigital Library
- Michaud, P., Seznec, A., and Uhlig, R. 1997. Trading conflict and capacity aliasing in conditional branch predictors. ACM SIGARCH Comput. Architect. News 25, 2, 292--303. http://dx.doi.org/10.1145/384286.264211. Google ScholarDigital Library
- Muralimanohar, N., Balasubramonian, R., and Jouppi, N. P. 2009. CACTI 6.0: A tool to model large caches. Technical report HPL-2009-85. HP Laboratories.Google Scholar
- Nair, R. 1995. Dynamic path-based branch correlation. In Proceedings of the 28th Annual International Symposium on Microarchitecture. IEEE, Los Alamitos, CA, 15--23. Google ScholarDigital Library
- Patel, A., Afram, F., Chen, S., and Ghose, K. 2011. MARSSx86: A full system simulator for x86 CPUs. In Proceedings of the 2011 48th ACM/EDAC/IEEE Design Automation Conference (DAC'11). 1050--1055. Google ScholarDigital Library
- Qureshi, M. K., Jaleel, A., Patt, Y. N., Steely, S. C., and Emer, J. 2007. Adaptive insertion policies for high performance caching. In ACM SIGARCH Comput. Architecture News, 35, 381--391. Google ScholarDigital Library
- Qureshi, M. K. and Patt, Y. N. 2006. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'39). IEEE, Los Alamitos, CA, 423--432. http://dx.doi.org/10.1109/MICRO.2006.49. Google ScholarDigital Library
- Sanchez, D. and Kozyrakis, C. 2010. The zcache: Decoupling ways and associativity. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, Los Alamitos, CA, 187--198. Google ScholarDigital Library
- Seznec, A. 1993. A case for two-way skewed-associative caches. In Proceedings of the 20th Annual International Symposium on Computer Architecture (ISCA'93). 169--178. Google ScholarDigital Library
- Somogyi, S., Wenisch, T. F., Hardavellas, N., Kim, J., Ailamaki, A., and Falsafi, B. 2004. Memory coherence activity prediction in commercial workloads. In Proceedings of the 3rd Workshop on Memory Performance Issues: In Conjunction with the 31st International Symposium on Computer Architecture (WMPI'04). ACM, New York, NY, 37--45. http://dx.doi.org/10.1145/1054943.1054949. Google ScholarDigital Library
- Wendel, D. F., Kalla, R., Warnock, J. D., Cargnoni, R., Chu, S. G., Clabes, J. G., Dreps, D., Hrusecky, D., Friedrich, J., Islam, S., et al. 2011. POWER7, a highly parallel, scalable multi-core high end server processor. IEEE J. Solid-State Circuits 46, 1, 145--161.Google ScholarCross Ref
- Wu, C.-J., Jaleel, A., Hasenplaugh, W., Martonosi Jr., M., Steely, S. C., and Emer, J. 2011. SHiP: Signature-based hit predictor for high performance caching. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44'11). ACM, New York, NY, 430--441. http://dx.doi.org/10.1145/2155620.2155671. Google ScholarDigital Library
- Zahran, M. 2007. Cache replacement policy revisited. In Proceedings of the 6th Workshop on Duplicating, Deconstructing, and Debunking (ISCA).Google Scholar
- Zahran, M., Albayraktaroglu, K., and Franklin, M. 2007. Non-inclusion property in multi-level caches revisited. Int. J. Comput. Appl. 14, 2, 99.Google Scholar
- Zahran, M. and McKee, S. A. 2010. Global management of cache hierarchies. In Proceedings of the 7th ACM International Conference on Computing Frontiers (CF'10). ACM, New York, NY, 131--140. http://dx.doi.org/10.1145/1787275.1787315. Google ScholarDigital Library
- Zheng, Y., Davis, B. T., and Jordan, M. 2004. Performance evaluation of exclusive cache hierarchies. In Proceedings of the 2004 IEEE International Symposium on Performance Analysis of Systems and Software. IEEE, Los Alamitos, CA, 89--96. http://dl.acm.org/citation.cfm?id=1153925.1154591. Google ScholarDigital Library
Index Terms
Temporal-based multilevel correlating inclusive cache replacement
Recommendations
Pseudo-LIFO: the foundation of a new family of replacement policies for last-level caches
MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on MicroarchitectureCache blocks often exhibit a small number of uses during their life time in the last-level cache. Past research has exploited this property in two different ways. First, replacement policies have been designed to evict dead blocks early and retain the ...
Introducing hierarchy-awareness in replacement and bypass algorithms for last-level caches
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniquesThe replacement policies for the last-level caches (LLCs) are usually designed based on the access information available locally at the LLC. These policies are inherently sub-optimal due to lack of information about the activities in the inner-levels of ...
Dense Footprint Cache: Capacity-Efficient Die-Stacked DRAM Last Level Cache
MEMSYS '16: Proceedings of the Second International Symposium on Memory SystemsDie-stacked DRAM technology enables a large Last Level Cache (LLC) that provides high bandwidth data access to the processor. However, it requires a large tag array that may take a significant portion of the on-chip SRAM budget. To reduce this SRAM ...
Comments