skip to main content
research-article
Free Access

Temporal-based multilevel correlating inclusive cache replacement

Published:01 December 2013Publication History
Skip Abstract Section

Abstract

Inclusive caches have been widely used in Chip Multiprocessors (CMPs) to simplify cache coherence. However, they have poor performance compared with noninclusive caches not only because of the limited capacity of the entire cache hierarchy but also due to ignorance of temporal locality of the Last-Level Cache (LLC). Blocks that are highly referenced (referred to as hot blocks) are always hit in higher-level caches (e.g., L1 cache) and are rarely referenced in the LLC. Therefore, they become replacement victims in the LLC. Due to the inclusion property, blocks evicted from the LLC have to also be invalidated from higher-level caches. Invalidation of hot blocks from the entire cache hierarchy introduces costly off-chip misses that makes the inclusive cache perform poorly.

Neither blocks that are highly referenced in the LLC nor blocks that are highly referenced in higher-level caches should be the LLC replacement victims. We propose temporal-based multilevel correlating cache replacement for inclusive caches to evict blocks in the LLC that are also not hot in higher-level caches using correlated temporal information acquired from all levels of a cache hierarchy with minimal overhead. Invalidation of these blocks does not hurt the performance. By contrast, replacing them as early as possible with useful blocks helps improve cache performance. Based on our experiments, in a dual-core CMP, an inclusive cache with temporal-based multilevel correlating cache replacement significantly outperforms an inclusive cache with traditional LRU replacement by yielding an average speedup of 12.7%, which is comparable to an enhanced noninclusive cache, while requiring less than 1% of storage overhead.

References

  1. AMD. 2012. AMD FX processors. (2012). Retrieved November 13, 2013, from http://www.amd.com/us/products/desktop/processors/amdfx/.Google ScholarGoogle Scholar
  2. Baer, J.-L. and Wang, W.-H. 1988. On the Inclusion Properties for Multi-Level Cache Hierarchies. Vol. 16. IEEE Computer Society Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Casazza, J. 2009. Intel Core i7-800 processor series and the Intel Core i5-700 processor series based on Intel microarchitecture (Nehalem). White paper, Intel Corp.Google ScholarGoogle Scholar
  4. Chen, X., Yang, Y., Gopalakrishnan, G., and Chou, C. T. 2006. Reducing verification complexity of a multicore coherence protocol using assume/guarantee. In Formal Methods in Computer Aided Design, 2006 (FMCAD'06). IEEE, Los Alamitos, CA, 81--88. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Lai, A.-C. and Falsafi, B. 2000. Selective, accurate, and timely self-invalidation using last-touch prediction. In Proceedings of the 27th Annual International Symposium on Computer Architecture. 139--148. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Lai, A.-C., Fide, C., and Falsafi, B. 2001. Dead-block prediction and dead-block correlating prefetchers. In Proceedings of the 28th International Symposium on Computer Architecture. 144--154. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Garde, R. V., Subramaniam, S., and Loh, G. H. 2008. Deconstructing the inefficacy of global cache replacement policies. In Proceedings of the 7th Workshop on Duplicating, Deconstructing, and Debunking (WDDD'08).Google ScholarGoogle Scholar
  8. Gaur, J., Chaudhuri, M., and Subramoney, S. 2011. Bypass and insertion algorithms for exclusive last-level caches. In Proceedings of the 38th Annual International Symposium on Computer Architecture. ACM, New York, NY, 81--92. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Henning, J. L. 2006. SPEC CPU2006 benchmark descriptions. ACM SIGARCH Comput. Architecture News 34, 4, 1--17. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Hu, Z., Kaxiras, S., and Martonosi, M. 2002. Timekeeping in the memory system: Predicting and optimizing memory behavior. In Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA'02). IEEE, Los Alamitos, CA, 209--220. http://dl.acm.org/citation.cfm?id=545215.545239. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Jaleel, A., Borch, E., Bhandaru, M., Steely Jr., S. C., and Emer, J. 2010. Achieving non-inclusive cache performance with inclusive caches: Temporal Locality Aware (TLA) cache management policies. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'43). IEEE, Los Alamitos, CA, 151--162. http://dx.doi.org/10.1109/MICRO.2010.52. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Jouppi, N. P. and Wilton, S. J. E. 1994. Tradeoffs in two-level on-chip caching. ACM SIGARCH Comput. Architecture News 22, 2, 34--45. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Khan, S. M., Jiménez, D. A., Burger, D., and Falsafi, B. 2010a. Using dead blocks as a virtual victim cache. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT'10). ACM, New York, NY, 489--500. http://dx.doi.org/10.1145/1854273.1854333. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Khan, S. M., Tian, Y., and Jimenez, D. A. 2010b. Sampling dead block prediction for last-level caches. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'43). IEEE, Los Alamitos, CA, 175--186. http://dx.doi.org/10.1109/MICRO.2010.24. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Kharbutli, M. and Solihin, Y. 2008. Counter-based cache replacement and bypassing algorithms. IEEE Trans. Comput. 57, 4, 433--447. http://dx.doi.org/10.1109/TC.2007.70816. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Kurd, N. A., Bhamidipati, S., Mozak, C., Miller, J. L., Wilson, T. M., Nemani, M., and Chowdhury, M. 2010. Westmere: A family of 32nm IA processors. In Proceedings of the 2010 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC'10). IEEE, Los Alamitos, CA, 96--97.Google ScholarGoogle Scholar
  17. Lebeck, A. R. and Wood, D. A. 1995. Dynamic self-invalidation: Reducing coherence overhead in shared-memory multiprocessors. In Proceedings of the 22nd Annual International Symposium on Computer Architecture. 48--59. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Michaud, P., Seznec, A., and Uhlig, R. 1997. Trading conflict and capacity aliasing in conditional branch predictors. ACM SIGARCH Comput. Architect. News 25, 2, 292--303. http://dx.doi.org/10.1145/384286.264211. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Muralimanohar, N., Balasubramonian, R., and Jouppi, N. P. 2009. CACTI 6.0: A tool to model large caches. Technical report HPL-2009-85. HP Laboratories.Google ScholarGoogle Scholar
  20. Nair, R. 1995. Dynamic path-based branch correlation. In Proceedings of the 28th Annual International Symposium on Microarchitecture. IEEE, Los Alamitos, CA, 15--23. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Patel, A., Afram, F., Chen, S., and Ghose, K. 2011. MARSSx86: A full system simulator for x86 CPUs. In Proceedings of the 2011 48th ACM/EDAC/IEEE Design Automation Conference (DAC'11). 1050--1055. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Qureshi, M. K., Jaleel, A., Patt, Y. N., Steely, S. C., and Emer, J. 2007. Adaptive insertion policies for high performance caching. In ACM SIGARCH Comput. Architecture News, 35, 381--391. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Qureshi, M. K. and Patt, Y. N. 2006. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'39). IEEE, Los Alamitos, CA, 423--432. http://dx.doi.org/10.1109/MICRO.2006.49. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Sanchez, D. and Kozyrakis, C. 2010. The zcache: Decoupling ways and associativity. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, Los Alamitos, CA, 187--198. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Seznec, A. 1993. A case for two-way skewed-associative caches. In Proceedings of the 20th Annual International Symposium on Computer Architecture (ISCA'93). 169--178. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Somogyi, S., Wenisch, T. F., Hardavellas, N., Kim, J., Ailamaki, A., and Falsafi, B. 2004. Memory coherence activity prediction in commercial workloads. In Proceedings of the 3rd Workshop on Memory Performance Issues: In Conjunction with the 31st International Symposium on Computer Architecture (WMPI'04). ACM, New York, NY, 37--45. http://dx.doi.org/10.1145/1054943.1054949. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Wendel, D. F., Kalla, R., Warnock, J. D., Cargnoni, R., Chu, S. G., Clabes, J. G., Dreps, D., Hrusecky, D., Friedrich, J., Islam, S., et al. 2011. POWER7, a highly parallel, scalable multi-core high end server processor. IEEE J. Solid-State Circuits 46, 1, 145--161.Google ScholarGoogle ScholarCross RefCross Ref
  28. Wu, C.-J., Jaleel, A., Hasenplaugh, W., Martonosi Jr., M., Steely, S. C., and Emer, J. 2011. SHiP: Signature-based hit predictor for high performance caching. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44'11). ACM, New York, NY, 430--441. http://dx.doi.org/10.1145/2155620.2155671. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Zahran, M. 2007. Cache replacement policy revisited. In Proceedings of the 6th Workshop on Duplicating, Deconstructing, and Debunking (ISCA).Google ScholarGoogle Scholar
  30. Zahran, M., Albayraktaroglu, K., and Franklin, M. 2007. Non-inclusion property in multi-level caches revisited. Int. J. Comput. Appl. 14, 2, 99.Google ScholarGoogle Scholar
  31. Zahran, M. and McKee, S. A. 2010. Global management of cache hierarchies. In Proceedings of the 7th ACM International Conference on Computing Frontiers (CF'10). ACM, New York, NY, 131--140. http://dx.doi.org/10.1145/1787275.1787315. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Zheng, Y., Davis, B. T., and Jordan, M. 2004. Performance evaluation of exclusive cache hierarchies. In Proceedings of the 2004 IEEE International Symposium on Performance Analysis of Systems and Software. IEEE, Los Alamitos, CA, 89--96. http://dl.acm.org/citation.cfm?id=1153925.1154591. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Temporal-based multilevel correlating inclusive cache replacement

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image ACM Transactions on Architecture and Code Optimization
      ACM Transactions on Architecture and Code Optimization  Volume 10, Issue 4
      December 2013
      1046 pages
      ISSN:1544-3566
      EISSN:1544-3973
      DOI:10.1145/2541228
      Issue’s Table of Contents

      Copyright © 2013 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 1 December 2013
      • Accepted: 1 November 2013
      • Revised: 1 October 2013
      • Received: 1 June 2013
      Published in taco Volume 10, Issue 4

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article
      • Research
      • Refereed

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader