research-article

Improving Performance in Sub-Block Caches with Optimized Replacement Policies

Published: 27 April 2015

Abstract

Recent advances in processor design have introduced sub-blocking into cache architectures. Sub-block caches reduce tag area and power overhead without reducing the effective cache size, because fewer tags index the full data RAM array. Despite these savings, sub-block caches suffer performance degradation from cache thrashing. This occurs when a wider cache line (super-block), made up of multiple valid cache lines (sub-blocks), is replaced or evicted even though only a single sub-block needs to be allocated into it. To address this problem, we propose replacement policies tuned specifically for sub-block caches; they take into account the valid state of the individual sub-blocks of a super-block. We also investigate how using a few level-0 registers to bypass a few level-1 cache pipe stages affects sub-block cache performance. To evaluate the proposed replacement policies and the use of level-0 registers, we developed a sub-block cache simulator based on the SimpleScalar toolset for single-core evaluations and the Sniper simulator for multicore evaluations. We show that, with minimal architectural updates to existing conventional cache replacement policies, our proposed policies alone improve level-1 cache hit rates by up to 4.17% on SPEC2006 benchmarks and improve hit rates by up to 14% in shared level-2 caches on the PARSEC and SPLASH2 multicore benchmark suites.
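
The thrashing problem and the idea of a valid-state-aware policy can be illustrated with a small sketch. This is not the authors' policy, which the abstract does not specify. It only assumes one plausible heuristic: prefer to evict the super-block with the fewest valid sub-blocks, breaking ties by least-recent use. The names SuperBlock, lru_victim, valid_aware_victim, and access, and the choice of four sub-blocks per super-block, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SuperBlock:
    """One way of a set: a single tag shared by several sub-blocks (hypothetical model)."""
    tag: Optional[int] = None
    valid: List[bool] = field(default_factory=lambda: [False] * 4)  # assumed 4 sub-blocks
    last_used: int = 0

    def valid_count(self) -> int:
        return sum(self.valid)


def lru_victim(ways: List[SuperBlock]) -> int:
    """Conventional LRU: evict the least recently used super-block."""
    return min(range(len(ways)), key=lambda i: ways[i].last_used)


def valid_aware_victim(ways: List[SuperBlock]) -> int:
    """Assumed heuristic: evict the super-block holding the fewest valid
    sub-blocks, using LRU order only to break ties."""
    return min(range(len(ways)),
               key=lambda i: (ways[i].valid_count(), ways[i].last_used))


def access(ways: List[SuperBlock], tag: int, sub: int, now: int,
           pick_victim: Callable[[List[SuperBlock]], int]) -> bool:
    """Access sub-block `sub` of super-block `tag`; return True on a hit."""
    for way in ways:
        if way.tag == tag:
            hit = way.valid[sub]
            way.valid[sub] = True   # on a sub-block miss, fill only that sub-block
            way.last_used = now
            return hit
    # Tag miss: the whole super-block is replaced, discarding all of its
    # valid sub-blocks even though only one sub-block is being allocated.
    victim = ways[pick_victim(ways)]
    victim.tag = tag
    victim.valid = [False] * len(victim.valid)
    victim.valid[sub] = True
    victim.last_used = now
    return False
```

In a two-way set where the older super-block is fully valid and the newer one holds a single valid sub-block, lru_victim discards four valid sub-blocks to allocate one, whereas valid_aware_victim discards only one.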

      • Published in

        ACM Journal on Emerging Technologies in Computing Systems, Volume 11, Issue 4
        Special Issues on Neuromorphic Computing and Emerging Many-Core Systems for Exascale Computing
        April 2015
        231 pages
        ISSN: 1550-4832
        EISSN: 1550-4840
        DOI: 10.1145/2767119

        Copyright © 2015 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 27 April 2015
        • Accepted: 1 September 2014
        • Revised: 1 July 2014
        • Received: 1 March 2014

        Qualifiers

        • research-article
        • Research
        • Refereed
