Abstract
Recent advances in computer processor design have led to the introduction of sub-blocking to cache architectures. Sub-block caches reduce the tag area and power overhead in caches without reducing the effective cache size by using fewer tags to index the full data RAM array. In spite of achieving reduced area and power overhead, sub-block caches suffer performance degradation due to cache trashing. This occurs when a wider cache line (super-block), made up of multiple valid cache lines (sub-blocks), is replaced or evicted when only a sub-block is to be allocated into the wider super-block. To address this problem, we propose cache replacement policies as they relate specifically to sub-block caches. We propose new replacement policies that are tuned for sub-block caches by adding more intelligence based on the valid state of individual sub-blocks of a super-block. We also investigate the effect of using a few level-0 registers to bypass a few level-1 cache pipe stages on sub-block cache performance. To evaluate the performance improvement offered by our proposed replacement policies and the use of level-0 registers, we developed a sub-block cache simulator based on the Simplescalar toolset for single-core evaluations and the Sniper Simulator for multicore evaluations. We show that, with minimal architectural updates to existing conventional cache replacement policies, we are able to improve level-1 cache hit rates by up to 4.17% using our proposed policies alone on SPEC2006 benchmarks and up to 14% in shared level-2 caches using multicore benchmark suites: PARSEC and SPLASH2.
- Bryan Ackland, Alex Anesko, Douglas Brinthaupt, Steven J. Daubert, Asawaree Kalavade, et al. 2000. A single-chip, 1.6-billion, 16-B MAC/s multiprocessor DSP. IEEE J. Solid-State Circ. 35, 3, 412--424.Google ScholarCross Ref
- Hussein Al-Zoubi, Aleksandar Milenkovic, and Milena Milenkovic. 2004. Performance evaluation of cache replacement policies for the SPEC CPU2000 benchmark suite. In Proceedings of the 42nd Annual Southeast Regional Conference (ACM-SE'04). 267--272. Google ScholarDigital Library
- Christian Bienia. 2011. Benchmarking modern multiprocessors. Ph.D. dissertation, Princeton University, NJ. Google ScholarDigital Library
- Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT'08). ACM Press, New York, 72--81. Google ScholarDigital Library
- David Brooks, Vivek Tiwari, and Margaret Martonosi. 2000. Wattch: A framework for architectural-level power analysis and optimizations. ACM SIGARCH Comput. Archit. News 28, 2, 83--94. Google ScholarDigital Library
- Doug Burger and Todd M. Austin. 1997. The SimpleScalar tool set, version 2.0. ACM SIGARCH Comput. Archit. News 25, 3, 13--25. Google ScholarDigital Library
- Trevor E. Carlson, Wim Heirman, and Lieven Eeckhout. 2011. Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation. In Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis (SC'11). ACM Press, New York, 52. Google ScholarDigital Library
- Xiangyu Dong, Yuan Xie, Naveen Muralimanohar, and Norman P. Jouppi. 2010. Simple but effective heterogeneous main memory with on-chip memory controller support. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'10). IEEE Computer Society, 1--11. Google ScholarDigital Library
- Jorge Garcia, Jesus Corbal, Llorenc Cerda, and Mateo Valero. 2003. Design and implementation of high-performance memory systems for future packet buffers. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'03). 372--384. Google ScholarDigital Library
- Hassan Ghasemzadeh, Sepideh Mazrouee, and Mohammad Reza Kakoee. 2006. Modified pseudo LRU replacement algorithm. In Proceedings of the 13th Annual IEEE International Symposium and Workshop on Engineering of Computer Based Systems (ECBS'06). 368--376. Google ScholarDigital Library
- Wim Heirman, Trevor Carlson, and Lieven Eeckhout. 2012. Sniper: Scalable and accurate parallel multi-core simulation. In Proceedings of the 8th International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems, Abstracts. High-Performance and Embedded Architecture and Compilation Network of Excellence (HiPEAC'12). 91--94.Google Scholar
- INTEL. 2004. Intel Xscale®Core -- Developer's Manual. Intel. http://developer.intel.com.Google Scholar
- INTEL. 2001. Intel ® Pentium ® 4 and Intel ® Xeon Processor Optimization -- Reference Manual. Intel. http://developer.intel.com.Google Scholar
- Jonas Jalminger and Per Stenstrom. 2002. Improvement of energy-efficiency in off-chip caches by selective prefetching. Microprocess. Microsyst. 26, 3, 107--121.Google ScholarCross Ref
- Xiaowei Jiang, Niti Madan, Li Zhao, Mike Upton, Ravishankar Iyer, Srihari Makineni, Donald Newell, Yan Solihin, and Rajeev Balasubramonian. 2010. CHOP: Adaptive filter-based DRAM caching for CMP server platforms. In Proceedings of the 16th IEEE International Symposium on High Performance Computer Architecture (HPCA'10). 1--12.Google ScholarCross Ref
- Murali Kadiyala and Laxmi N. Bhuyan. 1995. A dynamic cache sub-block design to reduce false sharing. In Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD'95). 313--318. Google ScholarDigital Library
- Johnson Kin, Munish Gupta, and William H. Mangione-Smith. 1997. The filter cache: An energy efficient memory structure. In Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO'97). IEEE Computer Society, 184--193. Google ScholarDigital Library
- John S. Liptay. 1968. Structural aspects of the system/360 model 85, II: The cache. IBM Syst. J. 7, 1, 5--21. Google ScholarDigital Library
- Nihar R. Mahapatra and Balakrishna Venkatrao. 1999. The processor-memory bottleneck: Problems and solutions. Crossroads Comput. Archit. 5, 3es. Google ScholarDigital Library
- Mahesh Mamidipaka and Nikil Dutt. 2004. eCACTI: An enhanced power estimation model for on-chip caches. Tech. rep. TR-04-28, Center for Embedded Computer Systems. http://www.ics.uci.edu∼maheshmn/eCACTI/ecacti_tr.pdf.Google Scholar
- Gabriel Moruz and Andrei Negoescu. 2012. Outperforming LRU via competitive analysis on parametrized inputs for paging. In Proceedings of the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'12). SIAM, 1669--1680. Google ScholarDigital Library
- David A. Patterson and John L. Hennessy. 2008. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann, San Fransisco. Google ScholarDigital Library
- Moinuddin K. Qureshi, Aamer Jaleel, Yale N. Patt, Simon C. Steely, and Joel Emer. 2007. Adaptive insertion policies for high performance caching. ACM SIGARCH Comput. Archit. News 35, 381--391. Google ScholarDigital Library
- Glen Reinman and Norman P. Jouppi. 2000. CACTI 2.0: An integrated cache timing and power model. Res. rep. 2000/7, Western Research Lab. http://www.hpl.hp.com/research/cacti/cacti2.pdf.Google Scholar
- Yannis Smaragdakis, Scott Kaplan, and Paul Wilson. 2003. The EELRU adaptive replacement algorithm. Perform. Eval. 53, 2, 93--123. Google ScholarDigital Library
- Heiko Sparenberg, Matthias Martin, and Siegfried Foessel. 2012. Introduction of eviction strategies for caching scalable media files. In Proceedings of the 7th International Conference on Digital Information Management (ICDIM'12). Simon Fong, Pit Pichappan, Sabah Mohammed, Patrick Hung, and Sohail Asghar, Eds. IEEE, 352--356. http://dblp.uni-trier.de/db/conf/icdim/icdim2012.html\#SparenbergMF12.Google ScholarCross Ref
- Ching-Long Su and Alvin M. Despain. 1995. Cache design trade-offs for power and performance optimization: A case study. In Proceedings of the International Symposium on Low Power Design (ISPLED'95). 63--68. Google ScholarDigital Library
- Josep Torrellas, Monica S. Lam, and John L. Hennessy. 1994. False sharing and spatial locality in multiprocessor caches. IEEE Trans. Comput. 43, 6, 651--663. Google ScholarDigital Library
- Hao Wang, Haiquan Zhao, Bill Lin, and Jun Xu. 2010. Design and analysis of a robust pipelined memory system. In Proceedings of the IEEE Conference on Computer Communications (INFOCOM'10). 1--9. Google ScholarDigital Library
- Wayne A. Wong and Jean-Loup Baer. 2000. Modified LRU policies for improving second-level cache behavior. In Proceedings of the 6th International Symposium on High-Performance Computer Architecture (HPCA'00). 49--60.Google Scholar
- Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. 1995. The SPLASH-2 programs: Characterization and methodological considerations. ACM SIGARCH Comput. Archit. News 23, 24--36. Google ScholarDigital Library
- Chia-Linx Chia-Lin Yang and Chien-Hao Lee. 2004. HotSpot cache: Joint temporal and spatial locality exploitation for i-cache energy reduction. In Proceedings of the International Symposium on Low Power Electronics and Design (ISPLED'04). 114--119. Google ScholarDigital Library
- Li Zhao, Ravi Iyer, Ramesh Illikkal, and Donald Newell. 2007. Exploring DRAM cache architectures for CMP server platforms. In Proceedings of the 25th International Conference on Computer Design (ICCD'07). 55--62.Google ScholarCross Ref
Index Terms
- Improving Performance in Sub-Block Caches with Optimized Replacement Policies
Recommendations
Simple penalty-sensitive replacement policies for caches
CF '06: Proceedings of the 3rd conference on Computing frontiersClassic cache replacement policies assume that miss costs are uniform. However, the correlation between miss rate and cache performance is not as straightforward as it used to be. Ultimately, the true cost measure of a miss should be the penalty, i.e. ...
Introducing hierarchy-awareness in replacement and bypass algorithms for last-level caches
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniquesThe replacement policies for the last-level caches (LLCs) are usually designed based on the access information available locally at the LLC. These policies are inherently sub-optimal due to lack of information about the activities in the inner-levels of ...
A new cache replacement algorithm for last-level caches by exploiting tag-distance correlation of cache lines
Cache memory plays a crucial role in determining the performance of processors, especially for embedded processors where area and power are tightly constrained. It is necessary to have effective management mechanisms, such as cache replacement policies, ...
Comments