research-article

Improving Performance in Sub-Block Caches with Optimized Replacement Policies

Authors:
Oluleye Olorode

The University of Texas at Dallas, Richardson, TX

The University of Texas at Dallas, Richardson, TX
View Profile

,
Mehrdad Nourani

The University of Texas at Dallas, Richardson, TX

The University of Texas at Dallas, Richardson, TX
View Profile

ACM Journal on Emerging Technologies in Computing Systems Volume 11 Issue 4Article No.: 41pp 1–22https://doi.org/10.1145/2668127

Published:27 April 2015Publication History

ACM Journal on Emerging Technologies in Computing Systems

Abstract

Recent advances in computer processor design have led to the introduction of sub-blocking to cache architectures. Sub-block caches reduce the tag area and power overhead in caches without reducing the effective cache size by using fewer tags to index the full data RAM array. In spite of achieving reduced area and power overhead, sub-block caches suffer performance degradation due to cache trashing. This occurs when a wider cache line (super-block), made up of multiple valid cache lines (sub-blocks), is replaced or evicted when only a sub-block is to be allocated into the wider super-block. To address this problem, we propose cache replacement policies as they relate specifically to sub-block caches. We propose new replacement policies that are tuned for sub-block caches by adding more intelligence based on the valid state of individual sub-blocks of a super-block. We also investigate the effect of using a few level-0 registers to bypass a few level-1 cache pipe stages on sub-block cache performance. To evaluate the performance improvement offered by our proposed replacement policies and the use of level-0 registers, we developed a sub-block cache simulator based on the Simplescalar toolset for single-core evaluations and the Sniper Simulator for multicore evaluations. We show that, with minimal architectural updates to existing conventional cache replacement policies, we are able to improve level-1 cache hit rates by up to 4.17% using our proposed policies alone on SPEC2006 benchmarks and up to 14% in shared level-2 caches using multicore benchmark suites: PARSEC and SPLASH2.

References

Bryan Ackland, Alex Anesko, Douglas Brinthaupt, Steven J. Daubert, Asawaree Kalavade, et al. 2000. A single-chip, 1.6-billion, 16-B MAC/s multiprocessor DSP. IEEE J. Solid-State Circ. 35, 3, 412--424.Google ScholarCross Ref
Hussein Al-Zoubi, Aleksandar Milenkovic, and Milena Milenkovic. 2004. Performance evaluation of cache replacement policies for the SPEC CPU2000 benchmark suite. In Proceedings of the 42^nd Annual Southeast Regional Conference (ACM-SE'04). 267--272. Google ScholarDigital Library
Christian Bienia. 2011. Benchmarking modern multiprocessors. Ph.D. dissertation, Princeton University, NJ. Google ScholarDigital Library
Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17^th International Conference on Parallel Architectures and Compilation Techniques (PACT'08). ACM Press, New York, 72--81. Google ScholarDigital Library
David Brooks, Vivek Tiwari, and Margaret Martonosi. 2000. Wattch: A framework for architectural-level power analysis and optimizations. ACM SIGARCH Comput. Archit. News 28, 2, 83--94. Google ScholarDigital Library
Doug Burger and Todd M. Austin. 1997. The SimpleScalar tool set, version 2.0. ACM SIGARCH Comput. Archit. News 25, 3, 13--25. Google ScholarDigital Library
Trevor E. Carlson, Wim Heirman, and Lieven Eeckhout. 2011. Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation. In Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis (SC'11). ACM Press, New York, 52. Google ScholarDigital Library
Xiangyu Dong, Yuan Xie, Naveen Muralimanohar, and Norman P. Jouppi. 2010. Simple but effective heterogeneous main memory with on-chip memory controller support. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'10). IEEE Computer Society, 1--11. Google ScholarDigital Library
Jorge Garcia, Jesus Corbal, Llorenc Cerda, and Mateo Valero. 2003. Design and implementation of high-performance memory systems for future packet buffers. In Proceedings of the 36^th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'03). 372--384. Google ScholarDigital Library
Hassan Ghasemzadeh, Sepideh Mazrouee, and Mohammad Reza Kakoee. 2006. Modified pseudo LRU replacement algorithm. In Proceedings of the 13^th Annual IEEE International Symposium and Workshop on Engineering of Computer Based Systems (ECBS'06). 368--376. Google ScholarDigital Library
Wim Heirman, Trevor Carlson, and Lieven Eeckhout. 2012. Sniper: Scalable and accurate parallel multi-core simulation. In Proceedings of the 8^th International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems, Abstracts. High-Performance and Embedded Architecture and Compilation Network of Excellence (HiPEAC'12). 91--94.Google Scholar
INTEL. 2004. Intel Xscale®Core -- Developer's Manual. Intel. http://developer.intel.com.Google Scholar
INTEL. 2001. Intel ® Pentium ® 4 and Intel ® Xeon Processor Optimization -- Reference Manual. Intel. http://developer.intel.com.Google Scholar
Jonas Jalminger and Per Stenstrom. 2002. Improvement of energy-efficiency in off-chip caches by selective prefetching. Microprocess. Microsyst. 26, 3, 107--121.Google ScholarCross Ref
Xiaowei Jiang, Niti Madan, Li Zhao, Mike Upton, Ravishankar Iyer, Srihari Makineni, Donald Newell, Yan Solihin, and Rajeev Balasubramonian. 2010. CHOP: Adaptive filter-based DRAM caching for CMP server platforms. In Proceedings of the 16^th IEEE International Symposium on High Performance Computer Architecture (HPCA'10). 1--12.Google ScholarCross Ref
Murali Kadiyala and Laxmi N. Bhuyan. 1995. A dynamic cache sub-block design to reduce false sharing. In Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD'95). 313--318. Google ScholarDigital Library
Johnson Kin, Munish Gupta, and William H. Mangione-Smith. 1997. The filter cache: An energy efficient memory structure. In Proceedings of the 30^th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO'97). IEEE Computer Society, 184--193. Google ScholarDigital Library
John S. Liptay. 1968. Structural aspects of the system/360 model 85, II: The cache. IBM Syst. J. 7, 1, 5--21. Google ScholarDigital Library
Nihar R. Mahapatra and Balakrishna Venkatrao. 1999. The processor-memory bottleneck: Problems and solutions. Crossroads Comput. Archit. 5, 3es. Google ScholarDigital Library
Mahesh Mamidipaka and Nikil Dutt. 2004. eCACTI: An enhanced power estimation model for on-chip caches. Tech. rep. TR-04-28, Center for Embedded Computer Systems. http://www.ics.uci.edu&sim;maheshmn/eCACTI/ecacti&lowbar;tr.pdf.Google Scholar
Gabriel Moruz and Andrei Negoescu. 2012. Outperforming LRU via competitive analysis on parametrized inputs for paging. In Proceedings of the 23^rd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'12). SIAM, 1669--1680. Google ScholarDigital Library
David A. Patterson and John L. Hennessy. 2008. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann, San Fransisco. Google ScholarDigital Library
Moinuddin K. Qureshi, Aamer Jaleel, Yale N. Patt, Simon C. Steely, and Joel Emer. 2007. Adaptive insertion policies for high performance caching. ACM SIGARCH Comput. Archit. News 35, 381--391. Google ScholarDigital Library
Glen Reinman and Norman P. Jouppi. 2000. CACTI 2.0: An integrated cache timing and power model. Res. rep. 2000/7, Western Research Lab. http://www.hpl.hp.com/research/cacti/cacti2.pdf.Google Scholar
Yannis Smaragdakis, Scott Kaplan, and Paul Wilson. 2003. The EELRU adaptive replacement algorithm. Perform. Eval. 53, 2, 93--123. Google ScholarDigital Library
Heiko Sparenberg, Matthias Martin, and Siegfried Foessel. 2012. Introduction of eviction strategies for caching scalable media files. In Proceedings of the 7^th International Conference on Digital Information Management (ICDIM'12). Simon Fong, Pit Pichappan, Sabah Mohammed, Patrick Hung, and Sohail Asghar, Eds. IEEE, 352--356. http://dblp.uni-trier.de/db/conf/icdim/icdim2012.html\#SparenbergMF12.Google ScholarCross Ref
Ching-Long Su and Alvin M. Despain. 1995. Cache design trade-offs for power and performance optimization: A case study. In Proceedings of the International Symposium on Low Power Design (ISPLED'95). 63--68. Google ScholarDigital Library
Josep Torrellas, Monica S. Lam, and John L. Hennessy. 1994. False sharing and spatial locality in multiprocessor caches. IEEE Trans. Comput. 43, 6, 651--663. Google ScholarDigital Library
Hao Wang, Haiquan Zhao, Bill Lin, and Jun Xu. 2010. Design and analysis of a robust pipelined memory system. In Proceedings of the IEEE Conference on Computer Communications (INFOCOM'10). 1--9. Google ScholarDigital Library
Wayne A. Wong and Jean-Loup Baer. 2000. Modified LRU policies for improving second-level cache behavior. In Proceedings of the 6^th International Symposium on High-Performance Computer Architecture (HPCA'00). 49--60.Google Scholar
Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. 1995. The SPLASH-2 programs: Characterization and methodological considerations. ACM SIGARCH Comput. Archit. News 23, 24--36. Google ScholarDigital Library
Chia-Linx Chia-Lin Yang and Chien-Hao Lee. 2004. HotSpot cache: Joint temporal and spatial locality exploitation for i-cache energy reduction. In Proceedings of the International Symposium on Low Power Electronics and Design (ISPLED'04). 114--119. Google ScholarDigital Library
Li Zhao, Ravi Iyer, Ramesh Illikkal, and Donald Newell. 2007. Exploring DRAM cache architectures for CMP server platforms. In Proceedings of the 25^th International Conference on Computer Design (ICCD'07). 55--62.Google ScholarCross Ref

Index Terms

Improving Performance in Sub-Block Caches with Optimized Replacement Policies
1. Hardware
  1. Electronic design automation
    1. High-level and register-transfer level synthesis
  2. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Recommendations

Simple penalty-sensitive replacement policies for caches
CF '06: Proceedings of the 3rd conference on Computing frontiers

Classic cache replacement policies assume that miss costs are uniform. However, the correlation between miss rate and cache performance is not as straightforward as it used to be. Ultimately, the true cost measure of a miss should be the penalty, i.e. ...
Read More
Introducing hierarchy-awareness in replacement and bypass algorithms for last-level caches
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques

The replacement policies for the last-level caches (LLCs) are usually designed based on the access information available locally at the LLC. These policies are inherently sub-optimal due to lack of information about the activities in the inner-levels of ...
Read More
A new cache replacement algorithm for last-level caches by exploiting tag-distance correlation of cache lines

Cache memory plays a crucial role in determining the performance of processors, especially for embedded processors where area and power are tightly constrained. It is necessary to have effective management mechanisms, such as cache replacement policies, ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM Journal on Emerging Technologies in Computing Systems Volume 11, Issue 4
Special Issues on Neuromorphic Computing and Emerging Many-Core Systems for Exascale Computing
April 2015
231 pages
ISSN:1550-4832
EISSN:1550-4840
DOI:10.1145/2767119
Editor:
Krishnendu Chakrabarty
Duke University, USA
Issue’s Table of Contents
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States

Journal Family
ACM Journals for the Design of Smart and Connected Systems
Publication History
- Published: 27 April 2015
- Accepted: 1 September 2014
- Revised: 1 July 2014
- Received: 1 March 2014
Published in jetc Volume 11, Issue 4

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Cache architecture
memory system
multicore
performance
replacement policy
simulation
static and dynamic power
sub-block caching
systems
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 4
  Total Citations
  View Citations
- 432
  Total Downloads
- Downloads (Last 12 months)25
- Downloads (Last 6 weeks)5
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Improving Performance in Sub-Block Caches with Optimized Replacement Policies

ACM Journal on Emerging Technologies in Computing Systems

Abstract

References

Cited By

Index Terms

Recommendations

Simple penalty-sensitive replacement policies for caches

Introducing hierarchy-awareness in replacement and bypass algorithms for last-level caches

A new cache replacement algorithm for last-level caches by exploiting tag-distance correlation of cache lines