ABSTRACT
Recent research has proposed die-stacked Last-Level Caches (LLCs) to overcome the Memory Wall. Lately, Spin-Transfer Torque Random Access Memory (STT-RAM) caches have been recommended, as they provide improved energy efficiency compared to DRAM caches. However, the recently proposed STT-RAM cache architecture dissipates energy unnecessarily by fetching unneeded cache lines into the row buffer. In this paper, we propose a Selective Read Policy for STT-RAM that fetches into the row buffer only those cache lines that are likely to be reused. This reduces the number of cache line reads and thereby the energy consumption. Further, we propose two key performance optimizations, namely a Row Buffer Tags Bypass Policy and an LLC Data Cache. Both optimizations reduce the LLC access latency and therefore improve overall performance. For evaluation, we implement our proposed architecture in the Zesto simulator and run different combinations of SPEC2006 benchmarks on an 8-core system. We show that our synergistic policies reduce the average LLC dynamic energy consumption by 72.6% and improve system performance by 1.3% compared to the recently proposed STT-RAM LLC. Compared to the state-of-the-art DRAM LLC, our architecture reduces the LLC dynamic energy consumption by 90.6% and improves system performance by 1.4%.
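The core idea of the Selective Read Policy can be illustrated with a minimal sketch: a reuse predictor gates which cache lines of an activated row are actually read into the row buffer. This is not the authors' implementation; the class, counter scheme, and threshold below are hypothetical and chosen only to make the mechanism concrete.

```python
# Illustrative sketch of a selective-read idea (assumed, not the paper's
# actual design): fetch a cache line into the row buffer only if a simple
# reuse predictor expects it to be accessed again.

class SelectiveReadPolicy:
    def __init__(self):
        # Hypothetical per-line saturating reuse counters (max 3).
        self.reuse = {}

    def on_access(self, line):
        # Strengthen the reuse prediction for a line on each access.
        self.reuse[line] = min(self.reuse.get(line, 0) + 1, 3)

    def should_fetch(self, line):
        # Only lines seen before are predicted reusable.
        return self.reuse.get(line, 0) >= 1


def fetch_row(row_lines, policy):
    """Return only the lines the policy predicts will be reused.

    Lines skipped here are never read out of the STT-RAM array,
    which is where the dynamic-energy saving would come from.
    """
    return [line for line in row_lines if policy.should_fetch(line)]


policy = SelectiveReadPolicy()
policy.on_access("A")
policy.on_access("C")
print(fetch_row(["A", "B", "C", "D"], policy))  # only "A" and "C" are fetched
```

A conventional row-buffer design would read all four lines of the row; here the never-accessed lines "B" and "D" are filtered out before the array read.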
Efficient STT-RAM last-level-cache architecture to replace DRAM cache