Benzene: An Energy-Efficient Distributed Hybrid Cache Architecture for Manycore Systems

Authors:
Namhyung Kim, Seoul National University, Seoul, Korea
Junwhan Ahn, Seoul National University, Seoul, Korea (ORCID: 0000-0001-7613-0571)
Kiyoung Choi, Seoul National University, Seoul, Korea
Daniel Sanchez, MIT, Cambridge, MA
Donghoon Yoo, Samsung Electronics, Hwaseong, Korea
Soojung Ryu, Samsung Electronics, Hwaseong, Korea

ACM Transactions on Architecture and Code Optimization, Volume 15, Issue 1, Article No. 10, pp. 1–23, https://doi.org/10.1145/3177963

Published: 22 March 2018

Abstract

This article proposes Benzene, an energy-efficient distributed SRAM/STT-RAM hybrid cache for manycore systems running multiple applications. It is based on the observation that a naïve application of hybrid cache techniques to distributed caches in a manycore architecture suffers from limited energy reduction due to uneven utilization of scarce SRAM. We propose two-level optimization techniques: intra-bank and inter-bank. Intra-bank optimization leverages highly associative cache design, achieving more uniform distribution of writes within a bank. Inter-bank optimization evenly balances the amount of write-intensive data across the banks. Our evaluation results show that Benzene significantly reduces energy consumption of distributed hybrid caches.
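
The abstract does not say how the inter-bank optimization decides where write-intensive data goes, so the following is only an illustrative sketch of the general idea of balancing write load across banks. It swaps in a generic greedy heuristic (largest load first, always into the least-loaded bank) and is not the Benzene mechanism. The function name `balance_writes`, the application names, and the write-load numbers are all hypothetical.

```python
import heapq


def balance_writes(write_load, num_banks):
    """Greedy sketch (NOT the paper's algorithm): place each item in the
    bank that currently holds the least write-intensive data.

    write_load: dict mapping a hypothetical application name to an
                assumed write-intensity estimate.
    Returns (assignment, totals): item -> bank index, and per-bank load.
    """
    # Min-heap of (accumulated write load, bank index).
    heap = [(0, bank) for bank in range(num_banks)]
    heapq.heapify(heap)

    assignment = {}
    # Heaviest writers first; ties broken by name for determinism.
    for name, load in sorted(write_load.items(), key=lambda kv: (-kv[1], kv[0])):
        total, bank = heapq.heappop(heap)
        assignment[name] = bank
        heapq.heappush(heap, (total + load, bank))

    totals = [0] * num_banks
    for name, bank in assignment.items():
        totals[bank] += write_load[name]
    return assignment, totals


# Hypothetical write-intensity estimates for four co-running applications.
loads = {"app_a": 8, "app_b": 7, "app_c": 3, "app_d": 2}
assignment, totals = balance_writes(loads, num_banks=2)
print(assignment)
print(totals)
```

In this toy input both banks end up with the same total write load, which is the outcome the abstract attributes to inter-bank balancing: no single bank's scarce SRAM is overwhelmed while another's sits idle.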

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multicore architectures
2. Hardware

Published in
ACM Transactions on Architecture and Code Optimization, Volume 15, Issue 1
March 2018
401 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3199680
Editor: Koen De Bosschere, Ghent University
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 22 March 2018
- Accepted: 1 December 2017
- Revised: 1 November 2017
- Received: 1 May 2017
Author Tags
Manycore systems
STT-RAM
distributed
energy-efficient
hybrid cache
Qualifiers
- research-article
- Research
- Refereed