Abstract
This article proposes Benzene, an energy-efficient distributed SRAM/STT-RAM hybrid cache for manycore systems running multiple applications. It is based on the observation that naïvely applying hybrid cache techniques to the distributed caches of a manycore architecture yields only limited energy reduction, because the scarce SRAM is utilized unevenly. We propose optimization techniques at two levels: intra-bank and inter-bank. Intra-bank optimization leverages a highly associative cache design to distribute writes more uniformly within a bank. Inter-bank optimization balances the amount of write-intensive data evenly across the banks. Our evaluation results show that Benzene significantly reduces the energy consumption of distributed hybrid caches.
Index Terms
- Benzene: An Energy-Efficient Distributed Hybrid Cache Architecture for Manycore Systems