ABSTRACT
Convolutional neural networks (CNNs) are widely adopted to make predictions on large amounts of data in modern embedded systems. Multiply-accumulate (MAC) operations constitute the most computationally expensive portion of a CNN. Compared with executing MAC operations on GPUs and FPGAs, implementing CNNs in an RRAM crossbar-based computing system (RCS) offers outstanding advantages in performance and power. However, current designs incur very high overhead in peripheral circuits and memory accesses, which limits the gains of RCS.
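As a concrete illustration (not taken from the paper), the MAC-dominated inner loops of a convolutional layer can be sketched as follows; every innermost step is one multiply-accumulate, which is exactly the operation an RRAM crossbar evaluates in the analog domain:

```python
import numpy as np

def conv2d_mac(inp, weights):
    """Naive convolution that counts MAC operations.

    inp:     (C_in, H, W) input feature maps
    weights: (C_out, C_in, K, K) kernels
    Returns the (C_out, H-K+1, W-K+1) output maps and the MAC count.
    """
    c_out, c_in, k, _ = weights.shape
    _, h, w = inp.shape
    out = np.zeros((c_out, h - k + 1, w - k + 1))
    macs = 0
    for co in range(c_out):
        for y in range(h - k + 1):
            for x in range(w - k + 1):
                acc = 0.0
                for ci in range(c_in):
                    for ky in range(k):
                        for kx in range(k):
                            # One multiply-accumulate per innermost iteration.
                            acc += inp[ci, y + ky, x + kx] * weights[co, ci, ky, kx]
                            macs += 1
                out[co, y, x] = acc
    return out, macs
```

Even for this toy size, the MAC count grows as C_out · C_in · K² per output pixel, which is why the MAC loops dominate CNN compute cost.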
To address this problem, a Multi-CLP (Convolutional Layer Processor) structure was recently proposed, in which the FPGA control resources are shared by multiple computation units. Building on this idea, the Peripheral Circuit Unit (PeriCU)-Reuse scheme was proposed, whose underlying idea is to put the expensive ADCs/DACs in the spotlight and arrange for multiple convolution layers to be served sequentially by the same PeriCU. This paper adopts both structures. We further observe that memory accesses can be bypassed if two adjacent layers are assigned to different CLPs. A loop tiling technique is proposed to enable this memory-access bypassing and further reduce the energy consumption of RCS. To guarantee correct data dependencies between layers, we also derive the safe starting time of a layer whose predecessor is tiled in a different CLP. Experiments on two convolutional applications validate that the loop tiling technique, integrated with the Multi-CLP structure, can efficiently meet power budgets and reduce energy consumption by a further 61.7%.
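The data-dependency constraint above can be sketched with a hypothetical timing model (the function name and the row-based throughput model are assumptions for illustration, not the paper's implementation): when adjacent layers are tiled onto different CLPs, the consumer layer may only start once the producer has emitted the input rows its first tile depends on, which defines the safe starting time.

```python
def safe_start_time(producer_start, rows_needed, rows_per_cycle):
    """Earliest cycle at which the consumer CLP may begin its first tile.

    producer_start: cycle at which the producer layer begins
    rows_needed:    producer output rows the consumer's first tile reads
    rows_per_cycle: producer's output-row throughput (rows per cycle)
    """
    # Ceiling division: cycles the producer needs to emit the dependency.
    cycles = -(-rows_needed // rows_per_cycle)
    return producer_start + cycles
```

Starting the consumer any earlier would read rows the producer has not yet written; starting later wastes the opportunity to bypass off-chip memory by consuming tiles as they are produced.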
Index Terms
- Low power driven loop tiling for RRAM crossbar-based CNN