ABSTRACT
In recent years, designing specialized heterogeneous manycore architectures for deep learning kernels has become an area of great interest. However, the on-chip communication infrastructures typically employed on conventional manycore platforms cannot handle both CPU and GPU communication requirements efficiently. In this paper, we aim to enhance the performance of heterogeneous manycore architectures by designing a hybrid network-on-chip (NoC) that combines wireline and wireless links. We specifically target the resource-intensive backpropagation algorithm commonly used for training in deep learning. For backpropagation, the proposed hybrid NoC reduces network latency by a factor of 1.9 and doubles network throughput relative to a highly optimized mesh NoC. These network-level improvements translate into 25% savings in full-system energy-delay product (EDP). This demonstrates that the proposed hybrid, heterogeneous manycore architecture can accelerate deep learning kernels in an energy-efficient manner.