DOI: 10.1145/2968455.2968510

Hybrid network-on-chip architectures for accelerating deep learning kernels on heterogeneous manycore platforms

Published: 01 October 2016

ABSTRACT

In recent years, designing specialized manycore heterogeneous architectures for deep learning kernels has become an area of great interest. However, the typical on-chip communication infrastructures employed on conventional manycore platforms are unable to handle both CPU and GPU communication requirements efficiently. Hence, in this paper, we aim to enhance the performance of heterogeneous manycore architectures through the design of a hybrid network-on-chip (NoC) consisting of both wireline and wireless links. To this end, we specifically target the resource-intensive backpropagation algorithm commonly used as the training method in deep learning. For backpropagation, the proposed hybrid NoC achieves a 1.9X reduction in network latency and improves network throughput by a factor of 2 with respect to a highly optimized mesh NoC. These network-level improvements translate into 25% savings in full-system energy-delay product (EDP). This demonstrates the capability of the proposed hybrid and heterogeneous manycore architecture to accelerate deep learning kernels in an energy-efficient manner.
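
The abstract reports its system-level result as savings in energy-delay product (EDP), that is, energy multiplied by execution time. As a minimal sketch of how such a relative saving is computed, the Python below uses hypothetical normalized values; the function names and the numbers are illustrative assumptions and are not taken from the paper's measurements.

```python
def edp(energy: float, delay: float) -> float:
    """Energy-delay product: energy consumed multiplied by execution time."""
    return energy * delay


def edp_savings(base_energy: float, base_delay: float,
                new_energy: float, new_delay: float) -> float:
    """Fractional EDP reduction of a new design relative to a baseline."""
    baseline = edp(base_energy, base_delay)
    return 1.0 - edp(new_energy, new_delay) / baseline


if __name__ == "__main__":
    # Hypothetical values normalized to the baseline (baseline = 1.0 for both).
    # These are NOT the paper's data; they only illustrate the arithmetic.
    saving = edp_savings(base_energy=1.0, base_delay=1.0,
                         new_energy=0.9, new_delay=0.9)
    print(f"Relative EDP saving: {saving:.0%}")
```

Because EDP is a product, a design that lowers both energy and delay modestly can show a larger combined saving than either factor alone.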


        • Published in

          CASES '16: Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems
          October 2016
          187 pages
          ISBN: 9781450344821
          DOI: 10.1145/2968455

          Copyright © 2016 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Acceptance Rates

          Overall Acceptance Rate: 52 of 230 submissions, 23%
