research-article

Hybrid network-on-chip architectures for accelerating deep learning kernels on heterogeneous manycore platforms

Authors:
Wonje Choi, Washington State University
Karthi Duraisamy, Washington State University
Ryan Gary Kim, Carnegie Mellon University
Janardhan Rao Doppa, Washington State University
Partha Pratim Pande, Washington State University
Radu Marculescu, Carnegie Mellon University
Diana Marculescu, Carnegie Mellon University

CASES '16: Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, October 2016, Article No.: 13, Pages 1–10, https://doi.org/10.1145/2968455.2968510

Published: 01 October 2016

ABSTRACT

In recent years, designing specialized heterogeneous manycore architectures for deep learning kernels has become an area of great interest. However, the on-chip communication infrastructures typically employed on conventional manycore platforms cannot handle both CPU and GPU communication requirements efficiently. In this paper, we aim to enhance the performance of heterogeneous manycore architectures by designing a hybrid network-on-chip (NoC) consisting of both wireline and wireless links. To this end, we specifically target the resource-intensive backpropagation algorithm commonly used as the training method in deep learning. For backpropagation, the proposed hybrid NoC achieves a 1.9X reduction in network latency and improves network throughput by a factor of 2 with respect to a highly optimized mesh NoC. These network-level improvements translate into 25% savings in full-system energy-delay product (EDP). This demonstrates the capability of the proposed hybrid and heterogeneous manycore architecture to accelerate deep learning kernels in an energy-efficient manner.
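
As a rough illustration of the energy-delay product (EDP) metric reported above, the sketch below computes EDP as energy multiplied by delay and derives the fractional saving relative to a baseline. The function names and the energy and delay values are hypothetical, chosen only so that the arithmetic yields a 25% saving; they are not measurements from the paper.

```python
def energy_delay_product(energy_joules: float, delay_seconds: float) -> float:
    """Return the energy-delay product (EDP); lower values are better."""
    return energy_joules * delay_seconds


def edp_savings(baseline_edp: float, new_edp: float) -> float:
    """Return the fractional EDP reduction relative to the baseline."""
    return 1.0 - new_edp / baseline_edp


# Hypothetical values (assumptions for illustration, not data from the paper).
mesh_edp = energy_delay_product(energy_joules=10.0, delay_seconds=2.0)
hybrid_edp = energy_delay_product(energy_joules=10.0, delay_seconds=1.5)

# Fractional saving of the hypothetical hybrid design over the mesh baseline.
savings = edp_savings(mesh_edp, hybrid_edp)
print(f"EDP savings: {savings:.0%}")
```

Because EDP multiplies energy by delay, a design can reduce it by lowering either factor; in this toy case energy is held constant and only the delay shrinks.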

Published in

CASES '16: Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems
October 2016
187 pages
ISBN: 9781450344821
DOI: 10.1145/2968455

Copyright © 2016 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
NoC
backpropagation
deep learning
heterogeneous
manycore
Acceptance Rates
Overall Acceptance Rate: 52 of 230 submissions, 23%
Article Metrics
- Total Citations: 38
- Total Downloads: 1,010
- Downloads (Last 12 months): 97
- Downloads (Last 6 weeks): 9