ABSTRACT
Deep neural networks (DNNs) have demonstrated highly promising results across computer vision and speech recognition, and are becoming foundational for ubiquitous AI. The computational complexity of these algorithms and the need for high energy efficiency have led to a surge in research on hardware accelerators. To reduce the latency and energy costs of accessing DRAM, most DNN accelerators are spatial in nature, with hundreds of processing elements (PEs) operating in parallel and communicating with each other directly. DNNs are evolving at a rapid rate, and it is common for recent topologies to mix convolutional, recurrent, pooling, and fully-connected layers with varying input and filter sizes. These layers may be dense or sparse, and they can be partitioned in myriad ways (within and across layers) to exploit data reuse (of weights and intermediate outputs). Each of these choices leads to a different dataflow pattern within the accelerator substrate. Unfortunately, most DNN accelerators support only fixed dataflow patterns internally, because they perform a careful co-design of the PEs and the network-on-chip (NoC); in fact, most are optimized only for traffic within a convolutional layer. This makes it challenging to map arbitrary dataflows onto the fabric efficiently and can leave the available compute resources underutilized. DNN accelerators need to be programmable to enable mass deployment, and to be programmable they must be internally configurable to support the various dataflow patterns that could be mapped onto them. To address this need, we present MAERI, a DNN accelerator built from a set of modular, configurable building blocks that can easily support myriad DNN partitions and mappings by appropriately configuring tiny switches. MAERI provides 8-459% better compute utilization across multiple dataflow mappings than baselines with rigid NoC fabrics.
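To make the notion of a "dataflow" concrete, the minimal Python sketch below (our illustration; the function names and loop structure are not from the paper) expresses the same 1-D convolution under two loop orderings. A weight-stationary ordering reuses each filter weight across all output positions, while an output-stationary ordering fully accumulates each output before moving on; the results are identical, but the on-chip traffic and reuse patterns differ, which is why a fabric hardwired for one ordering underutilizes its PEs on another.

```python
# Illustrative sketch of two dataflows (loop orderings) for one 1-D convolution.
# Hypothetical names; not taken from the MAERI paper.
import numpy as np

def conv1d_weight_stationary(inputs, weights):
    """Each weight is fetched once and reused across every output position."""
    out_len = len(inputs) - len(weights) + 1
    outputs = np.zeros(out_len)
    for w_idx, w in enumerate(weights):      # weights held stationary (outer loop)
        for o_idx in range(out_len):         # inputs/outputs stream past each weight
            outputs[o_idx] += w * inputs[o_idx + w_idx]
    return outputs

def conv1d_output_stationary(inputs, weights):
    """Each output accumulates fully in place; weights are re-fetched per output."""
    out_len = len(inputs) - len(weights) + 1
    outputs = np.zeros(out_len)
    for o_idx in range(out_len):             # outputs held stationary (outer loop)
        for w_idx, w in enumerate(weights):
            outputs[o_idx] += w * inputs[o_idx + w_idx]
    return outputs

# Same computation, different reuse pattern: both orderings agree on the result.
x = np.arange(8.0)
k = np.array([1.0, 0.0, -1.0])
assert np.allclose(conv1d_weight_stationary(x, k), conv1d_output_stationary(x, k))
```

A fixed-dataflow accelerator effectively bakes one such ordering into its PEs and NoC; the configurable switches the abstract describes are what allow a single MAERI fabric to serve many orderings and partitionings.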