MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects

Published: 19 March 2018

ABSTRACT

Deep neural networks (DNNs) have demonstrated highly promising results across computer vision and speech recognition, and are becoming foundational for ubiquitous AI. The computational complexity of these algorithms and the need for high energy efficiency have led to a surge in research on hardware accelerators. To reduce the latency and energy costs of accessing DRAM, most DNN accelerators are spatial in nature, with hundreds of processing elements (PEs) operating in parallel and communicating with each other directly. DNNs are evolving at a rapid rate, and it is common to have convolutional, recurrent, pooling, and fully-connected layers with varying input and filter sizes in the most recent topologies. They may be dense or sparse. They can also be partitioned in myriad ways (within and across layers) to exploit data reuse (weights and intermediate outputs). All of the above can lead to different dataflow patterns within the accelerator substrate. Unfortunately, most DNN accelerators support only fixed dataflow patterns internally, as they perform a careful co-design of the PEs and the network-on-chip (NoC). In fact, the majority of them are optimized only for traffic within a convolutional layer. This makes it challenging to map arbitrary dataflows onto the fabric efficiently, and can lead to underutilization of the available compute resources. DNN accelerators need to be programmable to enable mass deployment. For them to be programmable, they need to be configurable internally to support the various dataflow patterns that could be mapped over them. To address this need, we present MAERI, a DNN accelerator built with a set of modular and configurable building blocks that can easily support myriad DNN partitions and mappings by appropriately configuring tiny switches. MAERI provides 8-459% better utilization across multiple dataflow mappings over baselines with rigid NoC fabrics.
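To make the abstract's central idea concrete: a fixed pool of multipliers can be regrouped at configuration time into output groups of whatever size the current layer needs, whereas a rigid fabric that cannot regroup them leaves compute idle. Below is a minimal, purely illustrative Python sketch of that regrouping and the resulting utilization gap; it is not the paper's implementation, and every name in it (NUM_MULTIPLIERS, VirtualNeuron, map_layer) is an assumption made for this example.

# Illustrative sketch only (not the authors' design): regrouping a fixed pool of
# multiplier units into same-size groups ("virtual neurons"), one group per output,
# for layers with different numbers of multiplies per output.

from dataclasses import dataclass
from typing import List

NUM_MULTIPLIERS = 64  # size of the hypothetical multiplier array

@dataclass
class VirtualNeuron:
    first_unit: int   # index of the first multiplier assigned to this output
    size: int         # number of multiply results reduced into this output

def map_layer(mults_per_output: int, num_outputs: int) -> List[VirtualNeuron]:
    """Greedily pack outputs onto the multiplier array, one contiguous group each.

    A 3x3 convolution over 4 input channels needs 36 multiplies per output;
    a fully-connected output over 8 inputs needs 8. Regrouping the same
    multipliers for either shape is the flexibility a fixed-dataflow fabric lacks.
    """
    groups: List[VirtualNeuron] = []
    cursor = 0
    for _ in range(num_outputs):
        if cursor + mults_per_output > NUM_MULTIPLIERS:
            break  # no room left in this pass; remaining outputs wait for the next fold
        groups.append(VirtualNeuron(cursor, mults_per_output))
        cursor += mults_per_output
    return groups

def utilization(groups: List[VirtualNeuron]) -> float:
    return sum(g.size for g in groups) / NUM_MULTIPLIERS

if __name__ == "__main__":
    conv_groups = map_layer(mults_per_output=36, num_outputs=4)   # 1 group fits
    fc_groups = map_layer(mults_per_output=8, num_outputs=16)     # 8 groups fit
    print(f"conv mapping: {len(conv_groups)} outputs, utilization {utilization(conv_groups):.0%}")
    print(f"fc mapping:   {len(fc_groups)} outputs, utilization {utilization(fc_groups):.0%}")

In this toy model the fully-connected mapping fills all 64 multipliers while the small convolution leaves most of them idle; closing that kind of utilization gap across different mappings is what MAERI's configurable switches are intended to enable.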

Published in

ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, March 2018, 827 pages. ISBN: 9781450349116. DOI: 10.1145/3173162.

Also appears in ACM SIGPLAN Notices, Volume 53, Issue 2 (ASPLOS '18), February 2018, 809 pages. ISSN: 0362-1340. EISSN: 1558-1160. DOI: 10.1145/3296957.

Copyright © 2018 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

ASPLOS '18 paper acceptance rate: 56 of 319 submissions (18%). Overall acceptance rate: 535 of 2,713 submissions (20%).
