Abstract
Convolutional Neural Network (ConvNet or CNN) algorithms are characterized by a large number of model parameters and high computational complexity. These two requirements make implementation on resource-limited FPGAs challenging, and the challenge is magnified on low-end devices. While previous work has demonstrated successful ConvNet implementations on high-end FPGAs, this article presents a ConvNet accelerator design that enables complex deep ConvNet architectures on resource-constrained FPGA platforms aimed at the IoT market. We call the design “FeatherNet” for its light resource utilization. The implementation is VHDL-based, providing flexibility in design optimization. As part of the design process, new methods are introduced to address several design challenges. The first is a novel stride-aware graph-based method, targeted at ConvNets, that achieves efficient signal processing with reduced resource utilization. The second addresses the challenge of determining the minimal arithmetic precision needed while preserving high accuracy; for this challenge, we propose variable-width dynamic fixed-point representations combined with a layer-by-layer design-space pruning heuristic across the layers of the deep ConvNet model. The third aims at a modular design that supports different types of ConvNet layers while keeping resource utilization low; for this challenge, we propose relatively small modules, composed of computational filters, that can be interconnected to build an entire accelerator design. These modules are easily configured through HDL parameters (e.g., layer type, mask size, stride) to meet the needs of a specific ConvNet implementation, and thus can be reused to implement a wide variety of ConvNet architectures.
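The variable-width dynamic fixed-point idea above can be illustrated with a small behavioral sketch in Python. This is a conceptual model only, not the paper's HDL; the function names and the mean-absolute-error selection criterion are illustrative assumptions standing in for the layer-by-layer pruning heuristic.

```python
# Behavioral sketch of dynamic fixed-point quantization with a per-layer
# fractional-width search. Round-to-nearest with saturation is assumed.

def quantize(values, total_bits, frac_bits):
    """Quantize floats to signed fixed-point with `frac_bits` fractional bits,
    returning the real values the fixed-point codes represent."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))       # most negative representable code
    hi = (1 << (total_bits - 1)) - 1    # most positive representable code
    out = []
    for v in values:
        q = max(lo, min(hi, round(v * scale)))  # round and saturate
        out.append(q / scale)                   # back to a real value
    return out

def pick_frac_bits(values, total_bits):
    """Per-layer heuristic: choose the fractional width that minimizes the
    mean absolute quantization error over one layer's values."""
    best, best_err = 0, float("inf")
    for f in range(total_bits):
        q = quantize(values, total_bits, f)
        err = sum(abs(v - w) for v, w in zip(values, q)) / len(values)
        if err < best_err:
            best, best_err = f, err
    return best
```

Applied layer by layer, such a search lets each layer carry only as many fractional bits as its value distribution needs, which is the resource-saving premise of the dynamic fixed-point representation.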
The fourth method addresses the challenge of design portability between two FPGA vendor platforms, namely Intel/Altera and Xilinx. For this challenge, we propose instantiating the device-specific hardware blocks needed in each computational filter, rather than relying on the synthesis tools to infer them, while tracking the similarities and differences between the two platforms. We believe the solutions to these design challenges advance knowledge, as they can benefit designers and other researchers using similar devices or facing similar challenges. Our results demonstrate that the design challenges were successfully addressed, achieving low (30%) resource utilization on the low-end Zedboard and Cyclone V platforms. The design overcomes the limitation of accelerators targeted at high-end platforms that cannot fit on low-end IoT devices. Furthermore, our design shows superior performance, measured in frames/s/W per dollar, compared to optimized high-end designs.
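The modular, parameter-configured filters described above can be sketched as a behavioral model. The sketch below mimics how HDL generics (mask size, stride) might configure one convolutional computational filter; it is a Python analogy for exposition, not the authors' VHDL, and the function name is illustrative.

```python
# Behavioral model of one parameterized convolution filter: the mask size
# and stride play the role of HDL generics configuring the module.

def conv_filter(image, mask, stride):
    """2D convolution over `image` with square `mask` (valid padding),
    skipping positions according to `stride` (stride-aware evaluation)."""
    k = len(mask)
    h, w = len(image), len(image[0])
    out = []
    for i in range(0, h - k + 1, stride):
        row = []
        for j in range(0, w - k + 1, stride):
            acc = 0
            for di in range(k):
                for dj in range(k):
                    acc += image[i + di][j + dj] * mask[di][dj]
            row.append(acc)
        out.append(row)
    return out
```

In the stride-aware view, outputs that the stride would discard are never computed at all, which is how a hardware filter avoids wasting multipliers and cycles on dropped positions.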
FeatherNet: An Accelerated Convolutional Neural Network Design for Resource-constrained FPGAs