Research Article
DOI: 10.1145/3020078.3021740

Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks?

Published: 22 February 2017

ABSTRACT

Current-generation Deep Neural Networks (DNNs), such as AlexNet and VGG, rely heavily on dense floating-point matrix multiplication (GEMM), which maps well to GPUs (regular parallelism, high TFLOP/s). Because of this, GPUs are widely used to accelerate DNNs. Current FPGAs offer superior energy efficiency (Ops/Watt), but they do not match the performance of today's GPUs on DNNs. In this paper, we examine upcoming FPGA technology advances and the rapid pace of innovation in DNN algorithms, and consider whether future high-performance FPGAs will outperform GPUs for next-generation DNNs. The upcoming Intel® 14-nm Stratix® 10 FPGAs will have thousands of hard floating-point units (DSPs) and on-chip RAMs (M20K memory blocks). They will also have high-bandwidth memory (HBM) and improved frequency (HyperFlex™ core architecture). This combination of features brings FPGA raw floating-point performance within striking distance of GPUs. Meanwhile, DNNs are evolving quickly. For example, recent innovations that exploit sparsity (e.g., pruning) and compact data types (e.g., 1-2 bit) yield major leaps in algorithmic efficiency. However, these innovations introduce irregular parallelism on custom data types, which is difficult for GPUs to handle but a great fit for the FPGA's extreme customizability.
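To make the data-type argument concrete, consider the inner kernel of a binarized DNN in the BinaryConnect/XNOR-Net line of work. The following is a minimal software sketch, not the paper's accelerator template: with {-1, +1} values packed one element per bit, a 64-element dot product collapses to a single XOR plus a population count. On an FPGA this synthesizes into a handful of LUTs, while on a GPU it leaves the floating-point datapath idle.

```cpp
#include <bitset>
#include <cstdint>
#include <cstdio>

// Dot product of two 64-element vectors with entries in {-1, +1}, each
// packed one element per bit (bit = 1 encodes +1, bit = 0 encodes -1).
// Matching bits multiply to +1 and differing bits to -1, so
//   dot = 64 - 2 * popcount(a XOR w),
// i.e., one XOR and one popcount replace 64 floating-point MACs.
int binary_dot64(std::uint64_t a, std::uint64_t w) {
    return 64 - 2 * static_cast<int>(std::bitset<64>(a ^ w).count());
}

int main() {
    std::uint64_t act = 0xF0F0F0F0F0F0F0F0ull;  // hypothetical packed activations
    std::uint64_t wgt = 0xFF00FF00FF00FF00ull;  // hypothetical packed weights
    std::printf("dot = %d\n", binary_dot64(act, wgt));  // 32 bits differ, so dot = 0
}
```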

This paper evaluates a selection of emerging DNN algorithms on two generations of Intel FPGAs (Arria 10, Stratix 10) against the latest high-performance NVIDIA Titan X Pascal GPU. We created a customizable DNN accelerator template for FPGAs and used it in our evaluations. First, we study various GEMM operations for next-generation DNNs. Our results show that the Stratix 10 FPGA is 10%, 50%, and 5.4x better in performance (TOP/sec) than the Titan X Pascal GPU on GEMM operations for pruned, Int6, and binarized DNNs, respectively. Then, we present a detailed case study on accelerating Ternary ResNet, which relies on sparse GEMM on 2-bit weights (i.e., weights constrained to 0, +1, or -1) and full-precision neurons. Ternary ResNet accuracy is within ~1% of the full-precision ResNet that won the 2015 ImageNet competition. On Ternary ResNet, the Stratix 10 FPGA delivers 60% better performance than the Titan X Pascal GPU, while being 2.3x better in performance/watt. Our results indicate that FPGAs may become the platform of choice for accelerating next-generation DNNs.
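The computation pattern underlying the case study can be sketched in a few lines. This is illustrative code under our own naming (TernaryNZ and ternary_spmv are hypothetical, not the paper's template): storing only the nonzero ternary weights as (column, sign) pairs turns every multiply into a sign-controlled add on full-precision neurons and skips pruned zeros entirely, exactly the irregular, multiplier-free work that favors FPGA customizability over a GPU's dense GEMM pipeline.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// One nonzero ternary weight: column index plus sign (+1 or -1).
// Zeros are never stored, so pruned weights cost no work at all.
struct TernaryNZ {
    std::uint32_t col;
    std::int8_t   sign;  // +1 or -1
};

// y = W * x for a ternary weight matrix stored row-wise as nonzeros.
// Every "multiply" is a sign-controlled add on a full-precision neuron.
void ternary_spmv(const std::vector<std::vector<TernaryNZ>>& rows,
                  const std::vector<float>& x,
                  std::vector<float>& y) {
    for (std::size_t r = 0; r < rows.size(); ++r) {
        float acc = 0.0f;
        for (const TernaryNZ& nz : rows[r])
            acc += (nz.sign > 0) ? x[nz.col] : -x[nz.col];
        y[r] = acc;
    }
}

int main() {
    // 2x3 ternary matrix [[+1, 0, -1], [0, -1, 0]] times x = [1, 2, 3].
    std::vector<std::vector<TernaryNZ>> W = {
        {{0, +1}, {2, -1}},  // row 0: nonzeros at columns 0 and 2
        {{1, -1}},           // row 1: one nonzero at column 1
    };
    std::vector<float> x = {1.0f, 2.0f, 3.0f}, y(2);
    ternary_spmv(W, x, y);
    std::printf("y = [%g, %g]\n", y[0], y[1]);  // y = [-2, -2]
}
```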


Published in

FPGA '17: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
February 2017, 312 pages
ISBN: 9781450343541
DOI: 10.1145/3020078
Copyright © 2017 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


        Acceptance Rates

FPGA '17 paper acceptance rate: 25 of 101 submissions (25%). Overall acceptance rate: 125 of 627 submissions (20%).
