
Proteus: Exploiting Numerical Precision Variability in Deep Neural Networks

Published: 01 June 2016

Abstract

This work exploits the tolerance of Deep Neural Networks (DNNs) to reduced-precision numerical representations and, specifically, their recently demonstrated ability to tolerate representations of different precision per layer while maintaining accuracy. This flexibility enables improvements over conventional DNN implementations that use a single, uniform representation. This work proposes Proteus, which reduces the data traffic and storage footprint needed by DNNs, resulting in reduced energy and improved area efficiency for DNN implementations. Proteus uses a different representation per layer for both the data (neurons) and the weights (synapses) processed by DNNs. Proteus is a layered extension over existing DNN implementations that converts between the numerical representation used by the DNN execution engines and the shorter, layer-specific fixed-point representation used when reading and writing data values to memory, be it on-chip buffers or off-chip memory. Proteus uses a novel memory layout for DNN data, enabling a simple, low-cost, and low-energy conversion unit.
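The paper targets hardware, but the conversion step at the heart of the scheme can be sketched in software. The Python snippet below is a minimal illustration, under assumed parameters, of per-layer fixed-point conversion: a value in a uniform 16-bit engine format is re-encoded into a shorter, layer-specific width before it is written to memory and expanded again on reads. The function names, bit widths, and the round-and-saturate policy are illustrative assumptions, not the paper's exact design.

```python
# Illustrative sketch only: per-layer fixed-point conversion in the spirit of
# Proteus. Bit widths, function names, and the rounding/saturation policy are
# assumptions for this example, not the paper's exact design.

def quantize_to_layer(value_q16, engine_frac_bits, layer_bits, layer_frac_bits):
    """Convert a value in the engine's 16-bit fixed-point format into a
    shorter, layer-specific fixed-point code (signed, layer_bits wide)."""
    real = value_q16 / (1 << engine_frac_bits)          # decode engine format
    code = int(round(real * (1 << layer_frac_bits)))    # re-encode at layer precision
    lo, hi = -(1 << (layer_bits - 1)), (1 << (layer_bits - 1)) - 1
    return max(lo, min(hi, code))                       # saturate to the layer's range

def dequantize_from_layer(code, layer_frac_bits, engine_frac_bits):
    """Expand a layer-specific code back into the uniform 16-bit engine format."""
    real = code / (1 << layer_frac_bits)
    return int(round(real * (1 << engine_frac_bits)))

# Example: a hypothetical layer that tolerates 10-bit values with 7 fractional bits.
w_engine = 300                                  # weight in an assumed Q8.8 engine format (1.171875)
w_stored = quantize_to_layer(w_engine, engine_frac_bits=8, layer_bits=10, layer_frac_bits=7)
w_restored = dequantize_from_layer(w_stored, layer_frac_bits=7, engine_frac_bits=8)
assert w_restored == w_engine                   # lossless here; in general the narrower format may round
```

In this (hypothetical) layer the stored code occupies 10 bits instead of 16; the paper's contribution includes a memory layout that keeps such variable-width values packed and a low-cost unit that performs this conversion on the path to and from memory.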
We evaluate Proteus as an extension to a state-of-the-art accelerator [7] that uses a uniform 16-bit fixed-point representation. On five popular DNNs, Proteus reduces data traffic among layers by 43% on average while maintaining accuracy within 1%, even when compared to a single-precision floating-point implementation. As a result, Proteus improves energy efficiency by 15% with no performance loss. Proteus also reduces the data footprint by at least 38% and hence the amount of on-chip buffering needed, resulting in an implementation that requires 20% less area overall. These area savings can be used to reduce cost by building smaller chips, to process larger DNNs within the same on-chip area, or to incorporate an additional three execution engines, increasing peak performance bandwidth by 18%.
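As a rough sanity check (the per-layer widths below are hypothetical and the paper's averaging methodology may differ), traffic savings over the uniform 16-bit baseline scale directly with the average per-layer bit width; an average of roughly 9 bits per value corresponds to the reported ~43% reduction.

```python
# Back-of-the-envelope check with hypothetical per-layer widths (not the
# paper's measured values): savings versus a uniform 16-bit baseline.
layer_bits = [10, 8, 9, 9, 10]                # assumed widths for five layers
baseline_bits = 16
avg_bits = sum(layer_bits) / len(layer_bits)  # 9.2 bits on average
savings = 1 - avg_bits / baseline_bits        # ~0.43
print(f"average width {avg_bits:.1f} bits -> about {savings:.0%} less traffic")
```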

References

[1] AMD. AMD Graphics Cores Next (GCN). Whitepaper. https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf, 2012.
[2] S. Anwar, K. Hwang, and W. Sung. Fixed point optimization of deep convolutional neural networks for object recognition. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1131--1135, Apr. 2015.
[3] K. Asanovic and N. Morgan. Using simulations of reduced precision arithmetic to design a neuro-microprocessor. Journal of VLSI Signal Processing, pages 33--44, 1993.
[4] I. Buck. NVIDIA's Next-Gen Pascal GPU Architecture to Provide 10X Speedup for Deep Learning Apps. http://blogs.nvidia.com/blog/2015/03/17/pascal/, 2015.
[5] M. Burrows and D. Wheeler. A Block-sorting Lossless Data Compression Algorithm. Research Report 124, Digital Systems Research Center, 1994.
[6] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, 2014.
[7] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam. DaDianNao: A machine-learning supercomputer. In 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 609--622, Dec. 2014.
[8] D. C. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Mitosis detection in breast cancer histology images with deep neural networks. In MICCAI, 2013.
[9] M. Courbariaux, Y. Bengio, and J. David. Low precision arithmetic for deep learning. CoRR, abs/1412.7024, 2014.
[10] G. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30--42, Jan. 2012.
[11] Z. Deng, C. Xu, Q. Cai, and P. Faraboschi. Reduced-precision memory value approximation for deep learning. 2015.
[12] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA '11), pages 365--376, New York, NY, USA, 2011. ACM.
[13] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524, 2013.
[14] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. CoRR, abs/1502.02551, 2015.
[15] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: Efficient inference engine on compressed deep neural network. CoRR, abs/1602.01528, 2016.
[16] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. CoRR, abs/1510.00149, 2015.
[17] J. Holt and T. Baker. Back propagation simulations using limited precision calculations. In International Joint Conference on Neural Networks (IJCNN-91-Seattle), volume 2, pages 121--126, July 1991.
[18] J. L. Holt and J.-N. Hwang. Finite precision error analysis of neural network hardware implementations. IEEE Transactions on Computers, 42:281--290, 1993.
[19] Y. Jia. Caffe model zoo. https://github.com/BVLC/caffe/wiki/Model-Zoo, 2015.
[20] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[21] P. Judd, J. Albericio, T. Hetherington, T. Aamodt, N. Enright Jerger, R. Urtasun, and A. Moshovos. Reduced-precision strategies for bounded memory in deep neural nets. arXiv:1511.05236 [cs], Nov. 2015.
[22] J. Kim, K. Hwang, and W. Sung. X1000 real-time phoneme recognition VLSI using feed-forward deep neural networks. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7510--7514, May 2014.
[23] A. Krizhevsky. cuda-convnet: High-performance C++/CUDA implementation of convolutional neural networks. https://code.google.com/p/cuda-convnet/.
[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097--1105. Curran Associates, Inc., 2012.
[25] D. Larkin and A. Kinane. Towards hardware acceleration of neuroevolution for multimedia processing applications on mobile devices.
[26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278--2324, Nov. 1998.
[27] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[28] N. Muralimanohar and R. Balasubramonian. CACTI 6.0: A tool to understand large caches.
[29] G. Pekhimenko, V. Seshadri, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry. Base-delta-immediate compression: Practical data compression for on-chip caches. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT '12), pages 377--388, New York, NY, USA, 2012. ACM.
[30] M. Poremba, S. Mittal, D. Li, J. Vetter, and Y. Xie. DESTINY: A tool for modeling emerging 3D NVM and eDRAM caches. In Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1543--1546, Mar. 2015.
[31] R. Presley and R. Haggard. A fixed point implementation of the backpropagation learning algorithm. In Proceedings of the 1994 IEEE Southeastcon, pages 136--138, Apr. 1994.
[32] P. Rosenfeld, E. Cooper-Balis, and B. Jacob. DRAMSim2: A cycle accurate memory system simulator. IEEE Computer Architecture Letters, 10(1):16--19, Jan. 2011.
[33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. arXiv:1409.0575 [cs], Sept. 2014.
[34] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 2015.
[35] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and application to data-parallel distributed training of speech DNNs. In Interspeech 2014, Sept. 2014.
[36] A. Strey and N. Avellana. A new concept for parallel neurocomputer architectures. 1996.
[37] Synopsys. Design Compiler. http://www.synopsys.com/Tools/Implementation/RTLSynthesis/DesignCompiler/Pages/default.aspx.
[38] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[39] Y. Xie and M. A. Jabri. Training algorithms for limited precision feedforward neural networks. Technical report, 1991.


Published In

ICS '16: Proceedings of the 2016 International Conference on Supercomputing
June 2016, 547 pages
ISBN: 9781450343619
DOI: 10.1145/2925426

Publisher
Association for Computing Machinery, New York, NY, United States

