SFU-Driven Transparent Approximation Acceleration on GPUs

Authors:

Shuaiwen Leon Song,

Mark Wijtvliet,

Henk Corporaal

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

Article No.: 15, Pages 1 - 14

https://doi.org/10.1145/2925426.2926255

Published: 01 June 2016

Abstract

Approximate computing, a technique that sacrifices a certain amount of accuracy in exchange for a substantial performance boost or power reduction, is one of the most promising solutions for enabling power control and performance scaling towards exascale. Although most existing approximation designs target emerging data-intensive applications that are comparatively error-tolerant, there is still high demand for accelerating traditional scientific applications (e.g., weather and nuclear simulation), which often make intensive use of transcendental function calls and are very sensitive to accuracy loss. To address this challenge, we focus on a very important but long-ignored approximation unit on today's commercial GPUs, the special-function unit (SFU), and clarify its unique role in accelerating accuracy-sensitive applications in the context of approximate computing. To better understand its features, we conduct a thorough empirical analysis on three generations of NVIDIA GPU architectures, evaluating all the single-precision and double-precision numeric transcendental functions that can be accelerated by SFUs in terms of their performance, accuracy, and power consumption. Based on the insights from this evaluation, we propose a transparent, tractable, and portable design framework for SFU-driven approximate acceleration on GPUs. Our design is software-based and requires no hardware or application modifications. Experimental results on three NVIDIA GPU platforms demonstrate that the proposed framework provides fine-grained tuning of performance and accuracy trade-offs, thus helping applications achieve maximum performance under given accuracy constraints.

Published In

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

June 2016

547 pages

ISBN:9781450343619

DOI:10.1145/2925426

Copyright © 2016 ACM.

© 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

U.S. Department of Energy

Conference

ICS '16

Sponsor:

SIGARCH

ICS '16: 2016 International Conference on Supercomputing

June 1 - 3, 2016

Istanbul, Turkey

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
