Article

A hardware redundancy and recovery mechanism for reliable scientific computation on graphics processors

Authors:

Jeremy W. Sheaffer,

David P. Luebke,

Kevin Skadron

GH '07: Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware

Pages 55 - 64

Published: 04 August 2007

Abstract

General-purpose computation on graphics processors (GPGPU) has evolved rapidly since the introduction of commodity programmable graphics hardware. With the appearance of GPGPU computation-oriented APIs such as AMD's Close to the Metal (CTM) and NVIDIA's Compute Unified Device Architecture (CUDA), we begin to see GPU vendors putting financial stakes into this non-graphics, one-time niche market. Major supercomputing installations are building GPGPU clusters to take advantage of massively parallel floating-point capabilities, and Folding@Home has even released a GPU port of its protein-folding distributed computation client. But for GPGPU to become truly important to the supercomputing community, vendors will have to address the heretofore unimportant reliability concerns of graphics processors. We present a hardware redundancy-based approach to reliability for general-purpose computation on GPUs that requires minimal change to existing GPU architectures. Upon detecting an error, the system invokes an automatic recovery mechanism that recomputes only the erroneous results. Our results show that our technique imposes a performance penalty of less than 1.5x and saves energy for GPGPU, yet is completely transparent to general graphics and does not affect the performance of the games that drive the market.
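The detect-and-recover idea in the abstract can be illustrated with a small software analogy. The sketch below is not the paper's hardware design: it assumes a per-element kernel that is simply run twice, a mismatch between the two copies as the error signal, and a retry loop that recomputes only the mismatching elements. The names `redundant_map` and `FaultyKernel` are hypothetical and exist only for this illustration.

```python
from typing import Callable, List, Sequence, Tuple


def redundant_map(
    kernel: Callable[[float], float],
    inputs: Sequence[float],
    max_retries: int = 3,
) -> Tuple[List[float], List[int]]:
    """Run `kernel` twice per element and recompute only elements that disagree.

    Returns the accepted results and the indices that needed recovery.
    """
    primary = [kernel(x) for x in inputs]  # first execution
    shadow = [kernel(x) for x in inputs]   # redundant execution
    results = list(primary)
    recovered: List[int] = []

    for i, (a, b) in enumerate(zip(primary, shadow)):
        if a == b:
            continue  # both copies agree: accept the result as is
        recovered.append(i)
        # Recovery: re-run only this element until two fresh copies agree.
        for _ in range(max_retries):
            a, b = kernel(inputs[i]), kernel(inputs[i])
            if a == b:
                results[i] = a
                break
        else:
            raise RuntimeError(f"element {i} did not converge after retries")
    return results, recovered


class FaultyKernel:
    """Hypothetical fault injector: corrupts the result of chosen call numbers."""

    def __init__(self, faulty_calls):
        self.faulty_calls = set(faulty_calls)
        self.calls = 0

    def __call__(self, x: float) -> float:
        self.calls += 1
        y = x * x + 1.0
        if self.calls in self.faulty_calls:
            return y + 1.0  # simulated transient error
        return y


if __name__ == "__main__":
    k = FaultyKernel({2})  # corrupt the second kernel invocation
    print(redundant_map(k, [1.0, 2.0, 3.0]), k.calls)
```

With one injected fault, only the affected element is re-run, so the kernel is invoked eight times rather than twelve for a full recomputation of both copies. The cost figures reported in the abstract come from the authors' hardware evaluation and cannot be inferred from this sketch.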

Published In

GH '07: Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware

August 2007

119 pages

ISBN:9781595936257

Editors:
Dieter Fellner, TU Braunschweig, Germany
Stephen Spencer, The University of Washington

Sponsors

SIGGRAPH: ACM Special Interest Group on Computer Graphics and Interactive Techniques
EUROGRAPHICS: The European Association for Computer Graphics

Publisher

Eurographics Association

Goslar, Germany

Publication History

Published: 04 August 2007

Qualifiers

Article

Conference

GH07

Sponsor:

SIGGRAPH
EUROGRAPHICS

GH07: Graphics Hardware

August 4 - 5, 2007

San Diego, California

Acceptance Rates

GH '07 Paper Acceptance Rate 12 of 30 submissions, 40%

Overall Acceptance Rate 50 of 126 submissions, 40%
