DOI: 10.5555/1280094.1280104
Article

A hardware redundancy and recovery mechanism for reliable scientific computation on graphics processors

Published: 04 August 2007

Abstract

General purpose computation on graphics processors (GPGPU) has rapidly evolved since the introduction of commodity programmable graphics hardware. With the appearance of GPGPU computation-oriented APIs such as AMD's Close to the Metal (CTM) and NVIDIA's Compute Unified Device Architecture (CUDA), we begin to see GPU vendors putting financial stakes into this non-graphics, one-time niche market. Major supercomputing installations are building GPGPU clusters to take advantage of massively parallel floating point capabilities, and Folding@Home has even released a GPU port of its protein folding distributed computation client. But in order for GPGPU to truly become important to the supercomputing community, vendors will have to address the heretofore unimportant reliability concerns of graphics processors. We present a hardware redundancy-based approach to reliability for general purpose computation on GPUs that requires minimal change to existing GPU architectures. Upon detecting an error, the system invokes an automatic recovery mechanism that only recomputes erroneous results. Our results show that our technique imposes less than a 1.5x performance penalty and saves energy for GPGPU but is completely transparent to general graphics and does not affect the performance of the games that drive the market.

    Published In

    GH '07: Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware
    August 2007
    119 pages
    ISBN:9781595936257

    Publisher

    Eurographics Association

    Goslar, Germany

    Qualifiers

    • Article

    Conference

    GH07: Graphics Hardware
    August 4-5, 2007
    San Diego, California

    Acceptance Rates

    GH '07 Paper Acceptance Rate 12 of 30 submissions, 40%;
    Overall Acceptance Rate 50 of 126 submissions, 40%

    Cited By

    • (2016) PAIS. Proceedings of the 2016 Conference on Design, Automation & Test in Europe, pp. 1568-1573. DOI: 10.5555/2971808.2972174. Online publication date: 14-Mar-2016
    • (2014) Exploring the Heterogeneous Design Space for both Performance and Reliability. Proceedings of the 51st Annual Design Automation Conference, pp. 1-6. DOI: 10.1145/2593069.2596680. Online publication date: 1-Jun-2014
    • (2013) CPU-GPU hybrid bidiagonal reduction with soft error resilience. Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, pp. 1-5. DOI: 10.1145/2530268.2530270. Online publication date: 17-Nov-2013
    • (2013) Cost-effective soft-error protection for SRAM-based structures in GPGPUs. Proceedings of the ACM International Conference on Computing Frontiers, pp. 1-10. DOI: 10.1145/2482767.2482804. Online publication date: 14-May-2013
    • (2013) Exploiting uniform vector instructions for GPGPU performance, energy efficiency, and opportunistic reliability enhancement. Proceedings of the 27th international ACM conference on International conference on supercomputing, pp. 433-442. DOI: 10.1145/2464996.2465022. Online publication date: 10-Jun-2013
    • (2012) iGPU. Proceedings of the 39th Annual International Symposium on Computer Architecture, pp. 72-83. DOI: 10.5555/2337159.2337168. Online publication date: 9-Jun-2012
    • (2012) RISE. Proceedings of the 21st international conference on Parallel architectures and compilation techniques, pp. 191-200. DOI: 10.1145/2370816.2370846. Online publication date: 19-Sep-2012
    • (2012) iGPU. ACM SIGARCH Computer Architecture News 40:3, pp. 72-83. DOI: 10.1145/2366231.2337168. Online publication date: 9-Jun-2012
    • (2011) Soft error resilient QR factorization for hybrid system with GPGPU. Proceedings of the second workshop on Scalable algorithms for large-scale systems, pp. 11-14. DOI: 10.1145/2133173.2133179. Online publication date: 14-Nov-2011
    • (2010) Hard Data on Soft Errors. Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pp. 691-696. DOI: 10.1109/CCGRID.2010.84. Online publication date: 17-May-2010
