skip to main content
10.1145/2148600.2148625acmconferencesArticle/Chapter ViewAbstractPublication PagesscConference Proceedingsconference-collections
poster

Poster: detection and correction of silent data corruption for large-scale high-performance computing

Published:12 November 2011Publication History

ABSTRACT

Faults have become the norm rather than the exception for high-end computing on clusters with 10s/100s of thousands of cores. Exacerbating this situation, some of these faults will not be detected, manifesting themselves as silent errors that will corrupt memory while applications continue to operate and report incorrect results. This paper introduces RedMPI, an MPI library which resides in the MPI profiling layer. RedMPI is capable of both online detection and correction of soft errors that occur in MPI applications without requiring any modifications to the application source. By providing redundancy, RedMPI is capable of transparently detecting corrupt messages from MPI processes that become faulted during execution. Furthermore, with triple redundancy RedMPI additionally "votes" out MPI messages of a faulted process by replacing corrupted results with corrected results from unfaulted processes. We present an experimental evaluation of RedMPI on an assortment of applications to demonstrate the effectiveness of this approach.

Skip Supplemental Material Section

Supplemental Material

Index Terms

  1. Poster: detection and correction of silent data corruption for large-scale high-performance computing

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SC '11 Companion: Proceedings of the 2011 companion on High Performance Computing Networking, Storage and Analysis Companion
      November 2011
      166 pages
      ISBN:9781450310307
      DOI:10.1145/2148600

      Copyright © 2011 Authors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 12 November 2011

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • poster

      Acceptance Rates

      Overall Acceptance Rate1,516of6,373submissions,24%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader