ABSTRACT
Faults have become the norm rather than the exception for high-end computing on clusters with tens to hundreds of thousands of cores. Exacerbating this situation, some of these faults go undetected and manifest themselves as silent errors that corrupt memory while applications continue to operate and report incorrect results. This paper introduces RedMPI, an MPI library that resides in the MPI profiling layer. RedMPI is capable of both online detection and correction of soft errors that occur in MPI applications, without requiring any modifications to the application source. By providing redundancy, RedMPI transparently detects corrupt messages from MPI processes that become faulted during execution. Furthermore, with triple redundancy, RedMPI additionally "votes" out MPI messages of a faulted process by replacing corrupted results with corrected results from unfaulted processes. We present an experimental evaluation of RedMPI on an assortment of applications to demonstrate the effectiveness of this approach.
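The difference between detection under dual redundancy and correction under triple redundancy can be illustrated with a small sketch. This is not RedMPI's implementation: the function name `vote`, the byte-for-byte comparison of whole payloads, and the return shape are all assumptions made for illustration, and no MPI calls are involved.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def vote(copies: Sequence[bytes]) -> Tuple[Optional[bytes], List[int]]:
    """Compare redundant copies of one message (hypothetical helper).

    Returns (agreed_payload, suspect_indices). agreed_payload is None when
    no strict majority exists: a mismatch is detected but cannot be corrected.
    """
    if not copies:
        raise ValueError("at least one copy is required")

    # Count how many replicas produced each distinct payload.
    payload, votes = Counter(copies).most_common(1)[0]

    if votes * 2 <= len(copies):
        # No strict majority (e.g. two replicas that disagree):
        # every replica is a suspect and nothing can be corrected.
        return None, list(range(len(copies)))

    # A strict majority agrees: the minority replicas are voted out.
    suspects = [i for i, c in enumerate(copies) if c != payload]
    return payload, suspects


# Triple redundancy: replica 1 is corrupted and is outvoted.
corrected, faulty = vote([b"abc", b"abX", b"abc"])

# Dual redundancy: the mismatch is detected, but no correction is possible.
unresolved, suspects = vote([b"abc", b"abX"])
```

With three copies, a single corrupted replica is outvoted and its index is reported; with two copies, a disagreement can only be flagged, which matches the abstract's distinction between detecting corruption and correcting it.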
Index Terms
- Poster: detection and correction of silent data corruption for large-scale high-performance computing