ABSTRACT
Proposed exascale systems will present considerable resiliency challenges. In particular, DRAM soft errors, or bit flips, are expected to increase greatly because of the higher memory density of these systems. Current hardware-based fault-tolerance methods will be unable to cope with the expected soft error rate, so additional software will be needed to address this challenge. In this paper we introduce LIBSDC, a tunable, transparent silent data corruption (SDC) detection and correction library for HPC applications. LIBSDC provides comprehensive SDC protection for program memory by using the MMU to implement on-demand page integrity verification. Experimental benchmarks with Mantevo HPCCG show that, once tuned, LIBSDC achieves SDC protection with less than 100% resource overhead.
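The abstract does not describe how LIBSDC implements on-demand verification, so the following is only a minimal sketch of the general idea, assuming Linux and a single protected page. A page is checksummed and made inaccessible with mprotect; the first access traps through the MMU, and a SIGSEGV handler verifies the checksum before the access is retried. Every name here (sdc_init, sdc_protect, sdc_on_fault, sdc_demo) is hypothetical and is not LIBSDC's API. FNV-1a stands in for whatever integrity code the library actually uses, and the sketch only detects corruption; it does not correct it.

```c
#define _DEFAULT_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical state for one protected page (not LIBSDC's real layout). */
static uint8_t *sdc_page;
static size_t sdc_page_size;
static uint64_t sdc_stored_sum;            /* checksum taken at protect time */
static volatile sig_atomic_t sdc_faults;   /* on-demand verifications so far */
static volatile sig_atomic_t sdc_corrupt;  /* set when a verification fails */

/* FNV-1a, used here only as a stand-in integrity code. */
uint64_t sdc_checksum(const uint8_t *p, size_t n) {
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Runs when the MMU rejects an access to the protected page. */
void sdc_on_fault(int sig, siginfo_t *info, void *ctx) {
    (void)sig;
    (void)ctx;
    uint8_t *addr = (uint8_t *)info->si_addr;
    if (sdc_page == NULL || addr < sdc_page || addr >= sdc_page + sdc_page_size) {
        signal(SIGSEGV, SIG_DFL);  /* not our page: let the real crash happen */
        return;
    }
    /* Assumption: calling mprotect from a handler is acceptable on Linux,
       although POSIX does not list it as async-signal-safe. */
    mprotect(sdc_page, sdc_page_size, PROT_READ | PROT_WRITE);
    sdc_faults++;
    if (sdc_checksum(sdc_page, sdc_page_size) != sdc_stored_sum)
        sdc_corrupt = 1;
    /* On return, the faulting instruction is retried and now succeeds. */
}

/* Installs the handler and maps one page; returns NULL on failure. */
uint8_t *sdc_init(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = sdc_on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return NULL;
    sdc_page_size = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, sdc_page_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    sdc_page = p;
    return sdc_page;
}

/* Records the page checksum, then revokes all access to the page. */
int sdc_protect(void) {
    sdc_stored_sum = sdc_checksum(sdc_page, sdc_page_size);
    return mprotect(sdc_page, sdc_page_size, PROT_NONE);
}

/* Returns 0 if a clean page verifies and an injected bit flip is detected. */
int sdc_demo(void) {
    volatile uint8_t *page = sdc_init();
    if (page == NULL)
        return -1;
    page[0] = 42;
    if (sdc_protect() != 0)
        return -1;
    uint8_t v = page[0];  /* traps; handler verifies, then the load retries */
    if (v != 42 || sdc_faults != 1 || sdc_corrupt != 0)
        return 1;

    /* Inject a bit flip without updating the stored checksum. */
    if (sdc_protect() != 0)
        return -1;
    mprotect(sdc_page, sdc_page_size, PROT_READ | PROT_WRITE);
    sdc_page[0] ^= 0x01;
    mprotect(sdc_page, sdc_page_size, PROT_NONE);
    v = page[0];          /* traps again; the checksum no longer matches */
    (void)v;
    if (sdc_faults != 2 || sdc_corrupt != 1)
        return 2;
    return 0;
}
```

Calling sdc_demo() once from a program's main exercises both the clean path and the corrupted path. Because the page is verified only when it is first touched after being protected, the cost depends on how often pages are re-protected, which is plausibly one of the knobs a tunable library would expose.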
Index Terms
- Poster: a tunable, software-based DRAM error detection and correction library for HPC