ABSTRACT
In large-scale systems with hundreds to thousands of nodes, failures become more frequent because system reliability decreases exponentially as the number of components grows. Many parallel applications that span a large number of nodes are designed to run for days or weeks before completing. Hence, application-level fault resilience is critical to the continued scaling of high performance computing (HPC). In this poster, we present and evaluate FT-Pro, an adaptive fault resilience framework for HPC applications that selects, at runtime, an optimal corrective or preventive action based on failure predictions. The framework is implemented with a production-level MPI package and assessed with a variety of real-world parallel applications on production HPC systems. The experimental results show promising performance improvements of FT-Pro over traditional checkpointing/recovery schemes under a wide range of prediction accuracies and application characteristics.
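The exponential decay of reliability with component count can be illustrated with a minimal sketch. It assumes that node failures are independent and that the application needs every node to stay up. These are simplifying assumptions for illustration, not a model taken from the poster, and the function names and the per-node reliability value are hypothetical.

```python
def system_reliability(node_reliability: float, node_count: int) -> float:
    """Probability that all nodes survive a given interval.

    Assumes independent node failures and that the application needs
    every node, so the per-node reliabilities multiply: R_sys = R ** N.
    """
    return node_reliability ** node_count


def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """System MTBF, assuming independent, exponentially distributed
    node failures: the failure rates add, so MTBF_sys = MTBF_node / N."""
    return node_mtbf_hours / node_count


if __name__ == "__main__":
    # Hypothetical per-node reliability over one interval (illustrative only).
    per_node = 0.9999
    for n in (1, 100, 1_000, 10_000):
        print(f"{n:>6} nodes: R_sys = {system_reliability(per_node, n):.4f}")
```

Under these assumptions, even highly reliable nodes give a low probability of a failure-free run once the node count reaches the thousands, which is why long-running jobs at that scale depend on application-level resilience.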