Improving fault resilience of high performance applications
Source: Conference on High Performance Networking and Computing
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Tampa, Florida
POSTER SESSION: Posters
Article No. 155  
Year of Publication: 2006
ISBN:0-7695-2700-0
Authors
Sponsors
IEEE : Institute of Electrical and Electronics Engineers
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1188455.1188616

ABSTRACT

In large-scale systems with hundreds to thousands of nodes, failures become more frequent because system reliability decreases exponentially as the number of components grows. Many parallel applications that span a large number of nodes are designed to run for days or weeks until completion. Hence, application-level fault resilience is of critical importance to the continued scaling of high performance computing (HPC). In this poster, we present and evaluate FT-Pro, an adaptive fault resilience framework for HPC applications that selects an optimal corrective or preventive action at runtime based on failure predictions. The framework is implemented with a production-level MPI package and assessed with a variety of real-world parallel applications on production HPC systems. The experimental results show promising performance improvements of FT-Pro over traditional checkpointing/recovery schemes across a wide range of prediction accuracies and application characteristics.
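
The two ideas in the abstract, reliability falling exponentially with node count and choosing an action from a failure prediction, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the poster: node failures are modeled as independent and exponentially distributed, the three actions (`skip`, `checkpoint`, `migrate`) and their cost formulas are hypothetical, and the function names are invented here. It is not the FT-Pro implementation.

```python
import math


def system_reliability(node_mtbf_hours: float, nodes: int, run_hours: float) -> float:
    """Probability that no node fails during a run of `run_hours`.

    Assumption: node failures are independent and exponentially distributed,
    so per-node reliability exp(-t / MTBF) multiplies across all nodes.
    """
    return math.exp(-nodes * run_hours / node_mtbf_hours)


def system_mtbf(node_mtbf_hours: float, nodes: int) -> float:
    """Mean time between failures of the whole system under the same assumption."""
    return node_mtbf_hours / nodes


def choose_action(precision: float, lost_work: float, restart_cost: float,
                  checkpoint_cost: float, migration_cost: float) -> str:
    """Pick the action with the lowest expected cost after a failure warning.

    Hypothetical cost model (all costs in the same time unit):
      precision - probability that the predicted failure really occurs
      skip       - do nothing; on failure, redo lost work and restart
      checkpoint - pay for a checkpoint now; on failure, only restart
      migrate    - pay for migration and avoid the predicted failure
    """
    expected_cost = {
        "skip": precision * (lost_work + restart_cost),
        "checkpoint": checkpoint_cost + precision * restart_cost,
        "migrate": migration_cost,
    }
    # min() over the keys, ordered by their expected cost
    return min(expected_cost, key=expected_cost.get)


if __name__ == "__main__":
    # One-year node MTBF, 24-hour job, growing node counts.
    for n in (10, 100, 1000):
        print(n, system_reliability(8760.0, n, 24.0), system_mtbf(8760.0, n))
    print(choose_action(0.9, lost_work=60, restart_cost=5,
                        checkpoint_cost=2, migration_cost=10))
```

Under this toy model a low-precision warning makes `skip` the cheapest choice, while an accurate warning favors a preventive action. This is the kind of trade-off an adaptive scheme has to weigh against always checkpointing at fixed intervals.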