Article

Improving fault resilience of high performance applications

Authors:

Yawei Li,

Zhiling Lan

SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing

Pages 155 - es

https://doi.org/10.1145/1188455.1188616

Published: 11 November 2006


Abstract

In large-scale systems with hundreds to thousands of nodes, failures become more frequent because system reliability decreases exponentially as the component count grows. Many parallel applications span a large number of nodes and are designed to run for days or weeks to completion. Application-level fault resilience is therefore critical to the continued scaling of high performance computing (HPC). In this poster, we present and evaluate FT-Pro, an adaptive fault resilience framework for HPC applications that selects an optimal corrective or preventive action at runtime based on failure predictions. The framework is implemented with a production-level MPI package and assessed with a variety of real-world parallel applications on production HPC systems. Experimental results show a promising performance improvement of FT-Pro over traditional checkpointing/recovery schemes across a wide range of prediction accuracies and application characteristics.
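The scaling argument in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that nodes fail independently with exponentially distributed lifetimes and the same mean time between failures (MTBF); the abstract does not state a failure model, and the function names and numbers below are hypothetical.

```python
import math


def system_reliability(node_mtbf_hours: float, nodes: int, hours: float) -> float:
    """Probability that none of `nodes` nodes fails within `hours`.

    Assumption (not from the poster): independent nodes with exponential
    lifetimes, so R_system(t) = exp(-t / MTBF) ** nodes.
    """
    return math.exp(-nodes * hours / node_mtbf_hours)


def system_mtbf(node_mtbf_hours: float, nodes: int) -> float:
    """Mean time between failures of the whole system under the same assumption."""
    return node_mtbf_hours / nodes


if __name__ == "__main__":
    # Hypothetical figures: each node has an MTBF of 10,000 hours, job runs 24 hours.
    for n in (1, 100, 1000, 10000):
        r = system_reliability(10_000.0, n, 24.0)
        print(f"{n:>6} nodes: P(no failure in 24 h) = {r:.4f}, "
              f"system MTBF = {system_mtbf(10_000.0, n):.1f} h")
```

Under these assumptions the survival probability of a job falls exponentially in the node count, which is why a run lasting days on thousands of nodes is likely to see at least one failure.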




Published In

SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing

November 2006

746 pages

ISBN:0769527000

DOI:10.1145/1188455

Conference Chair:
Barbara Horner-Miller
Arctic Region Supercomputing Center

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Article

Conference

SC '06

Sponsor:

SIGARCH
IEEE-CS

SC '06: International Conference for High Performance Computing, Networking, Storage and Analysis

November 11 - 17, 2006

Tampa, Florida

Acceptance Rates

SC '06 Paper Acceptance Rate 54 of 239 submissions, 23%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%


Article Metrics

Total Citations: 0
Total Downloads: 65
Downloads (Last 12 months): 0
Downloads (Last 6 weeks): 0

Reflects downloads up to 09 Jan 2025
