Improving fault resilience of high performance applications
Recommendations
Fault-Driven Re-Scheduling For Improving System-level Fault Resilience
ICPP '07: Proceedings of the 2007 International Conference on Parallel Processing
The productivity of HPC system is determined not only by their performance, but also by their reliability. The conventional method to limit the impact of failures is checkpointing. However, existing research shows that such a reactive fault tolerance ...
Adaptive Fault Management of Parallel Applications for High-Performance Computing
As the scale of high performance computing (HPC) grows, application fault resilience becomes increasingly important. In this paper, we propose FT-Pro, an adaptive fault management approach that combines the merits of reactive checkpointing and proactive ...
Sponsors
- SIGARCH: ACM Special Interest Group on Computer Architecture
- IEEE-CS: Computer Society
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- Article
Conference
- SIGARCH
- IEEE-CS