Performance evaluation of automatic checkpoint-based fault tolerance for AMPI and Charm++
Source: ACM SIGOPS Operating Systems Review
Volume 40, Issue 2 (April 2006)
COLUMN: Operating and runtime systems for high-end computing systems
Pages: 90-99
Year of Publication: 2006
ISSN: 0163-5980
Authors
Gengbin Zheng  University of Illinois at Urbana-Champaign
Chao Huang  University of Illinois at Urbana-Champaign
Laxmikant V. Kalé  University of Illinois at Urbana-Champaign
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1131322.1131340

ABSTRACT

As the size of high-performance clusters multiplies, the probability of system failure grows substantially, posing an increasingly significant challenge for scalability. Checkpoint-based fault tolerance methods are an effective approach to dealing with faults. With these methods, the state of the entire parallel application is checkpointed to reliable storage. When a fault occurs, the application is restarted from a recent checkpoint. However, the application developer is required to write significant additional code for checkpointing and restarting. This paper describes disk-based and memory-based checkpointing fault tolerance schemes that automate the task of checkpointing and restarting. The schemes also allow the program to be restarted on a different number of processors. These schemes are based on self-checkpointable, migratable objects supported by the Adaptive MPI (AMPI) and Charm++ run-time systems and can be applied to a wide class of applications written using MPI or message-driven languages. We demonstrate the effectiveness of the strategies and evaluate their performance.
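
The checkpoint-and-restart cycle the abstract describes can be illustrated with a minimal single-process sketch. This is not the AMPI or Charm++ API and not the authors' scheme: the function names, the pickle file format, and the `fail_at` fault injector are all assumptions made for illustration. It only shows the general idea of periodically saving state to disk and resuming from the most recent checkpoint after a failure.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    fail_at is a hypothetical fault injector: the run raises at that step
    to mimic a crash. A later call resumes from the last checkpoint.
    """
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated fault")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` once with `fail_at` set and then again without it resumes from the last saved step rather than from zero. Work done after the last checkpoint is lost and recomputed, which is the cost that the checkpoint interval trades against checkpointing overhead.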

