ABSTRACT
As high-performance clusters grow in size, the probability of system failure grows substantially, posing an increasingly significant challenge for scalability. Checkpoint-based fault tolerance methods are an effective way of dealing with faults. With these methods, the state of the entire parallel application is checkpointed to reliable storage. When a fault occurs, the application is restarted from a recent checkpoint. However, the application developer is required to write significant additional code for checkpointing and restarting. This paper describes disk-based and memory-based checkpointing fault tolerance schemes that automate the task of checkpointing and restarting. The schemes also allow the program to be restarted on a different number of processors. These schemes are based on self-checkpointable, migratable objects supported by the Adaptive MPI (AMPI) and Charm++ runtime systems and can be applied to a wide class of applications written using MPI or message-driven languages. We demonstrate the effectiveness of the strategies and evaluate their performance.