ABSTRACT
Large-scale parallel computing increasingly relies on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current techniques to tolerate faults focus on reactive schemes that recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status. Instead of a reactive scheme for fault tolerance (FT), we promote a proactive one in which processes automatically migrate from "unhealthy" nodes to healthy ones. Our approach relies on operating system virtualization techniques exemplified by, but not limited to, Xen. This paper contributes an automatic and transparent mechanism for proactive FT for arbitrary MPI applications. It leverages virtualization techniques combined with health monitoring and load-based migration. We exploit Xen's live migration mechanism for a guest operating system (OS) to migrate an MPI task from a health-deteriorating node to a healthy one without stopping the MPI task during most of the migration. Our proactive FT daemon orchestrates the tasks of health monitoring, load determination, and initiation of guest OS migration. Experimental results demonstrate that live migration hides migration costs and limits the overhead to only a few seconds, making it an attractive approach to realize FT in HPC systems. Overall, our enhancements make proactive FT a valuable asset for long-running MPI applications, one that is complementary to reactive FT using full checkpoint/restart schemes, since checkpoint frequencies can be reduced as fewer unanticipated failures are encountered. In the context of OS virtualization, we believe that this is the first comprehensive study of proactive fault tolerance where live migration is actually triggered by health monitoring.
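As a rough illustration of the daemon's three tasks (health monitoring, load determination, and initiating guest OS migration), the following minimal Python sketch plans migrations from unhealthy nodes to the least-loaded healthy spare. It is not the paper's implementation: the `NodeStatus` record, the temperature threshold, and the load metric are all hypothetical, and the sketch only builds an `xm migrate --live` command line rather than executing it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class NodeStatus:
    """Hypothetical per-node snapshot gathered by a monitoring daemon."""
    name: str
    temperature_c: float          # assumed health indicator (e.g. a sensor reading)
    load: float                   # assumed load metric; lower means less busy
    guest: Optional[str] = None   # name of the guest OS hosted here, if any


# Hypothetical threshold; a real deployment would tune this per sensor.
TEMP_THRESHOLD_C = 70.0


def is_healthy(node: NodeStatus) -> bool:
    """Treat a node as healthy while its reading stays below the threshold."""
    return node.temperature_c < TEMP_THRESHOLD_C


def plan_migrations(nodes: List[NodeStatus]) -> List[Tuple[str, str, str]]:
    """Return (guest, source, target) triples for guests on unhealthy nodes.

    Targets are healthy nodes without a guest, taken in order of increasing load.
    """
    spares = sorted(
        (n for n in nodes if is_healthy(n) and n.guest is None),
        key=lambda n: n.load,
    )
    plan = []
    for node in nodes:
        if node.guest is not None and not is_healthy(node) and spares:
            target = spares.pop(0)
            plan.append((node.guest, node.name, target.name))
    return plan


def migrate_command(guest: str, target: str) -> List[str]:
    """Build (but do not run) a Xen live-migration command line."""
    return ["xm", "migrate", "--live", guest, target]


if __name__ == "__main__":
    cluster = [
        NodeStatus("n1", 85.0, 0.9, guest="vm1"),
        NodeStatus("n2", 50.0, 0.5),
        NodeStatus("n3", 45.0, 0.1),
        NodeStatus("n4", 40.0, 0.2, guest="vm4"),
    ]
    for guest, source, target in plan_migrations(cluster):
        print(source, "->", target, migrate_command(guest, target))
```

Choosing the least-loaded spare mirrors the load-based target selection described above; an actual daemon would poll real sensors and re-evaluate the plan periodically.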