Fast and transparent recovery for continuous availability of cluster-based servers
Source: Principles and Practice of Parallel Programming
Proceedings of the eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
New York, New York, USA
Session: Potpourri
Pages: 221-229
Year of Publication: 2006
ISBN: 1-59593-189-9
Authors
Rosalia Christodoulopoulou  University of Toronto, Canada
Kaloian Manassiev  University of Toronto, Canada
Angelos Bilas  University of Crete, Greece
Cristiana Amza  University of Toronto, Canada
Sponsors
ACM: Association for Computing Machinery
SIGPLAN: ACM Special Interest Group on Programming Languages
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1122971.1123005

ABSTRACT

Recently there has been renewed interest in building reliable servers that support continuous application operation. Besides keeping system state consistent after a failure, one of the main challenges in achieving continuous operation is providing fast reconfiguration. The complexity and overhead of the failure reconfiguration mechanisms depend on the type of platform used as a server and the types of applications that need to be supported. In this paper we focus on providing support for shared-memory applications running on clusters of commodity nodes and interconnects. Achieving continuous operation for shared-memory applications on clusters presents two main challenges: (a) the fault tolerance mechanisms employed should be transparent to applications and should have low overhead during failure-free execution; (b) when failures occur, reconfiguration should proceed with minimum application disruption and without requiring the full recovery of the failed node.

In this work we examine in detail the latter, (b), the failure reconfiguration path. We use a previously developed system [8] that achieves (a) by dynamically replicating data to the memories of multiple nodes during execution. We examine how the runtime system can achieve minimum application interruption when failures occur. We present the design and implementation of FineFRC (Fine-grained Failure Reconfiguration on Clusters), a runtime system for achieving continuous operation of shared-memory applications on commodity clusters without requiring application instrumentation or human intervention. We present results using a working 16-processor system that achieves sub-second failure reconfiguration times.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
[3] J. Bartlett, W. Bartlett, R. Carr, D. Garcia, J. G. R. Horst, R. Jardine, D. Lenoski, and D. McGuire. Fault tolerance in Tandem computer systems. Technical Report TR-90.5, Tandem, 1990.
[5] A. Bilas, C. Liao, and J. P. Singh. Accelerating shared virtual memory using commodity NI support to avoid asynchronous message handling. In The 26th Int'l Symposium on Computer Architecture, May 1999.
[9] V. S. Corp. Veritas FirstWatch. http://www.veritas.com.
[11] C. Dubnicki, A. Bilas, Y. Chen, S. Damianakis, and K. Li. VMMC-2: Efficient support for reliable, connection-oriented communication. In Proc. of the Hot Interconnects Symposium V, Aug. 1997.
[13] IBM. High availability with DB2 UDB and Steeleye Lifekeeper. IBM Center for Advanced Studies Conference (CASCON): Technology Showcase, Toronto, Canada, Oct. 2003.
[16] P. Keleher, S. Dwarkadas, A. L. Cox, and W. Zwaenepoel. TreadMarks: Distributed shared memory on standard workstations and operating systems. In Proceedings of the 1994 Winter Usenix Conference, pages 115–131, Jan. 1994.
[18] J. Kim and N. Vaidya. Analysis of failure recovery schemes for distributed shared-memory systems. IEEE Computers and Digital Techniques, 146(3), May 1999.
[19] K. Li. Ivy: A shared virtual memory system for parallel computing. Proceedings of the 1988 International Conference on Parallel Processing, 2:94–101, Aug. 1988.
[22] NCR Lifekeeper. http://www.ncr.com.
[26] M. Stumm and S. Zhou. Fault tolerant distributed shared memory algorithms. In Proc. of the 2nd IEEE Symposium on Parallel and Distributed Processing, pages 719–724, Dec. 1990.
[28] VMware. VMware ESX Server Storage Area Networks. http://www.vmware.com/, 2003.
[30] S. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. Methodological considerations and characterization of the SPLASH-2 parallel application suite. In Proceedings of the 23rd Int'l Symposium on Computer Architecture, May 1995.
[33] Transaction Processing Performance Council. TPC Benchmark B Standard Specification, Aug. 1990.
[34] Transaction Processing Performance Council. TPC Benchmark C Standard Specification, Aug. 1996.

