SFT: scalable fault tolerance
Source: ACM SIGOPS Operating Systems Review
Volume 40, Issue 2 (April 2006)
COLUMN: Operating and runtime systems for high-end computing systems
Pages: 55-62
Year of Publication: 2006
ISSN: 0163-5980
Authors
Fabrizio Petrini, Pacific Northwest National Laboratory, Richland, WA
Jarek Nieplocha, Pacific Northwest National Laboratory, Richland, WA
Vinod Tipparaju, Pacific Northwest National Laboratory, Richland, WA
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1131322.1131336

ABSTRACT

In this paper we present a new technology that we are developing within the SFT (Scalable Fault Tolerance) FastOS project, which seeks to implement fault tolerance at the operating system level. Major design goals include dynamic reallocation of resources to allow execution to continue in the presence of hardware failures, very high scalability, high efficiency (low overhead), and transparency, meaning that user applications require no changes. Our technology is based on a global coordination mechanism that enforces transparent recovery lines in the system, and on TICK, a lightweight, incremental checkpointing software architecture implemented as a Linux kernel module. TICK is completely user-transparent and does not require any changes to user code or system libraries. It is highly responsive: an interrupt, such as a timer interrupt, can trigger a checkpoint in as little as 2.5 μs. It supports incremental and full checkpoints with minimal overhead: less than 6% with full checkpointing to disk performed as frequently as once per minute.
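
The difference between full and incremental checkpoints can be illustrated with a small user-space sketch. This is not TICK's implementation, which the abstract describes as a Linux kernel module. The class `PagedMemory`, the function `restore`, and the 4096-byte page size are hypothetical names and values chosen for illustration only. The sketch assumes that a full checkpoint copies every page, while an incremental checkpoint copies only the pages written since the previous checkpoint.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (illustrative, not from the paper)


class PagedMemory:
    """Toy address space that records which pages were written since the last checkpoint."""

    def __init__(self, num_pages):
        self.pages = [bytearray(PAGE_SIZE) for _ in range(num_pages)]
        self.dirty = set()  # indices of pages modified since the last checkpoint

    def write(self, addr, data):
        # Write byte by byte so that a write may span a page boundary.
        for offset, byte in enumerate(data):
            page, pos = divmod(addr + offset, PAGE_SIZE)
            self.pages[page][pos] = byte
            self.dirty.add(page)

    def full_checkpoint(self):
        # Copy every page, then reset dirty tracking.
        snapshot = {i: bytes(p) for i, p in enumerate(self.pages)}
        self.dirty.clear()
        return snapshot

    def incremental_checkpoint(self):
        # Copy only the pages modified since the previous checkpoint.
        delta = {i: bytes(self.pages[i]) for i in sorted(self.dirty)}
        self.dirty.clear()
        return delta


def restore(num_pages, full, deltas):
    """Rebuild memory from one full checkpoint plus later incremental ones, applied in order."""
    state = dict(full)
    for delta in deltas:
        state.update(delta)  # later deltas overwrite older page contents
    mem = PagedMemory(num_pages)
    for i, content in state.items():
        mem.pages[i][:] = content
    return mem
```

In this sketch, recovery replays the most recent full checkpoint followed by each later incremental checkpoint in order, so the cost of an incremental checkpoint scales with the number of modified pages rather than with total memory size.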


