ABSTRACT
Data processing systems are subject to both hardware and system software failures. In first- and second-generation systems, the impact of such failures was typically limited by the scope of the system itself to the one program, or the few programs, operating at the time. Resuming from the beginning of the program or from a preplanned checkpoint typically constituted complete recovery.