ABSTRACT
Crash failure detection is a key topic in fault tolerance, and it is important to be able to assess the quality of service (QoS) of failure detection services. Most previous work on crash failure detectors has been based on the crash-stop or fail-free assumption. In this paper we study and model a crash-recovery service, that is, a service that can recover from the crash state. We analyse the QoS bounds of a failure detection service for such a crash-recovery service. Our results show that the dependability metrics of the monitored service affect the QoS of the failure detection service. Simulation results corroborate the analysis and show bounds on the QoS.
Index Terms
- Evaluation of the QoS of crash-recovery failure detection