DOI: 10.1145/1244002.1244127

Evaluation of the QoS of crash-recovery failure detection

Published: 11 March 2007

ABSTRACT

Crash failure detection is a key topic in fault tolerance, and it is important to be able to assess the quality of service (QoS) of failure detection services. Most previous work on crash failure detectors has been based on the crash-stop or fail-free assumption. In this paper we study and model a crash-recovery service, that is, a service that is able to recover from the crash state. We analyse the QoS bounds for such a crash-recovery failure detection service. Our results show that the dependability metrics of the monitored service affect the QoS of the failure detection service. Simulation results corroborate the analysis and show the bounds on the QoS.
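The claim that the monitored service's dependability affects the detector's QoS can be illustrated with a toy simulation. The sketch below is not the paper's model. It assumes a heartbeat-style detector with instantaneous, loss-free delivery, exponentially distributed up and down times, and uses mean time to failure (`mttf`) and mean time to repair (`mttr`) as stand-ins for the dependability metrics. The function name `simulate` and all parameters are hypothetical, chosen only for illustration.

```python
import random


def simulate(mttf, mttr, eta, timeout, horizon, seed=0):
    """Toy crash-recovery monitoring run (illustrative assumptions only).

    Time advances in integer ticks. While the service is up it emits a
    heartbeat every `eta` ticks; the detector trusts the service if the
    last heartbeat is less than `timeout` ticks old. Returns the fraction
    of ticks in which the detector's output matches the true state.
    """
    rng = random.Random(seed)
    up = True
    # Assumed: exponentially distributed sojourn times, at least one tick.
    remaining = max(1, round(rng.expovariate(1.0 / mttf)))
    last_heartbeat = None
    correct = 0
    for tick in range(horizon):
        if remaining == 0:
            # Crash or recovery: flip the state and draw a new duration.
            up = not up
            mean = mttf if up else mttr
            remaining = max(1, round(rng.expovariate(1.0 / mean)))
        remaining -= 1
        if up and tick % eta == 0:
            last_heartbeat = tick  # assumed zero delay, no message loss
        trusted = (last_heartbeat is not None
                   and tick - last_heartbeat < timeout)
        if trusted == up:
            correct += 1
    return correct / horizon


if __name__ == "__main__":
    # Same detector settings, two differently dependable services.
    for mttf in (50, 5000):
        accuracy = simulate(mttf=mttf, mttr=20, eta=5, timeout=10,
                            horizon=200_000)
        print(f"mttf={mttf}: detector correct {accuracy:.3f} of the time")
```

In this sketch the detector is wrong for a short window after each crash (until the timeout expires) and after each recovery (until the next heartbeat), so a service that crashes and recovers more often leaves the detector wrong for a larger share of the time, even though the detector's own parameters are unchanged.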


Published in

SAC '07: Proceedings of the 2007 ACM Symposium on Applied Computing
March 2007
1688 pages
ISBN: 1595934804
DOI: 10.1145/1244002

Copyright © 2007 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Overall acceptance rate: 1,650 of 6,669 submissions, 25%
