Article

Evaluation of the QoS of crash-recovery failure detection

Authors:
Tiejun Ma, The University of Edinburgh, Edinburgh, UK
Jane Hillston, The University of Edinburgh, Edinburgh, UK
Stuart Anderson, The University of Edinburgh, Edinburgh, UK
SAC '07: Proceedings of the 2007 ACM symposium on Applied computing, March 2007, Pages 538–542
https://doi.org/10.1145/1244002.1244127

Published: 11 March 2007

ABSTRACT

Crash failure detection is a key topic in fault tolerance, and it is important to be able to assess the QoS of failure detection services. Most previous work on crash failure detectors has been based on the crash-stop or fail-free assumption. In this paper we study and model a crash-recovery service, that is, a service that can recover from the crash state. We analyse the QoS bounds for a failure detection service that monitors such a crash-recovery service. Our results show that the dependability metrics of the monitored service affect the QoS of the failure detection service. Simulation results corroborate the analysis and illustrate the bounds on the QoS.
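
To illustrate why the dependability of the monitored service can affect failure detection QoS, here is a minimal sketch. It is not the model from the paper. It assumes, purely for illustration, that downtime is exponentially distributed with mean `mttr`, and that a crash counts as detected only if the service stays down longer than a fixed `detection_window`. All function and parameter names are hypothetical.

```python
import math
import random


def availability(mttf: float, mttr: float) -> float:
    """Steady-state availability of a service alternating between up and down."""
    return mttf / (mttf + mttr)


def detected_fraction_analytic(mttr: float, detection_window: float) -> float:
    """P(downtime > detection_window), assuming exponential downtime with mean mttr."""
    return math.exp(-detection_window / mttr)


def detected_fraction_simulated(mttr: float, detection_window: float,
                                crashes: int = 100_000, seed: int = 1) -> float:
    """Monte Carlo estimate of the same quantity (assumed model, see lead-in)."""
    rng = random.Random(seed)
    detected = sum(
        1 for _ in range(crashes)
        # expovariate takes the rate, i.e. 1 / mean downtime
        if rng.expovariate(1.0 / mttr) > detection_window
    )
    return detected / crashes


if __name__ == "__main__":
    window = 1.0  # hypothetical detection window, in the same time unit as mttr
    for mttr in (0.5, 2.0, 10.0):
        print(mttr,
              round(detected_fraction_analytic(mttr, window), 3),
              round(detected_fraction_simulated(mttr, window), 3))
```

Under these assumptions, a service that recovers quickly (small `mttr`) has a larger share of crashes that end before the detector can notice them, which is one simple way the monitored service's dependability metrics can show up in the detector's QoS.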


Index Terms

Software and its engineering > Software organization and properties > Software system structures > Distributed systems organizing principles > Organizing principles for web applications


Published in
SAC '07: Proceedings of the 2007 ACM symposium on Applied computing
March 2007
1688 pages
ISBN: 1595934804
DOI: 10.1145/1244002
Conference Chairs:
Yookun Cho, Seoul National University, Seoul, Korea
Roger L. Wainwright, University of Tulsa, Tulsa, Oklahoma
Hisham M. Haddad, Kennesaw State University, Kennesaw, Georgia
Sung Y. Shin, South Dakota State University, Brookings, South Dakota
Program Chair:
Yong Wan Koo, The University of Suwon, Gyeonggi-do, Korea
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
dependability
failure detection
fault tolerance
quality of services
reliability
web services
Acceptance Rates
Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%