Performance analysis of checkpointing strategies

Authors:
Asser N. Tantawi, IBM Thomas J. Watson Research Center, Yorktown Heights, New York
Manfred Ruschitzka, Department of Electrical and Computer Engineering, University of California, Davis, California

SIGMETRICS '83: Proceedings of the 1983 ACM SIGMETRICS conference on Measurement and modeling of computer systems, August 1983. https://doi.org/10.1145/800040.801400

Published: 29 August 1983

ABSTRACT

A widely used error recovery technique in database systems is rollback and recovery. This technique periodically saves the state of the system and records all activities on a reliable log tape. The operation of saving the system state is called checkpointing, and the elapsed time between two consecutive checkpointing operations is called the checkpointing interval. When the system fails, the recovery process uses the log tape and the state saved at the most recent checkpoint to bring the system to the correct state that preceded the failure. This process is called error recovery and consists of loading the most recent state and then reprocessing all the activities, stored on the log tape, that took place after the most recent checkpoint and before the failure.
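The rollback and recovery cycle described above can be sketched in a few lines of Python. This is only an illustrative toy, not the system studied in the paper: the class `RecoverableStore`, its method names, and the use of an in-memory list in place of the log tape are all assumptions made for the example.

```python
import copy


class RecoverableStore:
    """Toy key-value store with checkpointing and a redo log (hypothetical)."""

    def __init__(self):
        self.state = {}       # volatile state, lost when the system fails
        self.checkpoint = {}  # last saved state (assumed to be on reliable storage)
        self.log = []         # activities since the last checkpoint (the "log tape")

    def apply(self, key, value):
        self.log.append((key, value))  # record the activity before applying it
        self.state[key] = value

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()  # earlier activities are now covered by the checkpoint

    def fail(self):
        self.state = None  # simulate loss of the volatile state

    def recover(self):
        # Step 1: load the most recent saved state.
        self.state = copy.deepcopy(self.checkpoint)
        # Step 2: reprocess every activity logged since that checkpoint.
        for key, value in self.log:
            self.state[key] = value
        return len(self.log)  # number of reprocessed activities


store = RecoverableStore()
store.apply("a", 1)
store.take_checkpoint()
store.apply("b", 2)
store.apply("a", 3)
store.fail()
reprocessed = store.recover()
print(store.state, reprocessed)  # {'a': 3, 'b': 2} 2
```

The number of activities replayed in `recover` plays the role of the reprocessing work: the longer the time since the last checkpoint, the more there is to redo after a failure.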

Earlier models of rollback and recovery assumed Poisson failures and fixed (or exponential) checkpointing intervals. Extending these models, we consider general failure distributions. We also allow checkpointing intervals to depend on the reprocessing time (the time elapsed between the most recent checkpoint prior to failure and the time of failure) and on the failure distribution. Furthermore, failures may occur during checkpointing and error recovery. Our general model unifies a variety of models that have previously been investigated.

We denote by F_i and t(F_i), i = 1, 2, ..., the i-th failure that occurs during normal processing (not during error recovery) and the time of its occurrence, respectively. We refer to the time period between t(F_i) and t(F_{i+1}) as the i-th cycle, whose length is L_i = t(F_{i+1}) − t(F_i), i = 1, 2, .... Each cycle consists of two portions: the total error recovery time and the normal processing time. The reprocessing time associated with failure F_i is denoted by Y_{i−1}. Since the variables of the i-th cycle depend on at most one variable of the (i − 1)-st cycle, namely Y_{i−1}, the stochastic process of the reprocessing time {Y_i; i ≥ 0} is a Markov process. We obtain the transition probability density function and the stationary distribution of this process.

The performance of the system is measured by its availability, that is, the fraction of time the system is not checkpointing or recovering from errors. In equilibrium, the system availability is the ratio of the mean production time (normal processing time excluding checkpointing time) during a cycle to the mean length of the cycle. We obtain a general expression for the system availability in our general model.
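To make the availability measure concrete, the following is a minimal Monte Carlo sketch for fixed (equidistant) checkpointing intervals. It is not the analytical model of the paper. The function `simulate_availability` and all of its parameters are hypothetical, and the sketch assumes that the failure clock restarts after every failure, that reprocessing takes as long as the production time since the last completed checkpoint, that a failure during checkpointing or recovery forces the recovery to start over, and that the interrupted checkpoint is retried after recovery.

```python
import math
import random


def simulate_availability(interval, ckpt_time, load_time, sample_ttf, horizon, seed=0):
    """Estimate availability = production time / total time over one long run."""
    rng = random.Random(seed)
    clock = 0.0        # total elapsed time
    production = 0.0   # normal processing time, excluding checkpointing
    since_ckpt = 0.0   # production time since the last completed checkpoint
    ttf = sample_ttf(rng)  # time remaining until the next failure

    while clock < horizon:
        need = interval - since_ckpt
        if ttf >= need:
            # Normal processing completes; a checkpoint is now due.
            clock += need
            production += need
            ttf -= need
            since_ckpt = interval
            if ttf >= ckpt_time:
                clock += ckpt_time
                ttf -= ckpt_time
                since_ckpt = 0.0
                continue
            clock += ttf  # failure during checkpointing; checkpoint not completed
        else:
            # Failure during normal processing.
            clock += ttf
            production += ttf
            since_ckpt += ttf
        # Error recovery: load the saved state, then reprocess since_ckpt.
        while True:
            ttf = sample_ttf(rng)  # assumption: failure clock restarts here
            recovery = load_time + since_ckpt
            if ttf >= recovery:
                clock += recovery
                ttf -= recovery
                break
            clock += ttf  # failure during recovery; start recovery again
    return production / clock


def weibull_sampler(mean, shape):
    """Return a sampler of Weibull times to failure with the given mean."""
    scale = mean / math.gamma(1.0 + 1.0 / shape)
    return lambda rng: rng.weibullvariate(scale, shape)


for shape in (0.5, 1.0, 2.0):
    a = simulate_availability(
        interval=20.0, ckpt_time=1.0, load_time=1.0,
        sample_ttf=weibull_sampler(mean=100.0, shape=shape), horizon=200_000.0,
    )
    print(f"shape={shape}: estimated availability {a:.3f}")
```

Because the assumptions here are simpler than, and possibly different from, those of the general model, the printed estimates should be read as an illustration of how availability is counted rather than as a reproduction of the paper's results.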

The checkpointing strategy is characterized by the sequence of checkpointing intervals. For the well-known equidistant checkpointing strategy, in which the checkpointing intervals are constant, we find that the resulting system availability depends only on the mean of the failure distribution. We define a checkpointing strategy as failure-dependent if the sequence of checkpointing intervals depends on the failure distribution. Checkpointing strategies that result in a checkpointing operation immediately after error recovery are called reprocessing-independent strategies. We then introduce a novel checkpointing strategy, the equicost strategy, which is failure-dependent and reprocessing-independent. Under this strategy, a checkpointing operation is performed whenever the mean reprocessing cost equals the mean checkpointing cost. Interestingly, the equicost strategy leads to fixed checkpointing intervals for Poisson failures. We compare the maximum system availability resulting from the equidistant and the equicost checkpointing strategies under Weibull distributions, which are good approximations of actual failure distributions. Computational results based on Weibull failure distributions (with both increasing and decreasing failure rates) show that the equicost strategy achieves higher system availability than the equidistant strategy, which is known to be optimal under Poisson failures.
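The reference to increasing and decreasing failure rates can be illustrated with the standard Weibull hazard function h(t) = (k/λ)(t/λ)^(k−1), where k is the shape and λ the scale. The short sketch below is an aid to reading the abstract, not material from the paper, and the helper name `weibull_hazard` is hypothetical. A shape below 1 gives a decreasing failure rate, a shape above 1 an increasing one, and a shape of exactly 1 reduces to the constant rate of Poisson failures.

```python
def weibull_hazard(t, shape, scale):
    """Failure rate of a Weibull distribution at age t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


ages = [10.0, 50.0, 100.0, 200.0]
for shape in (0.5, 1.0, 2.0):
    rates = [weibull_hazard(t, shape, scale=100.0) for t in ages]
    print(f"shape={shape}:", ", ".join(f"{r:.4f}" for r in rates))
```

With a non-constant failure rate, the risk of losing work changes with the time since the last failure, which is the situation in which a failure-dependent strategy can choose its intervals differently from an equidistant one.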

Index Terms

Performance analysis of checkpointing strategies
1. General and reference
  1. Cross-computing tools and techniques
    1. Performance
2. Software and its engineering
  1. Software organization and properties
    1. Extra-functional properties
      1. Software fault tolerance
        Checkpoint / restart
      2. Software performance

Published in
SIGMETRICS '83: Proceedings of the 1983 ACM SIGMETRICS conference on Measurement and modeling of computer systems
August 1983
286 pages
ISBN: 0897911121
DOI: 10.1145/800040
Chairmen: Herbert D. Schwetman, Steven C. Bruell, Larry W. Dowdy
Copyright © 1983 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 459 of 2,691 submissions, 17%