skip to main content
10.1145/800040.801400acmconferencesArticle/Chapter ViewAbstractPublication PagesmetricsConference Proceedingsconference-collections
Article
Free Access

Performance analysis of checkpointing strategies

Authors Info & Claims
Published:29 August 1983Publication History

ABSTRACT

A widely used error recovery technique in database systems is the rollback and recovery technique. This technique saves periodically the state of the system and records all activities on a reliable log tape. The operation of saving the system state is called checkpointing. The elapsed time between two consecutive checkpointing operations is called checkpointing interval. When the system fails, the recovery process uses the log tape and the state saved at the most recent checkpoint to bring the system to the correct state that preceded the failure. This process is called error recovery and consists of loading the most recent state and then reprocessing all the activities, stored on the log tape, that took place since the most recent checkpoint and prior to failure.

Former models of rollback and recovery assumed Poisson failures and fixed (or exponential) checkpointing intervals. Extending these models, we consider general failure distributions. We also allow checkpointing intervals to depend on the reprocessing time (the time elapsed between the most recent checkpoint prior to failure and the time of failure) and the failure distribution. Furthermore, failures may occur during the checkpointing and error recovery. Our general model unifies a variety of models that have previously been investigated.

We denote by Fi; and t(Fi), i = 1, 2, ..., the ith failure that occurs during normal processing (not during error recovery) and the time of its occurrence, respectively. We refer to the time period Li = t(Fi+1) − t(Fi), i = 1, 2, ..., as the ith cycle whose length is Li. It consists of two portions: the total error recovery time and the normal processing time. The reprocessing time associated with failure Fi is denoted by Yi−1. Since the variables of the ith cycle depend at most on one variable of the (i − 1)st cycle, namely Yi−1, the stochastic process of the reprocessing time {Yi; i≥0} is a Markov process. We obtain the transition probability density function and the stationary distribution of this process.

The performance of the system is measured by the availability, the fraction of time the system is not checkpointing or recovering from errors. In equilibrium, the system availability is expressed as the ratio of the mean production time (normal processing time excluding checkpointing time) during a cycle and the mean length of the cycle. We obtain a general expression for the system availability in our general model.

The checkpointing strategy is characterized by the sequence of checkpointing intervals. For the well-known equidistant checkpointing strategy, in which the checkpointing intervals are constant, we find that the resulting system availability depends only on the mean of the failure distribution. We define a checkpointing strategy as failure-dependent if the sequence of checkpointing intervals depends on the failure distribution. Checkpointing strategies that result in a checkpointing operation immediately after error recovery are called reprocessing-independent strategies. We then introduce a novel checkpointing strategy, the equicost strategy, which is failure-dependent and reprocessing-independent. This strategy suggests that a checkpointing operation is to be performed whenever the mean reprocessing cost equals the mean checkpointing cost. Interestingly, the equicost strategy leads to fixed checkpointing intervals for Poisson failures. We compare the maximum system availability resulting from the equidistant and the equicost checkpointing strategies under Weibull distributions which are good approximations of actual failure distributions. Computational results based on Weibull failure distributions (both increasing and decreasing failure rates) show that the equicost strategy achieves higher system availability than the equidistant strategy which is known to be optimal under Poisson failures.

Index Terms

  1. Performance analysis of checkpointing strategies

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          SIGMETRICS '83: Proceedings of the 1983 ACM SIGMETRICS conference on Measurement and modeling of computer systems
          August 1983
          286 pages
          ISBN:0897911121
          DOI:10.1145/800040

          Copyright © 1983 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 29 August 1983

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • Article

          Acceptance Rates

          Overall Acceptance Rate459of2,691submissions,17%

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader