Abstract
The state machine approach is a general method for implementing fault-tolerant services in distributed systems. This paper reviews the approach and describes protocols for two different failure models—Byzantine and fail stop. Systems reconfiguration techniques for removing faulty components and integrating repaired components are also discussed.
- AIZIKOWITZ, J. 1989. Designing distributed services using refinement mappings. Ph.D. dissertation, Computer Science Dept., Cornell Univ., Ithaca, New York. Also available as Tech. Rep. TR 89-1040. Google Scholar
- BERNSTEIN, A. J. 1985. A loosely coupled system for reliably storing data. IEEE Trans. Softw. Eng. SE-11, 5 (May), 446-454.Google Scholar
- BIRMAN, K. P. 1985. Replication and fault tolerance in the ISIS system. In Proceedings of the lOth A CM Symposium on Operating Systems Principles (Orcas Island, Washington, Dec. 1985), A CM, pp. 79-86. Google Scholar
- BIRMAN, K. P., AND JOSEPH, T. 1987. Reliable communication in the presence of failures. ACM TOCS 5, 1 (Feb. 1987), 47-76. Google Scholar
- CRISTIAN, F., AGHILI, H., STRONG, H. R., AND DOLEV, D. 1985. Atomic broadcast: From simple message diffusion to Byzantine agreement. In Proceedings of the 15th International Conference on Fault-tolerant Computing (Ann Arbor, Mich., June 1985), IEEE Computer Society.Google Scholar
- DIJKSTRA, E. W. 1974. Self stabilization in spite of distributed control. Commun. A CM I7, 11 (Nov.), 643-644. Google Scholar
- FISCHER, M., LYNCH, N., AND PATERSON, M. 1985. Impossibility of distributed consensus with one faulty process, d. ACM 32, 2 (Apr. 1986), 374-382. Google Scholar
- GARCIA-MOLINA, H., PITTELLI, F., AND DAVIDSON, S. 1986. Application of Byzantine agreement in database systems. ACM TODS 11, 1 (Mar. 1986), 27-47. Google Scholar
- GOPAL, A., STRONG, R., TOUEG, S., AND CRISTIAN, F., 1990. Early-delivery atomic broadcast. To appear in Proceedings of the 9th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (Quebec City, Quebec, Aug. 1990). Google Scholar
- GRAY, J. 1978. Notes on data base operating systems. In Operating Systems: An Advanced Course, Lecture Notes in Computer Science. Vol. 60. Springer- Verlag, New York, pp. 393-481. Google Scholar
- HALPERN, J., SIMONS, B., STRONG, R., AND DOLEV, D. 1984. Fault-tolerant clock synchronization. In Proceedings of the 3rd A CM SIGA CT-SIGOPS Symposium on Principles of Distributed Computing (Vancouver, Canada, Aug.), pp. 89-102. Google Scholar
- HUTCHINSON, N., AND PETERSON, L. 1988. Design of the x-kernel. In Proceedings of SIGCOMM '88--Symposium on Communication Architectures and Protocols (Stanford, Calif., Aug.), pp. 65-75. Google Scholar
- LAMPORT, L. 1978a. Time, clocks and the ordering of events in a distributed system. Commun. ACM 21, 7 (July), 558-565. Google Scholar
- LAMPORT, L. 1979b. The implementation of reliable distributed multiprocess systems. Comput. Networks 2, 95-114.Google Scholar
- LAMPORT, L. 1984. Using time instead of timeout for fault-tolerance in distributed systems. ACM TOPLAS 6, 2 (Apr.), 254-280. Google Scholar
- LAMPORT, L. 1989. The part-time parliament. Tech. Rep. 49. Digital Equipment Corporation Systems Research Center, Palo Alto, Calif.Google Scholar
- LAMPORT, L., AND MELLIAR-SMITH, P. M. 1984. Byzantine clock synchronization. In Proceedings of the 3rd A CM SIGA CT-SIGOPS Symposium on Principles of Distributed Computing (Vancouver, Canada, Aug.), 68-74. Google Scholar
- LAMPORT, L., SHOSTAK, R., AND PEASE, M. 1982. The Byzantine generals problem. ACM TOPLAS 4, 3 (July), 382-401. Google Scholar
- LISKOV, B., AND LADIN, R. 1986. Highly available distributed services and fault-tolerant distributed garbage collection. In Proceedings of the 5th A CM Symposium on Principles of Distributed Computing (Calgary, Alberta, Canada, Aug.), ACM, pp. 29-39. Google Scholar
- MANCINI, L., AND PAPPALARDO, G. 1988. Towards a theory of replicated processing. Formal Techniques in Real-Time and Fault-Tolerant Systems. Lecture Notes in Computer Science, Vol. 331. Springer-Verlag, New York, pp. 175-192. Google Scholar
- MARZULLO, K. 1989. Implementing fault-tolerant sensors. Tech. Rep. TR 89-997. Computer Science Dept., Cornell Univ., Ithaca, New York.Google Scholar
- MARZULLO, K., AND SCHMUCK, F. 1988. Supplying high availability with a standard network file system. In Proceedings of the 8th International Conference on Distributed Computing Systems (San Jose, CA, June), IEEE Computer Society, pp. 447-455.Google Scholar
- PETERSON, L. L., BUCHOLZ, N. C., AND SCHLICHT- ING, R. D. 1989. Preserving and using context information in interprocess communication. ACM TOCS 7, 3 (Aug.), 217-246. Google Scholar
- PITTELLI, F. M., AND GARCIA-MOLINA, S. 1989. Reliable scheduling in a TMR database system. ACM TOCS 7, 1 (Feb.), 25-60. Google Scholar
- SCHLICHTING, R. D., AND SCHNEIDER, F. B. 1983. Fail-Stop processors: An approach to designing fault-tolerant computing systems. ACM TOCS I, 3 (Aug.), 222-238. Google Scholar
- SCHNEIDER, F. B. 1980. Ensuring consistency on a distributed database system by use of distributed semaphores. In Proceedings of International Symposium on Distributed Data Bases (Paris, France, Mar.), INRIA, pp. 183-189.Google Scholar
- SCHNEIDER, F. B. 1982. Synchronization in distributed programs. ACM TOPLAS 4, 2 (Apr.), 179-195. Google Scholar
- SCHNEIDER, F. B. 1984. Byzantine generals in action: Implementing fail-stop processors. ACM TOCS 2, 2 (May), 145-154. Google Scholar
- SCHNEIDER, F. B. 1985. Paradigms for distributed programs. Distributed Systems. Methods and Tools for Specification. Lecture Notes in Computer Science, Vol. 190. Springer-Verlag, New York, pp. 343-430. Google Scholar
- SCHNEIDER, F. B. 1986. A paradigm for reliable clock synchronization. In Proceedings of the Advanced Seminar on Real-Time Local Area Networks (Bandol, France, Apr.), INRIA, pp. 85-104.Google Scholar
- SCHNEIDER, F. B., GRIES, D., AND SCHLICHTING, R. D. 1984. Fault-tolerant broadcasts. Sci. Comput. Program. 4, 1-15. Google Scholar
- SIEWIOREK, D. P., AND SWARZ, R. S. 1982. The Theory and Practice of Reliable System Design. Digital Press, Bedford, Mass.Google Scholar
- SKEEN, D. 1982. Crash recovery in a distributed database system. Ph.D. dissertation, Univ. of California at Berkeley, May.Google Scholar
- STRONG, H. R., AND DOLEV, D. 1983. Byzantine agreement. Intellectual Leverage for the Information Society, Digest of Papers. (Compcon 83, IEEE Computer Society, Mar.), IEEE Computer Society, pp. 77-82.Google Scholar
- WENSLEY, J., WENSKY, J. H., LAMPORT, L., GOLDBERG, J., GREEN, M. W., LEVITT, K. N., MELLIAR-SMITH, P. M., SHOSTAK, R. E., and WEINSTOCK, C. B. 1978. SIFT: Design and analysis of a fault-tolerant computer for aircraft control. Proc. IEEE 66, 10 (Oct.), 1240-1255.Google Scholar
Recommendations
Implementing fault-tolerant services using state machines: beyond replication
DISC'10: Proceedings of the 24th international conference on Distributed computingThis paper describes a method to implement fault-tolerant services in distributed systems based on the idea of fused state machines. The theory of fused state machines uses a combination of coding theory and replication to ensure efficiency as well as ...
Byzantine Fault Tolerant State Machine Replication in Any Programming Language
PODC '19: Proceedings of the 2019 ACM Symposium on Principles of Distributed ComputingState machine replication is a fundamental primitive in fault tolerant distributed computing, but few production tools exist to support the replication of arbitrary state machines. The tools that do exist, like Apache Zookeeper, CoreOS's etcd, and ...
Fault tolerant control using a fuzzy predictive approach
This paper proposes the application of fault-tolerant control (FTC) using fuzzy predictive control. The FTC approach is based on two steps, fault detection and isolation (FDI) and fault accommodation. The fault detection is performed by a model-based ...
Comments