Understanding fault-tolerant distributed systems

Author:
Flaviu Cristian

Communications of the ACM, Volume 34, Issue 2, Feb. 1991, pp. 56–78. https://doi.org/10.1145/102792.102801

Published: 01 February 1991

Reviews

Reviewer: Robert Joel Hofkin

Nomenclature is always a problem in rapidly developing areas such as fault-tolerant computing or distributed systems. What at first appears to be a serious disagreement may be nothing more than an unfortunate choice of words. We often use many different terms for one concept, and sometimes one term denotes several concepts. We can learn from even a poor attempt to organize a field into some unifying concepts, and this paper represents some good analysis. It will probably not be the definitive description of distributed, fault-tolerant systems, but it is certainly a reasonable starting point. The focus is on clearly defined terminology for the unit of failure in software and hardware, and on the propagation semantics when one of these units fails. The perspective is one of describing the structure of known systems, not of postulating new systems. This viewpoint is at once the paper's strength and its weakness: it allows a concise description of the field today, but it may be superseded by new discoveries. Also, the choice of terminology is good, but more discussion of the alternatives might have been helpful. Fortunately, CACM's typographic adventurism has not hurt the legibility of this well-written paper, so it can serve as an introduction to an area that undoubtedly will grow in importance and sophistication.

Published in
Communications of the ACM Volume 34, Issue 2
Feb. 1991
64 pages
ISSN:0001-0782
EISSN:1557-7317
DOI:10.1145/102792
Editor:
Peter Denning
NASA Ames Research Center, Moffett Field, CA
Copyright © 1991 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States