Understanding fault-tolerant distributed systems

Published: 01 February 1991
                Reviews

                Robert Joel Hofkin

                Nomenclature is always a problem in rapidly developing areas such as fault-tolerant computing or distributed systems. What at first appears to be a serious disagreement may be nothing more than an unfortunate choice of words. We often use many different terms for one concept, and sometimes one term denotes several concepts. We can learn from even a poor attempt to organize a field into some unifying concepts, and this paper represents some good analysis. It will probably not be the definitive description of distributed, fault-tolerant systems, but it is certainly a reasonable starting point. The focus is on clearly defined terminology for the unit of failure in software and hardware, and on the propagation semantics when one of these units fails. The perspective is one of describing the structure of known systems, not of postulating new systems. This viewpoint is at once the paper's strength and its weakness: it allows a concise description of the field today, but it may be superseded by new discoveries. Also, the choice of terminology is good, but more discussion of the alternatives might have been helpful. Fortunately, CACM's typographic adventurism has not hurt the legibility of this well-written paper, so it can serve as an introduction to an area that undoubtedly will grow in importance and sophistication.

                • Published in

                  Communications of the ACM, Volume 34, Issue 2
                  Feb. 1991
                  64 pages
                  ISSN:0001-0782
                  EISSN:1557-7317
                  DOI:10.1145/102792
                  • Editor: Peter Denning

                  Copyright © 1991 ACM

                  Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                  Publisher

                  Association for Computing Machinery

                  New York, NY, United States
