article

Free Access

Reliable communication in the presence of failures

Authors:
Kenneth P. Birman

Cornell Univ., Ithaca, NY

Cornell Univ., Ithaca, NY
View Profile

,
Thomas A. Joseph

Cornell Univ., Ithaca, NY

Cornell Univ., Ithaca, NY
View Profile

Authors Info & Claims

ACM Transactions on Computer Systems Volume 5 Issue 1pp 47–76https://doi.org/10.1145/7351.7478

Published:05 January 1987Publication History

ACM Transactions on Computer Systems

Abstract

The design and correctness of a communication facility for a distributed computer system are reported on. The facility provides support for fault-tolerant process groups in the form of a family of reliable multicast protocols that can be used in both local- and wide-area networks. These protocols attain high levels of concurrency, while respecting application-specific delivery ordering constraints, and have varying cost and performance that depend on the degree of ordering desired. In particular, a protocol that enforces causal delivery orderings is introduced and shown to be a valuable alternative to conventional asynchronous communication protocols. The facility also ensures that the processes belonging to a fault-tolerant process group will observe consistent orderings of events affecting the group as a whole, including process failures, recoveries, migration, and dynamic changes to group properties like member rankings. A review of several uses for the protocols in the ISIS system, which supports fault-tolerant resilient objects and bulletin boards, illustrates the significant simplification of higher level algorithms made possible by our approach.

References

1 BERNSTEIN, P. A., AND GOODMAN, N. Concurrency control in distributed database systems. ACM Comput. Surv. 13, 2 (June 1981), 185-221. Google Scholar
2 B{RMAN, K. Replication and availability in the ISIS system. In Proceedings o{ the l Oth A CM Symposium on Operating Systems Principles, Oper. Syst. Rev. 19, 5 (Dec. 1985), 79-86. Google Scholar
3 BIRMAN, K., JOSEPH, T., SCHMUCK, F., AND STEPHENSON, P. Programming with shared bulletin boards in asynchronous distributed systems. Tech. Rep. TR 86-772, Dept. of Computer Science, Cornell Univ., Aug. 1986. Google Scholar
4 CHANG, J., AND MAXEMCHUK, N.F. Reliable broadcast protocols. ACM Trans. Comput. Syst. 2, 3 (Aug. 1984), 251-273. Google Scholar
5 CHEmTON, D. R., AND ZWAENEPOEL, W. Distributed process groups in the V kernel. ACM Trans. Comput. Syst. 3, 2 (May 1985), 77-107. Google Scholar
6 COOPER, E. Replicated distributed programs. In Proceedings of the lOth ACM Symposium on Operating Systems Principles; Oper. Syst. Rev. 19, 5 (Dec. 1985), 63-78. Google Scholar
7 CRISTIAN, F., AGHILI, H., STRONG, R., AND DOLEV, D. Atomic broadcast: From simple message diffusion to Byzantine agreement. IBM Tech. Rep. RJ 4540 (48668), Oct. 1984.Google Scholar
8 EL ABBADI, A., AND TOUEG, S. Availability in partitioned, replicated databases. In Proceedings of the 5th ACM Symposium on Principles of Database Systems (Boston, Mass., Mar.). ACM, New York, 1986. Google Scholar
9 EL ABBADI, A., SKEEN, D., AND CRISTIAN, F. An efficient algorithm for replicated data management. In Proceedings of the 4th A CM Symposium on Principles of Database Systems (Portland, Oreg., Mar.). ACM, New York, 1985, pp. 215-229. Google Scholar
10 GRAY, J. Notes on database operating systems, in Operating Systems: An Advanced Course, G. Goos and J. Hartmannis, Eds. Lecture Notes in Computer Science, vol. 60. Springer-Verlag, New York, 1978. Google Scholar
11 GOODMAN, N., SKEEN, D., CHAN, A., DAYAL, U., FOX, S., AND RIES, D. A recovery algorithm for a distributed database system, in Proceedings of the 2nd A CM Symposium on Principles of Database Systems (Atlanta, Ga., Mar.). ACM, New York, 1983, pp. 8-15. Google Scholar
12 JOSEPH, T., AND BIRMAN, K. Low cost management of replicated data in fault-tolerant distributed systems. ACM Trans. Comput. Syst. 4, 1 (Feb. 1986), 54-70. Google Scholar
13 LAMPORT, L. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21, 7 (july 1978), 558-565. Google Scholar
14 SCHLICHTING, R. D., AND SCHNEIDER, F.B. Fail-stop processors: An approach to designing fault-tolerant computing systems. ACM Trans. Comput. Syst. 1, 3 (Aug. 1983), 222-238. Google Scholar
15 SCHNEIDER, F., GRIES, D., AND SCHLICHTING, R. Fault-tolerant broadcasts. Sci. Comput. Program. 4, 1 (Mar. 1984), 1-15. Google Scholar
16 SKEEN, D. Crash recovery in distributed database systems. Ph.D. dissertation, Dept. of Electrical Engineering and Computer Science, Univ. of California, Berkeley, 1980.Google Scholar
17 SKEEN, D. Determining the last process to fail. ACM Trans. Comput. Syst. 3, 1 (Feb. 1985), 15-30. Google Scholar

Index Terms

Reliable communication in the presence of failures

Recommendations

On Quiescent Reliable Communication

We study the problem of achieving reliable communication with quiescent algorithms (i.e., algorithms that eventually stop sending messages) in asynchronous systems with process crashes and lossy links. We first show that it is impossible to solve this ...
Read More
Reliable Communication in the Presence of Failures
Read More
On Quiescent Reliable Communication
Read More

Reviews

Reviewer: Andrew Robert Huber

The premise of this paper is that message orderings should be included in the communications layer of a distributed system. This approach is intended to maximize concurrency at the communications level, yet allow processes to determine (when desirable) the order in which messages will be seen by their recipients. Three protocols are presented that allow ordering of messages relative to process failures (and recoveries), as well as to other messages. (This only need be a partial ordering.) Neither shared memory nor closely synchronized clocks are needed for the protocols. Implementations of the protocols are described, including correctness arguments (although deadlock and livelock are omitted). This description includes some desirable optimizations. A brief performance section is unsatisfying; it presents a simple case for which the protocols do exhibit better performance (in elapsed time) than a synchronous version. Not addressed are the additional processor, memory, and communications loads imposed on the system. These may be significant, especially as system size grows, since 3 n or more messages will result from a process sending a message to n recipients; and in some cases, histories of messages must be maintained and transmitted. The work's value is in extending the use of cooperating processes as an organizing method for distributed systems to (1) fault-tolerant applications using highly concurrent or asynchronous algorithms, and (2) hierarchical distributed systems and simple LANs. Limitations include (1) the inability to handle partitions in the network, and (2) the point that only halting failures, and not incorrect operations, are permitted.

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Computer Systems Volume 5, Issue 1
Feb. 1987
93 pages
ISSN:0734-2071
EISSN:1557-7333
DOI:10.1145/7351
Issue’s Table of Contents

Copyright © 1987 ACM
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 5 January 1987
Published in tocs Volume 5, Issue 1

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 840
  Total Citations
  View Citations
- 4,357
  Total Downloads
- Downloads (Last 12 months)269
- Downloads (Last 6 weeks)41
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Reliable communication in the presence of failures

ACM Transactions on Computer Systems

Abstract

References

Cited By

Index Terms

Recommendations

On Quiescent Reliable Communication

Reliable Communication in the Presence of Failures

On Quiescent Reliable Communication

Reviews

Access critical reviews of Computing literature here

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Reliable communication in the presence of failures

ACM Transactions on Computer Systems

Abstract

References

Cited By

Index Terms

Recommendations

On Quiescent Reliable Communication

Reliable Communication in the Presence of Failures

On Quiescent Reliable Communication

Reviews

Access critical reviews of Computing literature here

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media