ABSTRACT
Today, distributed systems are typically designed to be largely asynchronous. Designers assume that the network can drop or significantly delay messages at unpredictable times, that there is no way to know how quickly a node will process a message or how soon it will respond, and that the clocks of different nodes are at most loosely synchronized. These assumptions are certainly safe, but they come at a price: many applications genuinely need predictable performance, which, on top of an asynchronous system, can only be approximated at great cost and with substantial redundancy; moreover, many distributed protocols for asynchronous systems are far more complex and expensive than their synchronous counterparts.
The goal of this paper is to start a discussion about whether the asynchronous model is (still) the right choice for most distributed systems, especially ones that belong to a single administrative domain. We argue that 1) by using ideas from the cyber-physical systems (CPS) domain, it is technically feasible to build datacenter-scale systems that are fully synchronous, and that 2) such systems would have several interesting advantages over current designs.
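One way to see the complexity gap the abstract alludes to is failure detection. In a fully synchronous model, nodes proceed in lock-step rounds and every correct node's message is guaranteed to arrive within the round, so a missing message immediately implies a crash; an asynchronous system instead needs timeouts and unreliable failure detectors that can be wrong. The sketch below is purely illustrative (not from the paper); all names and the simulated inbox structure are assumptions made for the example.

```python
# Illustrative sketch: failure detection in a lock-step synchronous model.
# The synchrony assumption is that every correct node's message arrives
# within its round, so absence of a message is proof of a crash.

def run_rounds(nodes, inboxes, num_rounds):
    """Simulate lock-step rounds.

    `inboxes` maps a round number to the set of node IDs whose
    messages arrived in that round. Returns (alive, crashed) sets.
    """
    alive = set(nodes)
    crashed = set()
    for r in range(num_rounds):
        arrived = inboxes.get(r, set())
        # Any still-alive node that sent nothing this round has crashed.
        newly_crashed = alive - arrived
        crashed |= newly_crashed
        alive -= newly_crashed
    return alive, crashed

# Example: node "c" stops sending after round 0 and is detected in round 1.
alive, crashed = run_rounds(
    nodes={"a", "b", "c"},
    inboxes={0: {"a", "b", "c"}, 1: {"a", "b"}, 2: {"a", "b"}},
    num_rounds=3,
)
```

Under asynchrony this one-line inference is unsound, since a slow or delayed node is indistinguishable from a crashed one, which is exactly why asynchronous systems resort to heavier machinery such as the failure detectors of Chandra and Toueg.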