Abstract
Cloud computing services often replicate data and may require ways to coordinate distributed actions. Here we present Derecho, a library for such tasks. The API provides interfaces for structuring applications into patterns of subgroups and shards, supports state machine replication within them, and includes mechanisms that assist in restart after failures. Running over 100Gbps RDMA, Derecho can send millions of events per second in each subgroup or shard and throughput peaks at 16GB/s, substantially outperforming prior solutions. Configured to run purely on TCP, Derecho is still substantially faster than comparable widely used, highly-tuned, standard tools. The key insight is that on modern hardware (including non-RDMA networks), data-intensive protocols should be built from non-blocking data-flow components.
- {n. d.}. LibPaxos: Open-source Paxos. Retrieved from http://libpaxos.sourceforge.net/.Google Scholar
- {n. d.}. RDMA-Paxos: Open-source Paxos. Retrieved from https://github.com/wangchenghku/RDMA-PAXOS.Google Scholar
- 2011. Vsync Reliable Multicast Library. Retrieved from http://vsync.codeplex.com/.Google Scholar
- 2012. Gbcast Protocol. Retrieved from https://en.wikipedia.org/wiki/Gbcast.Google Scholar
- Mahesh Balakrishnan, Dahlia Malkhi, John D. Davis, Vijayan Prabhakaran, Michael Wei, and Ted Wobber. 2013. CORFU: A distributed shared log. ACM Trans. Comput. Syst. 31, 4, Article 10 (Dec. 2013), 24 pages. Google ScholarDigital Library
- Bela Ban. 2002. JGroups Reliable Multicast Library. Retrieved from http://jgroups.org/.Google Scholar
- Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania. 2009. The multikernel: A new OS architecture for scalable multicore systems. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP’09). ACM, New York, NY, 29--44. Google ScholarDigital Library
- Behrens, Jonathan and Birman, Ken and Jha, Sagar and Tremel, Edward. 2018. RDMC: A reliable RDMA multicast for large objects. In Proceedings of the 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’18). IEEE Computer Society, Los Alamitos, CA, 1--12.Google Scholar
- Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. 2014. IX: A protected dataplane operating system for high throughput and low latency. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI’14). USENIX Association, Broomfield, CO, 49--65. Google ScholarDigital Library
- Kenneth Birman. 2012. Guide to Reliable Distributed Systems. Number XXII in Texts in Computer Science. Springer-Verlag, London. Google ScholarDigital Library
- Kenneth P. Birman. 1985. Replication and fault-tolerance in the isis system. In Proceedings of the 10th ACM Symposium on Operating Systems Principles (SOSP’85). ACM, New York, NY, 79--86. Google ScholarDigital Library
- Kenneth P. Birman. 2012. Guide to Reliable Distributed Systems: Building High-Assurance Applications and Cloud-Hosted Services. Springer Verlag Texts in Computer Science, New York, NY. Google ScholarDigital Library
- Kenneth P. Birman and Thomas A. Joseph. 1987. Exploiting virtual synchrony in distributed systems. In Proceedings of the 11th ACM Symposium on Operating Systems Principles (SOSP’87). ACM, New York, NY, 123--138. Google ScholarDigital Library
- Kenneth P. Birman and Thomas A. Joseph. 1987. Reliable communication in the presence of failures. ACM Trans. Comput. Syst. 5, 1 (Jan. 1987), 47--76. Google ScholarDigital Library
- Eric Brewer. 2010. A certain freedom: Thoughts on the CAP theorem. In Proceedings of the 29th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC’10). ACM, New York, NY, 335--335. Google ScholarDigital Library
- Brad Calder, Ju Wang, Aaron Ogus, Niranjan Nilakantan, Arild Skjolsvold, Sam McKelvie, Yikang Xu, Shashwat Srivastav, Jiesheng Wu, Huseyin Simitci, Jaidev Haridas, Chakravarthy Uddaraju, Hemal Khatri, Andrew Edwards, Vaman Bedekar, Shane Mainali, Rafay Abbasi, Arpit Agarwal, Mian Fahim ul Haq, Muhammad Ikram ul Haq, Deepali Bhardwaj, Sowmya Dayanand, Anitha Adusumilli, Marvin McNett, Sriram Sankaran, Kavitha Manivannan, and Leonidas Rigas. 2011. Windows azure storage: A highly available cloud storage service with strong consistency. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP’11). ACM, New York, NY, 143--157. Google ScholarDigital Library
- Tushar D. Chandra, Robert Griesemer, and Joshua Redstone. 2007. Paxos made live: An engineering perspective. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing (PODC’07). ACM, New York, NY, 398--407. Google ScholarDigital Library
- Tushar Deepak Chandra and Sam Toueg. 1996. Unreliable failure detectors for reliable distributed systems. J. ACM 43, 2 (Mar. 1996), 225--267. Google ScholarDigital Library
- K. Mani Chandy and Leslie Lamport. 1985. Distributed snapshots: Determining global states of distributed systems. ACM Trans. Comput. Syst. 3, 1 (Feb. 1985), 63--75. Google ScholarDigital Library
- Jo-Mei Chang and N. F. Maxemchuk. 1984. Reliable broadcast protocols. ACM Trans. Comput. Syst. 2, 3 (Aug. 1984), 251--273. Google ScholarDigital Library
- Neil Conway, William R. Marczak, Peter Alvaro, Joseph M. Hellerstein, and David Maier. 2012. Logic and lattices for distributed programming. In Proceedings of the 3rd ACM Symposium on Cloud Computing (SoCC’12). ACM, New York, NY. Google ScholarDigital Library
- James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2012. Spanner: Google’s globally-distributed database. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI’12). USENIX Association, Berkeley, CA, 251--264. http://dl.acm.org/citation.cfm?id=2387880.2387905 Google ScholarDigital Library
- Heming Cui, Rui Gu, Cheng Liu, Tianyu Chen, and Junfeng Yang. 2015. Paxos made transparent. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP’15). ACM, New York, NY, 105--120. Google ScholarDigital Library
- Huynh Tu Dang, Daniele Sciascia, Marco Canini, Fernando Pedone, and Robert Soulé. 2015. NetPaxos: Consensus at network speed. In Proceedings of the 1st ACM SIGCOMM Symposium on Software Defined Networking Research (SOSR’15). ACM, New York, NY, Article 5, 7 pages. Google ScholarDigital Library
- Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, and Miguel Castro. 2014. FaRM: Fast remote memory. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation (NSDI’14). USENIX Association, Berkeley, CA, 401--414. http://dl.acm.org/citation.cfm?id=2616448.2616486 Google ScholarDigital Library
- Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. 1985. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2 (Apr. 1985), 374--382. Google ScholarDigital Library
- Roy Friedman and Robbert van Renesse. 1997. Packing messages as a tool for boosting the performance of total ordering protocols. In Proceedings of the 6th IEEE International Symposium on High Performance Distributed Computing (HPDC’97). Google ScholarDigital Library
- Prasanna Ganesan and M. Seshadri. 2005. On cooperative content distribution and the price of barter. In Proceedings of the 25th IEEE International Conference on Distributed Computing Systems (ICDCS’05). 81--90. Google ScholarDigital Library
- Seth Gilbert and Nancy Lynch. 2002. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News 33, 2 (Jun. 2002), 51--59. Google ScholarDigital Library
- Rachid Guerraoui, Ron R. Levy, Bastian Pochon, and Vivien Quéma. 2010. Throughput optimal total order broadcast for cluster environments. ACM Trans. Comput. Syst. 28, 2, Article 5 (Jul. 2010), 32 pages. Google ScholarDigital Library
- Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitu Padhye, and Marina Lipshteyn. 2016. RDMA over commodity ethernet at scale. In Proceedings of the 2016 ACM SIGCOMM Conference (SIGCOMM’16). ACM, New York, NY, 202--215. Google ScholarDigital Library
- Joseph Y. Halpern and Yoram Moses. 1990. Knowledge and common knowledge in a distributed environment. J. ACM 37, 3 (Jul. 1990), 549--587. Google ScholarDigital Library
- Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2014. Using RDMA efficiently for key-value services. In Proceedings of the 2014 ACM Conference on SIGCOMM (SIGCOMM’14). ACM, New York, NY, 295--306. Google ScholarDigital Library
- Idit Keidar and Alexander Shraer. 2006. Timeliness, failure-detectors, and consensus performance. In Proceedings of the 25th Annual ACM Symposium on Principles of Distributed Computing (PODC’06). ACM, New York, NY, 169--178. Google ScholarDigital Library
- Leslie Lamport. 1998. The part-time parliament. ACM Trans. Comput. Syst. 16, 2 (May 1998), 133--169. Google ScholarDigital Library
- Butler Lampson. 2001. The ABCD’s of Paxos. In Proceedings of the 20th Annual ACM Symposium on Principles of Distributed Computing (PODC’01). ACM, New York, NY, 13--.Google ScholarDigital Library
- Jialin Li, Ellis Michael, Naveen Kr Sharma, Adriana Szekeres, and Dan RK Ports. 2016. Just say NO to Paxos overhead: Replacing consensus with network ordering. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). Google ScholarDigital Library
- Parisa Jalili Marandi, Samuel Benz, Fernando Pedonea, and Kenneth P. Birman. 2014. The performance of paxos in the cloud. In Proceedings of the 2014 IEEE 33rd International Symposium on Reliable Distributed Systems (SRDS’14). IEEE Computer Society, Los Alamitos, CA, 41--50. Google ScholarDigital Library
- D. Mazieres. 2007. Paxos made practical. Technical report. Retrieved from http://www.scs.stanford.edu/dm/home/papers.Google Scholar
- Christopher Mitchell, Yifeng Geng, and Jinyang Li. 2013. Using one-sided RDMA reads to build a fast, CPU-efficient key-value store. In Proceedings of the 2013 USENIX Conference on Annual Technical Conference (USENIX ATC’13). USENIX Association, Berkeley, CA, 103--114. Google ScholarDigital Library
- Diego Ongaro and John Ousterhout. 2014. In search of an understandable consensus algorithm. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference (USENIX ATC’14). USENIX Association, Berkeley, CA, 305--320. Google ScholarDigital Library
- John Ousterhout, Parag Agrawal, David Erickson, Christos Kozyrakis, Jacob Leverich, David Mazières, Subhasish Mitra, Aravind Narayanan, Diego Ongaro, Guru Parulkar, Mendel Rosenblum, Stephen M. Rumble, Eric Stratmann, and Ryan Stutsman. 2011. The case for RAMCloud. Commun. ACM 54, 7 (Jul. 2011), 121--130. Google ScholarDigital Library
- Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. 2014. Arrakis: The operating system is the control plane. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI’14). USENIX Association, Berkeley, CA, 1--16. Google ScholarDigital Library
- Marius Poke and Torsten Hoefler. 2015. DARE: High-performance state machine replication on RDMA networks. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC’15). ACM, New York, NY, 107--118. Google ScholarDigital Library
- Marius Poke, Torsten Hoefler, and Colin Glass. 2017. AllConcur: Leaderless concurrent atomic broadcast. In Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing (HPDC’17). ACM, New York, NY, 18. Google ScholarDigital Library
- Roberto De Prisco, Butler W. Lampson, and Nancy A. Lynch. 1997. Revisiting the paxos algorithm. In Proceedings of the 11th International Workshop on Distributed Algorithms (WDAG’97). Springer-Verlag, London, 111--125. Google ScholarDigital Library
- Dan Pritchett. 2008. BASE: An ACID alternative. Queue 6, 3 (May 2008), 48--55. Google ScholarDigital Library
- Fred B. Schneider. 1990. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Comput. Surv. 22, 4 (Dec. 1990), 299--319. Google ScholarDigital Library
- Steve Shankland. 2008. Google’s Jeff Dean spotlights data center inner workings. C|Net Reviews (May 2008). Retrieved from https://www.cnet.com/news/google-spotlights-data-center-inner-workings/.Google Scholar
- Weijia Song, Theo Gkountouvas, Ken Birman, Qi Chen, and Zhen Xiao. 2016. The freeze-frame file system. In Proceedings of the 7th ACM Symposium on Cloud Computing (SoCC’16). ACM, New York, NY, 307--320. Google ScholarDigital Library
- Robbert Van Renesse and Deniz Altinbuken. 2015. Paxos made moderately complex. ACM Comput. Surv. 47, 3, Article 42 (Feb. 2015), 36 pages. Google ScholarDigital Library
- Robbert van Renesse and Fred B. Schneider. 2004. Chain replication for supporting high throughput and availability. In Proceedings of the 6th Conference on Symposium on Operating Systems Design 8 Implementation, Volume 6 (OSDI’04). USENIX Association, Berkeley, CA, USA, 7--7. Google ScholarDigital Library
- T. von Eicken, A. Basu, V. Buch, and W. Vogels. 1995. U-Net: A user-level network interface for parallel and distributed computing. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP’95). ACM, New York, NY, 40--53. Google ScholarDigital Library
- Cheng Wang, Jianyu Jiang, Xusheng Chen, Ning Yi, and Heming Cui. 2017. APUS: Fast and scalable paxos on RDMA. In Proceedings of the 8th ACM Symposium on Cloud Computing (SoCC’17). ACM, Santa Clara, CA, 14. Google ScholarDigital Library
- Michael Wei, Amy Tai, Christopher J. Rossbach, Ittai Abraham, Maithem Munshed, Medhavi Dhawan, Jim Stabile, Udi Wieder, Scott Fritchie, Steven Swanson, Michael J. Freedman, and Dahlia Malkhi. 2017. vCorfu: A cloud-scale object store on a shared log. In Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI’17). USENIX Association, 35--49. Google ScholarDigital Library
- Xingda Wei, Jiaxin Shi, Yanzhe Chen, Rong Chen, and Haibo Chen. 2015. Fast In-memory transaction processing using RDMA and HTM. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP’15). ACM, New York, NY, 87--104. Google ScholarDigital Library
Index Terms
- Derecho: Fast State Machine Replication for Cloud Services
Recommendations
Odyssey: the impact of modern hardware on strongly-consistent replication protocols
EuroSys '21: Proceedings of the Sixteenth European Conference on Computer SystemsGet/Put Key-Value Stores (KVSes) rely on replication protocols to enforce consistency and guarantee availability. Today's modern hardware, with manycore servers and RDMA-capable networks, challenges the conventional wisdom on protocol design. In this ...
Hermes: A Fast, Fault-Tolerant and Linearizable Replication Protocol
ASPLOS '20: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating SystemsToday's datacenter applications are underpinned by datastores that are responsible for providing availability, consistency, and performance. For high availability in the presence of failures, these datastores replicate data across several nodes. This is ...
Consistent and automatic replica regeneration
Reducing management costs and improving the availability of large-scale distributed systems require automatic replica regeneration, that is, creating new replicas in response to replica failures. A major challenge to regeneration is maintaining ...
Comments