skip to main content
research-article

Derecho: Fast State Machine Replication for Cloud Services

Published:02 April 2019Publication History
Skip Abstract Section

Abstract

Cloud computing services often replicate data and may require ways to coordinate distributed actions. Here we present Derecho, a library for such tasks. The API provides interfaces for structuring applications into patterns of subgroups and shards, supports state machine replication within them, and includes mechanisms that assist in restart after failures. Running over 100Gbps RDMA, Derecho can send millions of events per second in each subgroup or shard and throughput peaks at 16GB/s, substantially outperforming prior solutions. Configured to run purely on TCP, Derecho is still substantially faster than comparable widely used, highly-tuned, standard tools. The key insight is that on modern hardware (including non-RDMA networks), data-intensive protocols should be built from non-blocking data-flow components.

References

  1. {n. d.}. LibPaxos: Open-source Paxos. Retrieved from http://libpaxos.sourceforge.net/.Google ScholarGoogle Scholar
  2. {n. d.}. RDMA-Paxos: Open-source Paxos. Retrieved from https://github.com/wangchenghku/RDMA-PAXOS.Google ScholarGoogle Scholar
  3. 2011. Vsync Reliable Multicast Library. Retrieved from http://vsync.codeplex.com/.Google ScholarGoogle Scholar
  4. 2012. Gbcast Protocol. Retrieved from https://en.wikipedia.org/wiki/Gbcast.Google ScholarGoogle Scholar
  5. Mahesh Balakrishnan, Dahlia Malkhi, John D. Davis, Vijayan Prabhakaran, Michael Wei, and Ted Wobber. 2013. CORFU: A distributed shared log. ACM Trans. Comput. Syst. 31, 4, Article 10 (Dec. 2013), 24 pages. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Bela Ban. 2002. JGroups Reliable Multicast Library. Retrieved from http://jgroups.org/.Google ScholarGoogle Scholar
  7. Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania. 2009. The multikernel: A new OS architecture for scalable multicore systems. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP’09). ACM, New York, NY, 29--44. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Behrens, Jonathan and Birman, Ken and Jha, Sagar and Tremel, Edward. 2018. RDMC: A reliable RDMA multicast for large objects. In Proceedings of the 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’18). IEEE Computer Society, Los Alamitos, CA, 1--12.Google ScholarGoogle Scholar
  9. Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. 2014. IX: A protected dataplane operating system for high throughput and low latency. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI’14). USENIX Association, Broomfield, CO, 49--65. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Kenneth Birman. 2012. Guide to Reliable Distributed Systems. Number XXII in Texts in Computer Science. Springer-Verlag, London. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Kenneth P. Birman. 1985. Replication and fault-tolerance in the isis system. In Proceedings of the 10th ACM Symposium on Operating Systems Principles (SOSP’85). ACM, New York, NY, 79--86. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Kenneth P. Birman. 2012. Guide to Reliable Distributed Systems: Building High-Assurance Applications and Cloud-Hosted Services. Springer Verlag Texts in Computer Science, New York, NY. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Kenneth P. Birman and Thomas A. Joseph. 1987. Exploiting virtual synchrony in distributed systems. In Proceedings of the 11th ACM Symposium on Operating Systems Principles (SOSP’87). ACM, New York, NY, 123--138. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Kenneth P. Birman and Thomas A. Joseph. 1987. Reliable communication in the presence of failures. ACM Trans. Comput. Syst. 5, 1 (Jan. 1987), 47--76. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Eric Brewer. 2010. A certain freedom: Thoughts on the CAP theorem. In Proceedings of the 29th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC’10). ACM, New York, NY, 335--335. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Brad Calder, Ju Wang, Aaron Ogus, Niranjan Nilakantan, Arild Skjolsvold, Sam McKelvie, Yikang Xu, Shashwat Srivastav, Jiesheng Wu, Huseyin Simitci, Jaidev Haridas, Chakravarthy Uddaraju, Hemal Khatri, Andrew Edwards, Vaman Bedekar, Shane Mainali, Rafay Abbasi, Arpit Agarwal, Mian Fahim ul Haq, Muhammad Ikram ul Haq, Deepali Bhardwaj, Sowmya Dayanand, Anitha Adusumilli, Marvin McNett, Sriram Sankaran, Kavitha Manivannan, and Leonidas Rigas. 2011. Windows azure storage: A highly available cloud storage service with strong consistency. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP’11). ACM, New York, NY, 143--157. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Tushar D. Chandra, Robert Griesemer, and Joshua Redstone. 2007. Paxos made live: An engineering perspective. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing (PODC’07). ACM, New York, NY, 398--407. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Tushar Deepak Chandra and Sam Toueg. 1996. Unreliable failure detectors for reliable distributed systems. J. ACM 43, 2 (Mar. 1996), 225--267. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. K. Mani Chandy and Leslie Lamport. 1985. Distributed snapshots: Determining global states of distributed systems. ACM Trans. Comput. Syst. 3, 1 (Feb. 1985), 63--75. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Jo-Mei Chang and N. F. Maxemchuk. 1984. Reliable broadcast protocols. ACM Trans. Comput. Syst. 2, 3 (Aug. 1984), 251--273. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Neil Conway, William R. Marczak, Peter Alvaro, Joseph M. Hellerstein, and David Maier. 2012. Logic and lattices for distributed programming. In Proceedings of the 3rd ACM Symposium on Cloud Computing (SoCC’12). ACM, New York, NY. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2012. Spanner: Google’s globally-distributed database. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI’12). USENIX Association, Berkeley, CA, 251--264. http://dl.acm.org/citation.cfm?id=2387880.2387905 Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Heming Cui, Rui Gu, Cheng Liu, Tianyu Chen, and Junfeng Yang. 2015. Paxos made transparent. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP’15). ACM, New York, NY, 105--120. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Huynh Tu Dang, Daniele Sciascia, Marco Canini, Fernando Pedone, and Robert Soulé. 2015. NetPaxos: Consensus at network speed. In Proceedings of the 1st ACM SIGCOMM Symposium on Software Defined Networking Research (SOSR’15). ACM, New York, NY, Article 5, 7 pages. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, and Miguel Castro. 2014. FaRM: Fast remote memory. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation (NSDI’14). USENIX Association, Berkeley, CA, 401--414. http://dl.acm.org/citation.cfm?id=2616448.2616486 Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. 1985. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2 (Apr. 1985), 374--382. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Roy Friedman and Robbert van Renesse. 1997. Packing messages as a tool for boosting the performance of total ordering protocols. In Proceedings of the 6th IEEE International Symposium on High Performance Distributed Computing (HPDC’97). Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Prasanna Ganesan and M. Seshadri. 2005. On cooperative content distribution and the price of barter. In Proceedings of the 25th IEEE International Conference on Distributed Computing Systems (ICDCS’05). 81--90. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Seth Gilbert and Nancy Lynch. 2002. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News 33, 2 (Jun. 2002), 51--59. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Rachid Guerraoui, Ron R. Levy, Bastian Pochon, and Vivien Quéma. 2010. Throughput optimal total order broadcast for cluster environments. ACM Trans. Comput. Syst. 28, 2, Article 5 (Jul. 2010), 32 pages. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitu Padhye, and Marina Lipshteyn. 2016. RDMA over commodity ethernet at scale. In Proceedings of the 2016 ACM SIGCOMM Conference (SIGCOMM’16). ACM, New York, NY, 202--215. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Joseph Y. Halpern and Yoram Moses. 1990. Knowledge and common knowledge in a distributed environment. J. ACM 37, 3 (Jul. 1990), 549--587. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2014. Using RDMA efficiently for key-value services. In Proceedings of the 2014 ACM Conference on SIGCOMM (SIGCOMM’14). ACM, New York, NY, 295--306. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Idit Keidar and Alexander Shraer. 2006. Timeliness, failure-detectors, and consensus performance. In Proceedings of the 25th Annual ACM Symposium on Principles of Distributed Computing (PODC’06). ACM, New York, NY, 169--178. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Leslie Lamport. 1998. The part-time parliament. ACM Trans. Comput. Syst. 16, 2 (May 1998), 133--169. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Butler Lampson. 2001. The ABCD’s of Paxos. In Proceedings of the 20th Annual ACM Symposium on Principles of Distributed Computing (PODC’01). ACM, New York, NY, 13--.Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Jialin Li, Ellis Michael, Naveen Kr Sharma, Adriana Szekeres, and Dan RK Ports. 2016. Just say NO to Paxos overhead: Replacing consensus with network ordering. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Parisa Jalili Marandi, Samuel Benz, Fernando Pedonea, and Kenneth P. Birman. 2014. The performance of paxos in the cloud. In Proceedings of the 2014 IEEE 33rd International Symposium on Reliable Distributed Systems (SRDS’14). IEEE Computer Society, Los Alamitos, CA, 41--50. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. D. Mazieres. 2007. Paxos made practical. Technical report. Retrieved from http://www.scs.stanford.edu/dm/home/papers.Google ScholarGoogle Scholar
  40. Christopher Mitchell, Yifeng Geng, and Jinyang Li. 2013. Using one-sided RDMA reads to build a fast, CPU-efficient key-value store. In Proceedings of the 2013 USENIX Conference on Annual Technical Conference (USENIX ATC’13). USENIX Association, Berkeley, CA, 103--114. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Diego Ongaro and John Ousterhout. 2014. In search of an understandable consensus algorithm. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference (USENIX ATC’14). USENIX Association, Berkeley, CA, 305--320. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. John Ousterhout, Parag Agrawal, David Erickson, Christos Kozyrakis, Jacob Leverich, David Mazières, Subhasish Mitra, Aravind Narayanan, Diego Ongaro, Guru Parulkar, Mendel Rosenblum, Stephen M. Rumble, Eric Stratmann, and Ryan Stutsman. 2011. The case for RAMCloud. Commun. ACM 54, 7 (Jul. 2011), 121--130. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. 2014. Arrakis: The operating system is the control plane. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI’14). USENIX Association, Berkeley, CA, 1--16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Marius Poke and Torsten Hoefler. 2015. DARE: High-performance state machine replication on RDMA networks. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC’15). ACM, New York, NY, 107--118. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Marius Poke, Torsten Hoefler, and Colin Glass. 2017. AllConcur: Leaderless concurrent atomic broadcast. In Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing (HPDC’17). ACM, New York, NY, 18. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Roberto De Prisco, Butler W. Lampson, and Nancy A. Lynch. 1997. Revisiting the paxos algorithm. In Proceedings of the 11th International Workshop on Distributed Algorithms (WDAG’97). Springer-Verlag, London, 111--125. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Dan Pritchett. 2008. BASE: An ACID alternative. Queue 6, 3 (May 2008), 48--55. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Fred B. Schneider. 1990. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Comput. Surv. 22, 4 (Dec. 1990), 299--319. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Steve Shankland. 2008. Google’s Jeff Dean spotlights data center inner workings. C|Net Reviews (May 2008). Retrieved from https://www.cnet.com/news/google-spotlights-data-center-inner-workings/.Google ScholarGoogle Scholar
  50. Weijia Song, Theo Gkountouvas, Ken Birman, Qi Chen, and Zhen Xiao. 2016. The freeze-frame file system. In Proceedings of the 7th ACM Symposium on Cloud Computing (SoCC’16). ACM, New York, NY, 307--320. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Robbert Van Renesse and Deniz Altinbuken. 2015. Paxos made moderately complex. ACM Comput. Surv. 47, 3, Article 42 (Feb. 2015), 36 pages. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Robbert van Renesse and Fred B. Schneider. 2004. Chain replication for supporting high throughput and availability. In Proceedings of the 6th Conference on Symposium on Operating Systems Design 8 Implementation, Volume 6 (OSDI’04). USENIX Association, Berkeley, CA, USA, 7--7. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. T. von Eicken, A. Basu, V. Buch, and W. Vogels. 1995. U-Net: A user-level network interface for parallel and distributed computing. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP’95). ACM, New York, NY, 40--53. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Cheng Wang, Jianyu Jiang, Xusheng Chen, Ning Yi, and Heming Cui. 2017. APUS: Fast and scalable paxos on RDMA. In Proceedings of the 8th ACM Symposium on Cloud Computing (SoCC’17). ACM, Santa Clara, CA, 14. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. Michael Wei, Amy Tai, Christopher J. Rossbach, Ittai Abraham, Maithem Munshed, Medhavi Dhawan, Jim Stabile, Udi Wieder, Scott Fritchie, Steven Swanson, Michael J. Freedman, and Dahlia Malkhi. 2017. vCorfu: A cloud-scale object store on a shared log. In Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI’17). USENIX Association, 35--49. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. Xingda Wei, Jiaxin Shi, Yanzhe Chen, Rong Chen, and Haibo Chen. 2015. Fast In-memory transaction processing using RDMA and HTM. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP’15). ACM, New York, NY, 87--104. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Derecho: Fast State Machine Replication for Cloud Services

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM Transactions on Computer Systems
        ACM Transactions on Computer Systems  Volume 36, Issue 2
        May 2018
        112 pages
        ISSN:0734-2071
        EISSN:1557-7333
        DOI:10.1145/3323874
        Issue’s Table of Contents

        Copyright © 2019 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 2 April 2019
        • Accepted: 1 December 2018
        • Revised: 1 July 2018
        • Received: 1 September 2017
        Published in tocs Volume 36, Issue 2

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article
        • Research
        • Refereed

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      HTML Format

      View this article in HTML Format .

      View HTML Format