ABSTRACT
State machine replication (SMR) uses Paxos to enforce the same inputs for a program (e.g., Redis) replicated on a number of hosts, tolerating various types of failures. Unfortunately, traditional Paxos protocols incur prohibitive performance overhead on server programs due to their high consensus latency on TCP/IP. Worse, the consensus latency of extant Paxos protocols increases drastically when more concurrent client connections or hosts are added. This paper presents APUS, the first RDMA-based Paxos protocol that aims to be fast and scalable to client connections and hosts. APUS intercepts inbound socket calls of an unmodified server program, assigns a total order for all input requests, and uses fast RDMA primitives to replicate these requests concurrently.
We evaluated APUS on nine widely-used server programs (e.g., Redis and MySQL). APUS incurred a mean overhead of 4.3% in response time and 4.2% in throughput. We integrated APUS with an SMR system Calvin. Our Calvin-APUS integration was 8.2X faster than the extant Calvin-ZooKeeper integration. The consensus latency of APUS outperformed an RDMA-based consensus protocol by 4.9X. APUS source code and raw results are released on github.com/hku-systems/apus.
- 2001. An Introduction to the InfiniBand Architecture. http://buyya.com/superstorage/chap42.pdf. (2001).Google Scholar
- 2004. A tool for measuring memcached server performance. https://github.com/twitter/twemperf. (2004).Google Scholar
- 2004. clamscan - scan files and directories for viruses. http://linux.die.net/man/1/clamscan. (2004).Google Scholar
- 2004. SysBench: a system performance benchmark. http://sysbench.sourceforge.net. (2004).Google Scholar
- 2004. Yahoo! Cloud Serving Benchmark. https://github.com/brianfrankcooper/YCSB. (2004).Google Scholar
- 2011. Why the data center needs an operating system. https://cs.stanford.edu/~matei/papers/2011/hotcloud_datacenter_os.pdf. (2011).Google Scholar
- 2012. Data Plane Development Kit (DPDK). http://dpdk.org/. (2012).Google Scholar
- 2012. Mellanox Products: RDMA over Converged Ethernet (RoCE). http://www.mellanox.com/page/products_dyn?product_family=79. (2012).Google Scholar
- 2012. RDMA - iWARP. http://www.chelsio.com/nic/rdma-iwarp/. (2012).Google Scholar
- 2012. ZooKeeper. https://zookeeper.apache.org/. (2012).Google Scholar
- 2014. ab - Apache HTTP server benchmarking tool. http://httpd.apache.org/docs/2.2/programs/ab.html. (2014).Google Scholar
- 2017. MediaTomb - Free UPnP MediaServer. http://mediatomb.cc/. (2017).Google Scholar
- 2017. MySQL Database. http://www.mysql.com/. (2017).Google Scholar
- Gautam Altekar and Ion Stoica. 2009. ODR: output-deterministic replay for multicore debugging. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles (SOSP '09). 193--206. Google ScholarDigital Library
- Yair Amir, Claudiu Danilov, Danny Dolev, Jonathan Kirsch, John Lane, Cristina Nita-Rotaru, Josh Olsen, and David Zage. 2010. Steward: Scaling Byzantine fault-tolerant replication to wide area networks. IEEE Transactions on Dependable and Secure Computing 7, 1 (2010), 80--93. Google ScholarDigital Library
- Amittai Aviram, Shu-Chun Weng, Sen Hu, and Bryan Ford. 2010. Efficient System-Enforced Deterministic Parallelism. In Proceedings of the Ninth Symposium on Operating Systems Design and Implementation (OSDI '10).Google ScholarDigital Library
- Bharath Balasubramanian and Vijay K. Garg. 2014. Fault Tolerance in Distributed Systems Using Fused State Machines. Distrib. Comput. (2014).Google Scholar
- Diogo Behrens, Dmitrii Kuvaiskii, and Christof Fetzer. 2014. HardPaxos: Replication Hardened against Hardware Errors. In Reliable Distributed Systems (SRDS), 2014 IEEE 33rd International Symposium on.Google ScholarDigital Library
- Jonathan Behrens, Ken Birman, Sagar Jha, Matthew Milano, Edward Tremel, Eugene Bagdasaryan, Theo Gkountouvas, Weijia Song, and Robbert van Renesse. 2016. Derecho: Group Communication at the Speed of Light. (2016).Google Scholar
- Carlos Eduardo Bezerra, Fernando Pedone, and Robbert Van Renesse. 2014. Scalable State-Machine Replication. In Proceedings of the 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN '14). Google ScholarDigital Library
- Martin Biely, Zarko Milosevic, Nuno Santos, and Andre Schiper. 2012. S-Paxos: Offloading the Leader for High Throughput State Machine Replication. In Proceedings of the 2012 IEEE 31st Symposium on Reliable Distributed Systems (SRDS '12). Google ScholarDigital Library
- Yuriy Brun, George Edwards, Jae Young Bang, and Nenad Medvidovic. 2011. Smart Redundancy for Distributed Computation. In Proceedings of the 2011 31st International Conference on Distributed Computing Systems (ICDCS '11). Google ScholarDigital Library
- Mike Burrows. 2006. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the Seventh Symposium on Operating Systems Design and Implementation (OSDI '06). 335--350.Google ScholarDigital Library
- Miguel Castro and Barbara Liskov. 1999. Practical Byzantine Fault Tolerance. In Proceedings of the Third Symposium on Operating Systems Design and Implementation (OSDI '99).Google ScholarDigital Library
- Tushar D. Chandra, Robert Griesemer, and Joshua Redstone. 2007. Paxos Made Live: An Engineering Perspective. In Proceedings of the Twenty-sixth Annual ACM Symposium on Principles of Distributed Computing (PODC '07). Google ScholarDigital Library
- Clam AntiVirus 2017. http://www.clamav.net/. (2017).Google Scholar
- James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2012. Spanner: Google's Globally-distributed Database. In Proceedings of the 12th Symposium on Operating Systems Design and Implementation (OSDI '16).Google ScholarDigital Library
- criu 2015. CRIU. http://criu.org. (2015).Google Scholar
- Heming Cui, Rui Gu, Cheng Liu, and Junfeng Yang. 2015. Paxos Made Transparent. In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP '15). Google ScholarDigital Library
- Heming Cui, Jiri Simsa, Yi-Hong Lin, Hao Li, Ben Blum, Xinan Xu, Junfeng Yang, Garth A. Gibson, and Randal E. Bryant. 2013. Parrot: a Practical Runtime for Deterministic, Stable, and Reliable Threads. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP '13). Google ScholarDigital Library
- Brendan Cully, Geoffrey Lefebvre, Dutch Meyer, Mike Feeley, Norm Hutchinson, and Andrew Warfield. 2008. Remus: High availability via asynchronous virtual machine replication. In Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation. San Francisco, 161--174.Google ScholarDigital Library
- Huynh Tu Dang, Marco Canini, Fernando Pedone, and Robert Soulé. 2016. Paxos made switch-y. ACM SIGCOMM Computer Communication Review 46, 1 (2016), 18--24. Google ScholarDigital Library
- Huynh Tu Dang, Daniele Sciascia, Marco Canini, Fernando Pedone, and Robert Soulé. 2015. NetPaxos: Consensus at Network Speed. In Proceedings of the 1st ACM SIGCOMM Symposium on Software Defined Networking Research (SOSR '15). Google ScholarDigital Library
- Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, and Miguel Castro. 2014. FaRM: Fast Remote Memory. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation (NSDI'14).Google ScholarDigital Library
- Aleksandar Dragojević, Dushyanth Narayanan, Edmund B. Nightingale, Matthew Renzelmann, Alex Shamis, Anirudh Badam, and Miguel Castro. 2015. No Compromises: Distributed Transactions with Consistency, Availability, and Performance. In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP '15). Google ScholarDigital Library
- Message Passing Interface Forum. 2009. Open MPI: Open Source High Performance Computing. (Sept. 2009).Google Scholar
- Lisa Glendenning, Ivan Beschastnikh, Arvind Krishnamurthy, and Thomas Anderson. 2011. Scalable Consistency in Scatter. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP '11). Google ScholarDigital Library
- Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitu Padhye, and Marina Lipshteyn. 2016. RDMA over commodity ethernet at scale. In Proceedings of the 2016 conference on ACM SIGCOMM 2016 Conference. ACM, 202--215. Google ScholarDigital Library
- Huayang Guo, Ming Wu, Lidong Zhou, Gang Hu, Junfeng Yang, and Lintao Zhang. 2011. Practical Software Model Checking via Dynamic Interface Reduction. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP '11). 265--278. Google ScholarDigital Library
- Zhenyu Guo, Chuntao Hong, Mao Yang, Dong Zhou, Lidong Zhou, and Li Zhuang. 2014. Rex: Replication at the Speed of Multi-core. In Proceedings of the 2014 ACM European Conference on Computer Systems (EUROSYS '14). ACM, 11. Google ScholarDigital Library
- Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica. 2011. Mesos: A Platform for Fine-grained Resource Sharing in the Data Center. In Proceedings of the 8th USENIX conference on Networked Systems Design and Implementation (NSDI'11). USENIX Association, Berkeley, CA, USA.Google ScholarDigital Library
- Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. 2010. ZooKeeper: Wait-free Coordination for Internet-scale Systems. In Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference (USENIX-ATC'10).Google ScholarDigital Library
- Dang Huynh Tu, Bressana Pietro, Wang Han, Lee Ki Shu, Weatherspoon Hakim, Canini Marco, Pedone Fernando, and Soule Robert. 2016. Network Hardware-Accelerated Consensus. Technical Report. USI Technical Report Series in Informatics.Google Scholar
- Zsolt István, David Sidler, Gustavo Alonso, and Marko Vukolic. 2016. Consensus in a Box: Inexpensive Coordination in Hardware. In Proceedings of the 13th Usenix Conference on Networked Systems Design and Implementation (NSDI'16).Google ScholarDigital Library
- Jithin Jose, Hari Subramoni, Krishna Kandalla, Md. Wasi-ur Rahman, Hao Wang, Sundeep Narravula, and Dhabaleswar K. Panda. 2012. Scalable Memcached Design for InfiniBand Clusters Using Hybrid Transports. In Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (Ccgrid 2012) (CCGRID '12).Google Scholar
- Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2014. Using RDMA Efficiently for Key-value Services. (Aug. 2014).Google Scholar
- Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2016. FaSST: Fast, Scalable and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs. In Proceedings of the 12th Symposium on Operating Systems Design and Implementation (OSDI '16).Google Scholar
- Manos Kapritsos and Flavio P. Junqueira. 2010. Scalable Agreement: Toward Ordering As a Service. In Proceedings of the Sixth International Conference on Hot Topics in System Dependability (HotDep'10).Google ScholarDigital Library
- Manos Kapritsos, Yang Wang, Vivien Quema, Allen Clement, Lorenzo Alvisi, Mike Dahlin, et al. 2012. All about Eve: Execute-Verify Replication for Multi-Core Servers.. In Proceedings of the Tenth Symposium on Operating Systems Design and Implementation (OSDI '12), Vol. 12. 237--250.Google Scholar
- Baris Kasikci, Benjamin Schubert, Cristiano Pereira, Gilles Pokam, and George Candea. 2015. Failure Sketching: A Technique for Automated Root Cause Diagnosis of In-production Failures. In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP '15). Google ScholarDigital Library
- Ramakrishna Kotla, Lorenzo Alvisi, Mike Dahlin, Allen Clement, and Edmund Wong. 2007. Zyzzyva: Speculative Byzantine Fault Tolerance. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP '07). Google ScholarDigital Library
- Sriram Krishnan. 2010. Programming Windows Azure: Programming the Microsoft Cloud.Google Scholar
- Leslie Lamport. 1998. The part-time parliament. ACM Trans. Comput. Syst. 16, 2 (1998), 133--169. Google ScholarDigital Library
- Leslie Lamport. 2001. Paxos made simple. http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf. (2001).Google Scholar
- Jialin Li, Ellis Michael, Naveen Kr. Sharma, Adriana Szekeres, and Dan R. K. Ports. 2016. Fast Replication with NOPaxos: Replacing Consensus with Network Ordering. In Proceedings of the 12th Symposium on Operating Systems Design and Implementation (OSDI '16).Google Scholar
- Tongping Liu, Charlie Curtsinger, and Emery D. Berger. 2011. DTHREADS: efficient deterministic multithreading. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP '11). 327--336. Google ScholarDigital Library
- Yanhua Mao, Flavio Paiva Junqueira, and Keith Marzullo. 2008. Mencius: building efficient replicated state machines for WANs. In Proceedings of the 8th USENIX conference on Operating systems design and implementation, Vol. 8. 369--384.Google Scholar
- Parisa Jalili Marandi, Carlos Eduardo Bezerra, and Fernando Pedone. 2014. Rethinking State-Machine Replication for Parallelism. In Proceedings of the 2014 IEEE 34th International Conference on Distributed Computing Systems (ICDCS '14). Google ScholarDigital Library
- Rolando Martins, Rajeev Gandhi, Priya Narasimhan, Soila Pertet, António Casimiro, Diego Kreutz, and Paulo Veríssimo. 2013. Experiences with fault-injection in a Byzantine fault-tolerant protocol. In ACM/IFIP/USENIX International Conference on Distributed Systems Platforms and Open Distributed Processing. Springer, 41--61. Google ScholarCross Ref
- David Mazieres. 2007. Paxos made practical. Technical Report. Technical report, 2007. http://www.scs.stanford.edu/dm/home/papers.Google Scholar
- Hein Meling, Keith Marzullo, and Alessandro Mei. 2012. When You Don'T Trust Clients: Byzantine Proposer Fast Paxos. In Proceedings of the 2012 IEEE 32Nd International Conference on Distributed Computing Systems (ICDCS '12).Google ScholarDigital Library
- Memcached 2017. https://memcached.org/. (2017).Google Scholar
- Ellis Michael. 2015. Scaling Leader-Based Protocols for State Machine Replication. Ph.D. Dissertation. University of Texas at Austin.Google Scholar
- Christopher Mitchell, Yifeng Geng, and Jinyang Li. 2013. Using One-sided RDMA Reads to Build a Fast, CPU-efficient Key-value Store. In Proceedings of the USENIX Annual Technical Conference (USENIX '14).Google ScholarDigital Library
- mongodb 2017. MongoDB. http://www.mongodb.org. (2017).Google Scholar
- Iulian Moraru, David G. Andersen, and Michael Kaminsky. 2013. There is More Consensus in Egalitarian Parliaments. In Proceedings of the 13th ACM Symposium on Operating Systems Principles (SOSP '91). Google ScholarDigital Library
- Nginx 2012. Nginx Web Server. https://nginx.org/. (2012).Google Scholar
- Diego Ongaro and John Ousterhout. 2014. In Search of an Understandable Consensus Algorithm. In Proceedings of the USENIX Annual Technical Conference (USENIX '14).Google ScholarDigital Library
- Diego Ongaro, Stephen M. Rumble, Ryan Stutsman, John Ousterhout, and Mendel Rosenblum. 2011. Fast Crash Recovery in RAMCloud. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP '11). Google ScholarDigital Library
- OpenLDAP 2017. OpenLDAP. (2017). https://www.openldap.org/Google Scholar
- Soyeon Park, Yuanyuan Zhou, Weiwei Xiong, Zuoning Yin, Rini Kaushik, Kyu H. Lee, and Shan Lu. 2009. PRES: probabilistic replay with execution sketching on multiprocessors. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles (SOSP '09). 177--192. Google ScholarDigital Library
- Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. 2014. Arrakis: The Operating System is the Control Plane. In Proceedings of the Eleventh Symposium on Operating Systems Design and Implementation (OSDI '14).Google ScholarDigital Library
- Marius Poke and Torsten Hoefler. 2015. DARE: High-Performance State Machine Replication on RDMA Networks. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '15). Google ScholarDigital Library
- Dan R. K. Ports, Jialin Li, Vincent Liu, Naveen Kr. Sharma, and Arvind Krishnamurthy. 2015. Designing Distributed Systems Using Approximate Synchrony in Data Center Networks. In Proceedings of the 12th USENIX Conference on Networked Systems Design and Implementation (NSDI'15).Google ScholarDigital Library
- Marco Primi. 2016. LibPaxos. http://libpaxos.sourceforge.net/. (2016).Google Scholar
- Redis 2017. http://redis.io/. (2017).Google Scholar
- SSDB 2017. ssdb.io/. (2017).Google Scholar
- Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. 2014. Fast Distributed Transactions and Strongly Consistent Replication for OLTP Database Systems. (May 2014).Google Scholar
- Robbert Van Renesse and Deniz Altinbuken. 2015. Paxos Made Moderately Complex. ACM Computing Surveys (CSUR) 47, 3 (2015), 42:1--42:36.Google Scholar
- Xingda Wei, Jiaxin Shi, Yanzhe Chen, Rong Chen, and Haibo Chen. 2015. Fast In-memory Transaction Processing Using RDMA and HTM. In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP '15) (SOSP '15).Google ScholarDigital Library
- Garth Gibson Wittawat Tantisiriroj. 2008. Network File System (NFS) in High Performance Networks. Technical Report CMU-PDLSVD08-02. Carnegie Mellon University.Google Scholar
- Junfeng Yang, Tisheng Chen, Ming Wu, Zhilei Xu, Xuezheng Liu, Haoxiang Lin, Mao Yang, Fan Long, Lintao Zhang, and Lidong Zhou. 2009. MODIST: Transparent Model Checking of Unmodified Distributed Systems. In Proceedings of the Sixth Symposium on Networked Systems Design and Implementation (NSDI '09). 213--228.Google ScholarDigital Library
- Matei Zaharia, Benjamin Hindman, Andy Konwinski, Ali Ghodsi, Anthony D. Joesph, Randy Katz, Scott Shenker, and Ion Stoica. 2011. The Datacenter Needs an Operating System. In Proceedings of the 3rd USENIX Conference on Hot Topics in Cloud Computing.Google ScholarDigital Library
- Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. 2015. Congestion control for large-scale RDMA deployments. In ACM SIGCOMM Computer Communication Review, Vol. 45. ACM, 523--536. Google ScholarDigital Library
Index Terms
- APUS: fast and scalable paxos on RDMA
Recommendations
Towards highly-concurrent leaderless state machine replication for distributed systems
AbstractState Machine Replication (SMR) is a fault-tolerant service implementation technique used by many modern Internet services. A single leader is used in classic SMR to order all state machine commands. Due to the scalability and ...
Fast In-Memory Transaction Processing Using RDMA and HTM
DrTM is a fast in-memory transaction processing system that exploits advanced hardware features such as remote direct memory access (RDMA) and hardware transactional memory (HTM). To achieve high efficiency, it mostly offloads concurrency control such ...
A server bypass architecture for hopscotch hashing key–value store on DRAM-NVM memories
AbstractNon-volatile memories (NVMs), along with DRAMs, provide key–value (KV) stores with strong support in persisting data and storing it in memories. The remote direct memory access (RDMA) technology has been employed to boost remote data ...
Comments