ABSTRACT
Storage systems in data centers are an important component of large-scale online services. They typically perform replicated transactional operations for high data availability and integrity. Today, however, such operations suffer from high tail latency even with recent kernel-bypass and storage optimizations, which hurts the predictability of these services' end-to-end performance. We observe that the root cause of the problem is the involvement of the CPU, a precious commodity in multi-tenant settings, in the critical path of replicated transactions. In this paper, we present HyperLoop, a new framework that removes the CPU from the critical path of replicated transactions in storage systems by offloading them to commodity RDMA NICs, with non-volatile memory as the storage medium. To achieve this, we develop new and general NIC offloading primitives that can perform memory operations on all nodes in a replication group while guaranteeing ACID properties without CPU involvement. We demonstrate that popular storage applications can be easily optimized using our primitives. Our evaluation with microbenchmarks and application benchmarks shows that HyperLoop can reduce 99th-percentile latency by roughly 800× with close to 0% CPU consumption on replicas.
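The tail-latency argument can be illustrated with a toy model. The sketch below is not HyperLoop's implementation and does not reproduce its measurements: the hop cost, stall probability, stall duration, and replica count are all assumed values, chosen only to show how an occasional CPU scheduling delay on any replica in the critical path inflates the 99th percentile of a replicated operation.

```python
import math
import random

# All constants are illustrative assumptions, not figures from the paper.
NETWORK_HOP_US = 5.0     # assumed cost of forwarding to one replica
CPU_STALL_PROB = 0.05    # assumed chance a replica's thread is descheduled
CPU_STALL_US = 2000.0    # assumed delay when that happens


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(len(ordered) * pct / 100)
    return ordered[max(rank, 1) - 1]


def transaction_latency(rng, replicas, cpu_on_path):
    """Latency of one operation that must visit every replica in turn."""
    total = 0.0
    for _ in range(replicas):
        total += NETWORK_HOP_US
        # A replica CPU in the critical path may add a scheduling stall.
        if cpu_on_path and rng.random() < CPU_STALL_PROB:
            total += CPU_STALL_US
    return total


def simulate(replicas=3, trials=10000, cpu_on_path=True, seed=0):
    """Return per-operation latencies for a hypothetical replication group."""
    rng = random.Random(seed)
    return [transaction_latency(rng, replicas, cpu_on_path)
            for _ in range(trials)]


cpu_p99 = percentile(simulate(cpu_on_path=True), 99)
nic_p99 = percentile(simulate(cpu_on_path=False), 99)
print(f"p99 with replica CPUs on the path: {cpu_p99:.1f} us")
print(f"p99 with replica CPUs off the path: {nic_p99:.1f} us")
```

Because every replica must be visited, the chance that at least one stall hits an operation grows with the group size, so even a rare per-replica stall dominates the tail in this model.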
Index Terms
- Hyperloop: group-based NIC-offloading to accelerate replicated transactions in multi-tenant storage systems