Abstract
An in-memory key/value store (KV-store) is a key building block for many systems, such as databases and large websites. Two key requirements for such systems are efficiency and availability, which demand that a KV-store continuously handle millions of requests per second. A common approach to availability is replication, such as primary-backup replication (PBR), which, however, requires M+1 times the memory to tolerate M failures. This leaves scarce memory unavailable for useful user jobs.
This article makes the first case for building a highly available in-memory KV-store that integrates erasure coding to achieve memory efficiency without notably degrading performance. A main challenge is that an in-memory KV-store contains a large amount of scattered metadata, so a single KV put may trigger excessive coding operations and parity updates due to many small metadata updates. Our approach, namely Cocytus, addresses this challenge with a hybrid scheme that uses PBR for small-sized and scattered data (e.g., metadata and keys) while applying erasure coding only to relatively large data (e.g., values). To mitigate well-known issues such as the lengthy recovery of erasure coding, Cocytus uses an online recovery scheme that leverages the replicated metadata to continuously serve KV requests during recovery. To further demonstrate the usefulness of Cocytus, we have built a transaction layer that uses Cocytus as a fast and reliable storage layer for database records and transaction logs. We have integrated the design of Cocytus into Memcached and extended it to support in-memory transactions. Evaluation using YCSB with different KV configurations shows that Cocytus incurs low overhead in latency and throughput and tolerates node failures with fast online recovery, while saving 33% to 46% of memory compared to PBR when tolerating two failures. A further evaluation using the SmallBank OLTP benchmark shows that in-memory transactions can run atop Cocytus with high throughput, low latency, and a low abort rate, and can recover quickly from consecutive failures.
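As a rough illustration of the hybrid scheme described above, the following Python sketch replicates the small, scattered data (key and length metadata) to every node via PBR while erasure-coding only the value. All names are hypothetical, and for brevity a single XOR parity chunk, a (k+1, k) code tolerating one failure, stands in for the Reed-Solomon codes needed to tolerate multiple failures.

```python
# Hypothetical sketch of a hybrid PBR + erasure-coding KV put and recovery.
# A single XOR parity chunk (tolerates one failure) stands in for a real
# multi-failure-tolerant code such as Reed-Solomon.

K = 3  # number of data chunks each value is split into

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(value: bytes):
    """Split a value into K equal-sized chunks plus one XOR parity chunk."""
    chunk_len = -(-len(value) // K)             # ceiling division
    padded = value.ljust(K * chunk_len, b"\0")  # pad to a chunk boundary
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(K)]
    parity = chunks[0]
    for c in chunks[1:]:
        parity = xor_bytes(parity, c)
    return chunks, parity

def put(store_nodes, key: str, value: bytes):
    """Replicate metadata everywhere (PBR); erasure-code only the value."""
    chunks, parity = encode(value)
    meta = {"key": key, "len": len(value)}
    for node, chunk in zip(store_nodes, chunks + [parity]):
        node["meta"][key] = meta    # small, scattered data: replicated
        node["chunk"][key] = chunk  # large data: one coded chunk per node

def recover_chunk(store_nodes, key: str, lost: int) -> bytes:
    """Rebuild the chunk of failed node `lost` by XORing the survivors."""
    survivors = [n["chunk"][key]
                 for i, n in enumerate(store_nodes) if i != lost]
    rebuilt = survivors[0]
    for c in survivors[1:]:
        rebuilt = xor_bytes(rebuilt, c)
    return rebuilt
```

With a Reed-Solomon RS(k, m) code the value data costs only (k+m)/k times its size, e.g., 1.5x for RS(4, 2) versus 3x for PBR when tolerating two failures; the metadata that remains fully replicated accounts for the gap between that ideal and the 33% to 46% overall savings reported in the abstract.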