skip to main content
research-article

Efficient and Available In-Memory KV-Store with Hybrid Erasure Coding and Replication

Authors Info & Claims
Published:18 September 2017Publication History
Skip Abstract Section

Abstract

In-memory key/value store (KV-store) is a key building block for many systems like databases and large websites. Two key requirements for such systems are efficiency and availability, which demand a KV-store to continuously handle millions of requests per second. A common approach to availability is using replication, such as primary-backup (PBR), which, however, requires M+1 times memory to tolerate M failures. This renders scarce memory unable to handle useful user jobs.

This article makes the first case of building highly available in-memory KV-store by integrating erasure coding to achieve memory efficiency, while not notably degrading performance. A main challenge is that an in-memory KV-store has much scattered metadata. A single KV put may cause excessive coding operations and parity updates due to excessive small updates to metadata. Our approach, namely Cocytus, addresses this challenge by using a hybrid scheme that leverages PBR for small-sized and scattered data (e.g., metadata and key), while only applying erasure coding to relatively large data (e.g., value). To mitigate well-known issues like lengthy recovery of erasure coding, Cocytus uses an online recovery scheme by leveraging the replicated metadata information to continuously serve KV requests. To further demonstrate the usefulness of Cocytus, we have built a transaction layer by using Cocytus as a fast and reliable storage layer to store database records and transaction logs. We have integrated the design of Cocytus to Memcached and extend it to support in-memory transactions. Evaluation using YCSB with different KV configurations shows that Cocytus incurs low overhead for latency and throughput, can tolerate node failures with fast online recovery, while saving 33% to 46% memory compared to PBR when tolerating two failures. A further evaluation using the SmallBank OLTP benchmark shows that in-memory transactions can run atop Cocytus with high throughput, low latency, and low abort rate and recover fast from consecutive failures.

References

  1. Mohammad Alomari, Michael Cahill, Alan Fekete, and Uwe Rohm. 2008. The cost of serializability on platforms that use snapshot isolation. In Proceedings of the IEEE 24th International Conference on Data Engineering (ICDE’08). IEEE, 576--585. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Berk Atikoglu, Yuehai Xu, Eitan Frachtenberg, Song Jiang, and Mike Paleczny. 2012. Workload analysis of a large-scale key-value store. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’12). ACM, 53--64. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. William J. Bolosky, Dexter Bradshaw, Randolph B. Haagens, Norbert P. Kusters, and Peng Li. 2011. Paxos replicated state machines as the basis of a high-performance data store. In Proceedings of the Conference on Network Systems Design and Implementation (NSDI’11).Google ScholarGoogle Scholar
  4. Thomas C. Bressoud and Fred B. Schneider. 1996. Hypervisor-based fault tolerance. ACM Trans. Comput. Syst. (TOCS) 14, 1 (1996), 80--107. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Yingyi Bu, Vinayak Borkar, Guoqing Xu, and Michael J. Carey. 2013. A bloat-aware design for big data applications. In Proceedings of the ACM SIGPLAN International Symposium on Memory Management. ACM, 119--130. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Navin Budhiraja, Keith Marzullo, Fred B. Schneider, and Sam Toueg. 1993. The primary-backup approach. Distrib. Syst. 2 (1993), 199--216.Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Michael J. Cahill, Uwe Röhm, and Alan D. Fekete. 2009. Serializable isolation for snapshot databases. ACM Trans. Database Syst. (TODS) 34, 4 (2009), 20.Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. K. Mani Chandy and Leslie Lamport. 1985. Distributed snapshots: Determining global states of distributed systems. ACM Trans. Comput. Syst. (TOCS) 3, 1 (1985), 63--75. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Shimin Chen, Anastasia Ailamaki, Manos Athanassoulis, Phillip B. Gibbons, Ryan Johnson, Ippokratis Pandis, and Radu Stoica. 2011. TPC-E vs. TPC-C: Characterizing the new TPC-E benchmark via an I/O comparison study. ACM SIGMOD Rec. 39, 3 (2011), 5--10. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Yanzhe Chen, Xingda Wei, Jiaxin Shi, Rong Chen, and Haibo Chen. 2016. Fast and general distributed transactions using rdma and htm. In Proceedings of the 11th European Conference on Computer Systems. ACM, 26. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Allen Clement, Manos Kapritsos, Sangmin Lee, Yang Wang, Lorenzo Alvisi, Mike Dahlin, and Taylor Riche. 2009. Upright cluster services. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles. ACM, 277--290. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. Grupp, Rajesh K. Gupta, Ranjit Jhala, and Steven Swanson. 2011. NV-Heaps: Making persistent objects fast and safe with next-generation, non-volatile memories. In Proceedings of the ACM Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 105--118. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing. ACM, 143--154. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2012. Spanner: Google’s globally distributed database. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI’12). USENIX Association, 251--264.Google ScholarGoogle Scholar
  15. James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, Jeffrey John Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild et al. 2013. Spanner: Googles globally distributed database. ACM Trans. Comput. Syst. (TOCS) 31, 3 (2013), 8.Google ScholarGoogle ScholarCross RefCross Ref
  16. Cristian Diaconu, Craig Freedman, Erik Ismert, Per-Ake Larson, Pravin Mittal, Ryan Stonecipher, Nitin Verma, and Mike Zwilling. 2013. Hekaton: SQL server’s memory-optimized OLTP engine. In Proceedings of the 2013 International Conference on Management of Data. ACM, 1243--1254. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Djellel Eddine Difallah, Andrew Pavlo, Carlo Curino, and Philippe Cudre-Mauroux. 2013. Oltp-bench: An extensible testbed for benchmarking relational databases. Proc. VLDB Endow. 7, 4 (2013), 277--288.Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Aleksandar Dragojević, Dushyanth Narayanan, Edmund B. Nightingale, Matthew Renzelmann, Alex Shamis, Anirudh Badam, and Miguel Castro. 2015. No compromises: Distributed transactions with consistency, availability, and performance. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP’15). ACM, New York, NY, 54--70. DOI:http://dx.doi.org/10.1145/2815400.2815425 Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Leo Egghe. 2005. Zipfian and lotkaian continuous concentration theory. J. Amer. Soc. Info. Sci. Technol. 56, 9 (2005), 935--945. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Bin Fan, David G. Andersen, and Michael Kaminsky. 2013. MemC3: Compact and concurrent memcache with dumber caching and smarter hashing. In Proceedings of the Conference on Network Systems Design and Implementation (NSDI’13), Vol. 13. 385--398.Google ScholarGoogle Scholar
  21. Franz Färber, Norman May, Wolfgang Lehner, Philipp Große, Ingo Müller, Hannes Rauhe, and Jonathan Dees. 2012. The SAP HANA database--An architecture overview. IEEE Data Eng. Bull. 35, 1 (2012), 28--33.Google ScholarGoogle Scholar
  22. Brad Fitzpatrick. 2004. Distributed caching with memcached. Linux J. 2004, 124 (2004), 5.Google ScholarGoogle Scholar
  23. Aakash Goel, Bhuwan Chopra, Ciprian Gerea, Dhrúv Mátáni, Josh Metzler, Fahim Ul Haq, and Janet Wiener. 2014. Fast database restarts at facebook. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 541--549.Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, Sergey Yekhanin, and others. 2012. Erasure coding in windows azure storage. In Proceedings of the USENIX Annual Technical Conference. 15--26.Google ScholarGoogle Scholar
  25. Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2014. Using rdma efficiently for key-value services. In Proceedings of the 2014 ACM Conference of the Special Interest Group on Data Communications (SIGCOMM’14). ACM, 295--306. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. KLab Inc. 2011. Homepage. Retrieved from http://repcached.lab.klab.org.Google ScholarGoogle Scholar
  27. Tirthankar Lahiri, Marie-Anne Neimat, and Steve Folkman. 2013. Oracle timesten: An in-memory database for enterprise applications. IEEE Data Eng. Bull. 36, 2 (2013), 6--13.Google ScholarGoogle Scholar
  28. Chunbo Lai, Song Jiang, Liqiong Yang, Shiding Lin, Guangyu Sun, Zhenyu Hou, Can Cui, and Jason Cong. 2015. Atlas: Baidu’s key-value storage system for cloud data. In Proceedings of the 2015 31st Symposium on Mass Storage Systems and Technologies (MSST’15). IEEE, 1--14.Google ScholarGoogle ScholarCross RefCross Ref
  29. Leslie Lamport. 2001. Paxos made simple. ACM Sigact News 32, 4 (2001), 18--25.Google ScholarGoogle Scholar
  30. Xiaozhou Li, David G. Andersen, Michael Kaminsky, and Michael J. Freedman. 2014. Algorithmic improvements for fast concurrent cuckoo hashing. In Proceedings of the 9th European Conference on Computer Systems. ACM, 27. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Ran Liu, Heng Zhang, and Haibo Chen. 2014. Scalable read-mostly synchronization using passive reader-writer locks. In Proceedings of the 2014 USENIX Annual Technical Conference, USENIX ATC, Vol. 14. 219--230.Google ScholarGoogle Scholar
  32. Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2016. WiscKey: Separating keys from values in SSD-conscious storage. In Proceedings of the Conference on File and Storage Technologies (FAST’16). 133--148.Google ScholarGoogle Scholar
  33. Christopher Mitchell, Yifeng Geng, and Jinyang Li. 2013. Using one-sided RDMA reads to build a fast, CPU-efficient key-value store. In Proceedings of the USENIX Annual Technical Conference. 103--114.Google ScholarGoogle Scholar
  34. Subramanian Muralidhar, Wyatt Lloyd, Sabyasachi Roy, Cory Hill, Ernest Lin, Weiwen Liu, Satadru Pan, Shiva Shankar, Viswanath Sivakumar, Linpeng Tang et al. 2014. F4: Facebooks warm BLOB storage system. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation. USENIX Association, 383--398.Google ScholarGoogle Scholar
  35. Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab et al. 2013. Scaling memcache at facebook. In Proceedings of the Conference on Network Systems Design and Implementation (NSDI’13). 385--398.Google ScholarGoogle Scholar
  36. Diego Ongaro, Stephen M. Rumble, Ryan Stutsman, John Ousterhout, and Mendel Rosenblum. 2011. Fast crash recovery in RAMCloud. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles. ACM, 29--41. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. J. S. Plank, E. L. Miller, and W. B. Houston. 2013. GF-Complete: A Comprehensive Open Source Library for Galois Field Arithmetic. Technical Report UT-CS-13-703. University of Tennessee.Google ScholarGoogle Scholar
  38. J. S. Plank, S. Simmerman, and C. D. Schuman. 2008. Jerasure: A Library in C/C++ Facilitating Erasure Coding for Storage Applications—Version 1.2. Technical Report CS-08-627. University of Tennessee.Google ScholarGoogle Scholar
  39. K. V. Rashmi, Preetum Nakkiran, Jingyan Wang, Nihar B. Shah, and Kannan Ramchandran. 2015. Having your cake and eating it too: Jointly optimal erasure codes for I/O, storage, and network-bandwidth. In Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST’15). USENIX Association, 81--94.Google ScholarGoogle Scholar
  40. K. V. Rashmi, Nihar B. Shah, Dikang Gu, Hairong Kuang, Dhruba Borthakur, and Kannan Ramchandran. 2013. A solution to the network challenges of data recovery in erasure-coded distributed storage systems: A study on the Facebook warehouse cluster. Proceedings of the Conference on USENIX HotStorage (2013).Google ScholarGoogle Scholar
  41. K. V. Rashmi, Mosharaf Chowdhury, Jack Kosaian, Ion Stoica, and Kannan Ramchandran. 2016. EC-Cache: Load-balanced, low-latency cluster caching with online erasure coding. In Proceedings of the Conference on Operating Systems Design and Implementation (OSDI’16).Google ScholarGoogle Scholar
  42. Irving S. Reed and Gustave Solomon. 1960. Polynomial codes over certain finite fields. J. Soc. Industr. Appl. Math. 8, 2 (1960), 300--304. Google ScholarGoogle ScholarCross RefCross Ref
  43. Luigi Rizzo. 1997. Effective erasure codes for reliable computer communication protocols. ACM SIGCOMM Comput. Commun. Rev. 27, 2 (1997), 24--36. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Maheswaran Sathiamoorthy, Megasthenis Asteris, Dimitris Papailiopoulos, Alexandros G. Dimakis, Ramkumar Vadali, Scott Chen, and Dhruba Borthakur. 2013. Xoring elephants: Novel erasure codes for big data. In Proceedings of the Very Large Data Base Endowment. VLDB Endowment, 325--336.Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Nihar B. Shah, K. V. Rashmi, P. Vijay Kumar, and Kannan Ramchandran. 2012. Distributed storage codes with repair-by-transfer and nonachievability of interior points on the storage-bandwidth tradeoff. IEEE Trans. Info. Theory 58, 3 (2012), 1837--1852. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Mark Silberstein, Lakshmi Ganesh, Yang Wang, Lorenzo Alvisi, and Mike Dahlin. 2014. Lazy means smart: Reducing repair bandwidth costs in erasure-coded distributed storage. In Proceedings of International Conference on Systems and Storage. ACM, 1--7. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. SNIA. 2015. NVDIMM Special Interest Group. Retrieved from http://www.snia.org/forums/sssi/NVDIMM (2015).Google ScholarGoogle Scholar
  48. Patrick Stuedi, Animesh Trivedi, and Bernard Metzler. 2012. Wimpy nodes with 10GbE: Leveraging one-sided operations in Soft-RDMA to boost memcached. In Proceedings of the USENIX Annual Technical Conference. 347--353.Google ScholarGoogle Scholar
  49. Viking Technology. 2014. ArxCis-NV (TM): Non-Volatile DIMM. Retrieved from http://www.vikingtechnology.com/arxcis-nv.Google ScholarGoogle Scholar
  50. Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. 2012. Calvin: Fast distributed transactions for partitioned database systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD’12). ACM, 1--12. DOI:http://dx.doi.org/10.1145/2213836.2213838 Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Twitter Inc. 2012. Twemcache is the Twitter Memcached. Retrieved from https://github.com/twitter/twemcache (2012).Google ScholarGoogle Scholar
  52. Robbert van Renesse and Fred B. Schneider. 2004. Chain replication for supporting high throughput and availability. In Proceedings of the Conference on Operating Systems Design and Implementation (OSDI’04), Vol. 4. 91--104.Google ScholarGoogle Scholar
  53. Shivaram Venkataraman, Niraj Tolia, Parthasarathy Ranganathan, Roy H. Campbell, and others. 2011. Consistent and durable data structures for non-volatile byte-addressable memory. In Proceedings of the Conference on File and Storage Technologies (FAST’11). 61--75.Google ScholarGoogle Scholar
  54. Peng Wang, Kaiyuan Zhang, Rong Chen, Haibo Chen, and Haibing Guan. 2014b. Replication-based fault-tolerance for large-scale graph processing. In Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’14). IEEE, 562--573. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. Yang Wang, Lorenzo Alvisi, and Mike Dahlin. 2012. Gnothi: Separating data and metadata for efficient and available storage replication. In Proceedings of the USENIX Annual Technical Conference. 413--424.Google ScholarGoogle Scholar
  56. Zhaoguo Wang, Hao Qian, Jinyang Li, and Haibo Chen. 2014a. Using restricted transactional memory to build a scalable in-memory database. In Proceedings of the 9th European Conference on Computer Systems. ACM, 26. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. Xingda Wei, Jiaxin Shi, Yanzhe Chen, Rong Chen, and Haibo Chen. 2015. Fast in-memory transaction processing using RDMA and HTM. In Proceedings of the 25th Symposium on Operating Systems Principles. ACM, 87--104. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. Brent Welch, Marc Unangst, Zainul Abbasi, Garth A Gibson, Brian Mueller, Jason Small, Jim Zelenka, and Bin Zhou. 2008. Scalable performance of the panasas parallel file system. In Proceedings of the Conference on File and Storage Technologies (FAST’08), Vol. 8. 1--17.Google ScholarGoogle Scholar
  59. Jun Yang, Qingsong Wei, Cheng Chen, Chundong Wang, Khai Leong Yong, and Bingsheng He. 2015. NV-Tree: Reducing consistency cost for NVM-based single level systems. In Proceedings of the 13th USENIX Conference on File and Storage Technologies. USENIX Association, 167--181.Google ScholarGoogle Scholar
  60. Jian Yin, Jean-Philippe Martin, Arun Venkataramani, Lorenzo Alvisi, and Mike Dahlin. 2003. Separating agreement from execution for byzantine fault tolerant services. In Proceedings of the Symposium on Operating Systems Principles (SOSP’03). ACM, 253--267. Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. Jeremy Zawodny. 2009. Redis: Lightweight key/value store that goes the extra mile. Linux Mag. 79 (2009).Google ScholarGoogle Scholar
  62. Heng Zhang, Mingkai Dong, and Haibo Chen. 2016. Efficient and available in-memory KV-store with hybrid erasure coding and replication. In Proceedings of the USENIX Conference on File and Storage Technologies. 167--180.Google ScholarGoogle Scholar
  63. Yiying Zhang, Jian Yang, Amirsaman Memaripour, and Steven Swanson. 2015. Mojim: A reliable and highly-available non-volatile memory system. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 3--18. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Efficient and Available In-Memory KV-Store with Hybrid Erasure Coding and Replication

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM Transactions on Storage
        ACM Transactions on Storage  Volume 13, Issue 3
        Special Issue on FAST 2017 and Regular Papers
        August 2017
        265 pages
        ISSN:1553-3077
        EISSN:1553-3093
        DOI:10.1145/3141876
        • Editor:
        • Sam H. Noh
        Issue’s Table of Contents

        Copyright © 2017 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 18 September 2017
        • Revised: 1 March 2017
        • Accepted: 1 March 2017
        • Received: 1 September 2016
        Published in tos Volume 13, Issue 3

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article
        • Research
        • Refereed

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader