ABSTRACT
PCIe-based Flash is commonly deployed to provide datacenter applications with high IO rates. However, its capacity and bandwidth are often underutilized because it is difficult to design servers with the right balance of CPU, memory, and Flash resources over time and across multiple applications. This work examines Flash disaggregation as a way to deal with Flash overprovisioning. We tune remote access to Flash over commodity networks and analyze its impact on workloads sampled from real datacenter applications. We show that, while remote Flash access introduces a 20% throughput drop at the application level, disaggregation allows us to make up for these overheads through resource-efficient scale-out. Flash disaggregation therefore allows CPU and Flash resources to be scaled independently in a cost-effective manner. We use our analysis to draw conclusions about data and control plane issues in remote storage.
Index Terms
- Flash storage disaggregation