ABSTRACT
Despite the growing popularity of Solid State Disks (SSDs) in the datacenter, little is known about their reliability characteristics in the field. What little is known is mainly vendor supplied, and such information does little to explain how SSD failures manifest and affect production systems, or to guide appropriate remedial measures. Besides actual failure data and the symptoms SSDs exhibit before failing, a detailed characterization requires a wide set of data about the factors that influence SSD failures, from provisioning factors to operational ones. This paper presents an extensive SSD failure characterization based on a wide spectrum of data from over half a million SSDs, spanning multiple generations and spread across several datacenters that host a wide range of workloads, collected over nearly 3 years. By studying the effect of a diverse set of design, provisioning and operational factors on failures, and the symptoms of those failures, our work provides the first comprehensive analysis of the what, when and why characteristics of SSD failures in production datacenters.
SSD Failures in Datacenters: What, When and Why?
SIGMETRICS '16: Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science