DOI: 10.1145/2928275.2928278
research-article
Public Access
Best Student Paper

SSD Failures in Datacenters: What? When? and Why?

Published: 06 June 2016

ABSTRACT

Despite the growing popularity of Solid State Disks (SSDs) in the datacenter, little is known about their reliability characteristics in the field. What little is known is mainly vendor supplied, and such information does little to explain how SSD failures manifest and affect the operation of production systems, so that appropriate remedial measures can be taken. Besides actual failure data and the symptoms exhibited by SSDs before failing, a detailed characterization effort requires a wide set of data about the factors influencing SSD failures, from provisioning factors to operational ones. This paper presents an extensive SSD failure characterization by analyzing a wide spectrum of data from over half a million SSDs that span multiple generations and are spread across several datacenters hosting a wide spectrum of workloads over nearly 3 years. By studying the effect of a diverse set of design, provisioning, and operational factors on failures, together with the symptoms of those failures, our work provides the first comprehensive analysis of the what, when, and why characteristics of SSD failures in production datacenters.

        Published in

        SYSTOR '16: Proceedings of the 9th ACM International on Systems and Storage Conference
        June 2016
        191 pages
        ISBN: 9781450343817
        DOI: 10.1145/2928275

        Copyright © 2016 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • research-article
        • Research
        • Refereed limited

        Acceptance Rates

        SYSTOR '16 Paper Acceptance Rate: 16 of 49 submissions, 33%. Overall Acceptance Rate: 94 of 285 submissions, 33%.
