Datacenter Scale Evaluation of the Impact of Temperature on Hard Disk Drive Failures

Published: 01 July 2013
Abstract

With the advent of cloud computing and online services, large enterprises rely heavily on their datacenters to serve end users. When failures increase, a large datacenter facility incurs higher maintenance costs in addition to service unavailability. Among server components, hard disk drives are known to contribute significantly to server failures; however, there is very little understanding of the major determinants of disk failures in datacenters. In this work, we focus on the interrelationship between temperature, workload, and hard disk drive failures in a large-scale datacenter. We present a dense storage case study from a population housing thousands of servers and tens of thousands of disk drives, hosting a large-scale online service at Microsoft. Specifically, we establish correlations between temperatures and failures observed at different location granularities: (a) inside drive locations in a server chassis, (b) across server locations in a rack, and (c) across multiple racks in a datacenter. We show that temperature correlates more strongly with drive failures than disk utilization does. We establish that variations in temperature are not significant in datacenters and have little impact on failures. We also explore workload impacts on temperature and disk failures and show that the impact of workload is not significant. We then experimentally evaluate knobs that control disk drive temperature, including workload and chassis design knobs. These experiments corroborate our findings from the real data study: workload knobs have minimal impact on temperature, whereas chassis knobs such as disk placement and fan speeds have a larger impact. Finally, we show the cost benefit of the proposed temperature optimizations that increase hard disk drive reliability.
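As a rough illustration of the kind of comparison the abstract describes, the sketch below computes a Pearson correlation between failure rate and two candidate factors, average temperature and average utilization, aggregated per location. Everything in it is an assumption made for illustration: the abstract does not state which correlation measure the study used, the names `pearson`, `avg_temp_c`, `avg_util_pct`, and `failure_rate_pct` are hypothetical, and the numbers are synthetic rather than taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalized (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Synthetic per-location aggregates (hypothetical values, NOT from the paper).
# Each index is one location, e.g. a drive slot, a server position, or a rack.
avg_temp_c = [30.0, 32.0, 34.0, 36.0, 38.0]
avg_util_pct = [40.0, 55.0, 35.0, 60.0, 45.0]
failure_rate_pct = [2.0, 2.5, 3.1, 3.4, 4.0]

r_temp = pearson(avg_temp_c, failure_rate_pct)
r_util = pearson(avg_util_pct, failure_rate_pct)

print(f"temperature vs. failures: r = {r_temp:.3f}")
print(f"utilization vs. failures: r = {r_util:.3f}")
```

The same function could be applied separately at each of the three granularities by changing what one list element represents. A correlation computed this way says nothing about causation, and other measures may suit real failure data better.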

          • Published in

            ACM Transactions on Storage, Volume 9, Issue 2
            July 2013
            89 pages
            ISSN:1553-3077
            EISSN:1553-3093
            DOI:10.1145/2491472
            Copyright © 2013 ACM

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 1 July 2013
            • Accepted: 1 October 2012
            • Revised: 1 September 2012
            • Received: 1 February 2012
            Qualifiers

            • research-article
            • Research
            • Refereed
