Datacenter Scale Evaluation of the Impact of Temperature on Hard Disk Drive Failures

Authors: Sriram Sankar (Microsoft Corporation), Mark Shaw (Microsoft Corporation), Kushagra Vaid (Microsoft Corporation), Sudhanva Gurumurthi (University of Virginia)

ACM Transactions on Storage, Volume 9, Issue 2, Article No. 6, pp. 1–24. https://doi.org/10.1145/2491472.2491475

Published: 01 July 2013

Abstract

With the advent of cloud computing and online services, large enterprises rely heavily on their datacenters to serve end users. A large datacenter facility incurs increased maintenance costs in addition to service unavailability when there are increased failures. Among different server components, hard disk drives are known to contribute significantly to server failures; however, there is very little understanding of the major determinants of disk failures in datacenters. In this work, we focus on the interrelationship between temperature, workload, and hard disk drive failures in a large-scale datacenter. We present a dense storage case study from a population housing thousands of servers and tens of thousands of disk drives, hosting a large-scale online service at Microsoft. We specifically establish correlations between temperatures and failures observed at different location granularities: (a) inside drive locations in a server chassis, (b) across server locations in a rack, and (c) across multiple racks in a datacenter. We show that temperature exhibits a stronger correlation with failures than disk utilization does. We establish that variations in temperature are not significant in datacenters and have little impact on failures. We also explore workload impacts on temperature and disk failures and show that the impact of workload is not significant. We then experimentally evaluate knobs that control disk drive temperature, including workload and chassis design knobs. We corroborate our findings from the real data study and show that workload knobs have minimal impact on temperature, whereas chassis knobs such as disk placement and fan speeds have a larger impact. Finally, we show the proposed cost benefit of temperature optimizations that increase hard disk drive reliability.
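
As a rough illustration of the comparison described in the abstract, the sketch below computes a correlation coefficient between a per-location metric (average temperature or disk utilization) and a per-location failure rate. The abstract does not say which statistic the authors use, so the Pearson coefficient here is an assumption, and the function name `pearson`, the variable names, and all numeric values are hypothetical placeholders rather than data from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder values only (NOT measurements from the paper): one entry per
# hypothetical drive location in a chassis.
avg_temp_c = [38, 41, 44, 47, 50]              # average drive temperature
utilization = [0.55, 0.40, 0.62, 0.45, 0.58]   # fraction of time the disk is busy
failure_rate = [2.1, 2.6, 3.0, 3.9, 4.4]       # failures per 100 drives

r_temp = pearson(avg_temp_c, failure_rate)
r_util = pearson(utilization, failure_rate)
print(f"temperature vs failures: r = {r_temp:.2f}")
print(f"utilization vs failures: r = {r_util:.2f}")
```

The same function could be applied at each of the three granularities the abstract lists (drive location, server location in a rack, rack in the datacenter) by changing what one entry in each list represents.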

Published in

ACM Transactions on Storage Volume 9, Issue 2
July 2013
89 pages
ISSN:1553-3077
EISSN:1553-3093
DOI:10.1145/2491472

Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 July 2013
- Accepted: 1 October 2012
- Revised: 1 September 2012
- Received: 1 February 2012
Author Tags
Datacenter
hard disk drives
temperature impact
Qualifiers
- research-article
- Research
- Refereed