Abstract
With the advent of cloud computing and online services, large enterprises rely heavily on their datacenters to serve end users. When failure rates rise, a large datacenter facility incurs higher maintenance costs in addition to service unavailability. Among server components, hard disk drives are known to contribute significantly to server failures; however, the major determinants of disk failures in datacenters are poorly understood. In this work, we focus on the interrelationship between temperature, workload, and hard disk drive failures in a large-scale datacenter. We present a dense storage case study from a population housing thousands of servers and tens of thousands of disk drives, hosting a large-scale online service at Microsoft. We specifically establish a correlation between temperatures and failures observed at different location granularities: (a) across drive locations inside a server chassis, (b) across server locations in a rack, and (c) across multiple racks in a datacenter. We show that temperature exhibits a stronger correlation with drive failures than disk utilization does. We establish that variations in temperature are not significant in datacenters and have little impact on failures. We also explore workload impacts on temperature and disk failures and show that the impact of workload is not significant. We then experimentally evaluate knobs that control disk drive temperature, including workload and chassis design knobs. The experiments corroborate our findings from the real data study: workload knobs have minimal impact on temperature, whereas chassis knobs such as disk placement and fan speeds have a larger impact. Finally, we show the cost benefit of the proposed temperature optimizations that increase hard disk drive reliability.
Index Terms
- Datacenter Scale Evaluation of the Impact of Temperature on Hard Disk Drive Failures