ABSTRACT
Batch-correlated failures result from the manifestation of a common defect in most, if not all, disk drives belonging to the same production batch. They are much less frequent than random disk failures but can cause catastrophic data losses even in systems that rely on mirroring or erasure codes to protect their data. We propose to reduce the impact of batch-correlated failures on disk arrays by storing redundant copies of the same data on disks from different batches and, possibly, from different manufacturers. The technique is especially attractive for mirrored organizations, as it only requires that the two disks holding copies of the same data never belong to the same production batch. We also show that even partial diversity can greatly increase the probability that the data stored in a RAID array will survive batch-correlated failures.
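The mirroring constraint stated above (two disks holding copies of the same data never share a production batch) can be illustrated with a minimal sketch. It is not the authors' procedure: the function name `pair_mirrors`, the disk identifiers, and the batch labels are all hypothetical, and the pairing rule used here (sort disks by batch, then pair disk i with disk i + n/2) is simply one easy way to satisfy the constraint whenever no single batch holds more than half of the disks.

```python
from collections import Counter


def pair_mirrors(disk_batches):
    """Pair disks for mirroring so that no pair shares a production batch.

    disk_batches maps a disk id to its batch label (both hypothetical).
    Returns a list of (disk, mirror) tuples.
    """
    n = len(disk_batches)
    if n % 2:
        raise ValueError("mirroring needs an even number of disks")
    half = n // 2
    # A valid pairing exists only if no batch holds more than half the disks.
    if n and max(Counter(disk_batches.values()).values()) > half:
        raise ValueError("one batch holds more than half of the disks")
    # Sorting groups each batch into a contiguous run of at most n/2 disks,
    # so positions i and i + n/2 can never fall inside the same run.
    ordered = sorted(disk_batches, key=lambda d: (disk_batches[d], d))
    return [(ordered[i], ordered[i + half]) for i in range(half)]


# Example with six disks drawn from three assumed batches A, B, and C.
disks = {"d0": "A", "d1": "A", "d2": "B", "d3": "B", "d4": "C", "d5": "C"}
pairs = pair_mirrors(disks)
```

With this layout, a defect that disables every disk of a single batch removes at most one copy from each mirrored pair, so every pair still retains one surviving copy.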