ABSTRACT
Restoring data operations after a disaster is a daunting task: how should recovery be performed to minimize data loss and application downtime? Administrators are under considerable pressure to recover quickly, so they lack time to make good scheduling decisions. They schedule recovery based on rules of thumb, or on pre-determined orders that might not be best for the failure that actually occurred. With multiple workloads and recovery techniques, the number of possibilities is large, so the decision process is not trivial.

This paper makes several contributions to the area of data recovery scheduling. First, we formalize the description of potential recovery processes by defining recovery graphs. Recovery graphs explicitly capture alternative approaches for recovering workloads, including their recovery tasks, operational states, timing information and precedence relationships. Second, we formulate the data recovery scheduling problem as an optimization problem, where the goal is to find the schedule that minimizes the financial penalties due to downtime, data loss and vulnerability to subsequent failures. Third, we present several methods for finding optimal or near-optimal solutions, including priority-based, randomized and genetic algorithm-guided ad hoc heuristics. We quantitatively evaluate these methods using realistic storage system designs and workloads, and compare the quality of the algorithms' solutions to optimal solutions provided by a math programming formulation and to the solutions from a simple heuristic that emulates the choices made by human administrators. We find that our heuristics' solutions improve on the administrator heuristic's solutions, often approaching or achieving optimality.
INDEX TERMS
Primary Classification:
D. Software > D.4 OPERATING SYSTEMS > D.4.5 Reliability
Additional Classification:
G. Mathematics of Computing > G.1 NUMERICAL ANALYSIS > G.1.6 Optimization
K. Computing Milieux > K.6 MANAGEMENT OF COMPUTING AND INFORMATION SYSTEMS
General Terms: Algorithms, Design, Management, Reliability
Keywords: backup/restore, data storage, disaster recovery, genetic algorithms, management, math programming, optimization, scheduling