Abstract
In this article, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to cope with both fail-stop and silent errors. The objective is to minimize makespan and/or energy consumption. For divisible load applications, we use first-order approximations to find the optimal checkpointing period to minimize execution time, with an additional verification mechanism to detect silent errors before each checkpoint, hence extending the classical formula by Young and Daly for fail-stop errors only. We further extend the approach to include intermediate verifications, and to consider a bicriteria problem involving both time and energy (linear combination of execution time and energy consumption). Then, we focus on application workflows whose dependence graph is a linear chain of tasks. Here, we determine the optimal checkpointing and verification locations, with or without intermediate verifications, for the bicriteria problem. Rather than using a single speed during the whole execution, we further introduce a new execution scenario, which allows for changing the execution speed via Dynamic Voltage and Frequency Scaling (DVFS). In this latter scenario, we determine the optimal checkpointing and verification locations, as well as the optimal speed pairs for each task segment between any two consecutive checkpoints. Finally, we conduct an extensive set of simulations to support the theoretical study, and to assess the performance of each algorithm, showing that the best overall performance is achieved under the most flexible scenario using intermediate verifications and different speeds.
- Susanne Albers and Hiroshi Fujiwara. 2007. Energy-efficient algorithms for flow time minimization. ACM Transactions on Algorithms 3, 4, Article 49 (2007). Google ScholarDigital Library
- Ismail Assayad, Alain Girault, and Hamoudi Kalla. 2013. Tradeoff exploration between reliability, power consumption, and execution time for embedded systems. International Journal on Software Tools for Technology Transfer 15, 3 (2013), 229--245.Google ScholarDigital Library
- Guillaume Aupy, Anne Benoit, Thomas Hérault, Yves Robert, Frédéric Vivien, and Dounia Zaidouni. 2013. On the combination of silent error detection and checkpointing. In Proceedings of the 19th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC). 11--20. Google ScholarDigital Library
- Guillaume Aupy, Anne Benoit, and Yves Robert. 2012. Energy-aware scheduling under reliability and makespan constraints. In Proceedings of the International Conference on High Performance Computing (HiPC). 1--10.Google ScholarCross Ref
- Nikhil Bansal, Tracy Kimbrel, and Kirk Pruhs. 2007. Speed scaling to manage energy and temperature. Journal of the ACM 54, 1 (2007), 3:1--3:39. Google ScholarDigital Library
- Austin R. Benson, Sven Schmit, and Robert Schreiber. 2014. Silent error detection in numerical time-stepping schemes. The International Journal of High Performance Computing Applications DOI:10.1177/1094342014532297. Google ScholarDigital Library
- George Bosilca, Rémi Delmas, Jack Dongarra, and Julien Langou. 2009. Algorithm-based fault tolerance applied to high performance computing. Journal of Parallel and Distributed Computing 69, 4 (2009), 410--416. Google ScholarDigital Library
- Marin Bougeret, Henri Casanova, Mikael Rabie, Yves Robert, and Frédéric Vivien. 2011. Checkpointing strategies for parallel jobs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). 1--11. Google ScholarDigital Library
- Greg Bronevetsky and Bronis de Supinski. 2008. Soft error vulnerability of iterative linear algebra methods. In Proceedings of the International Conference on Supercomputing (ICS). 155--164. Google ScholarDigital Library
- K. Mani Chandy and Leslie Lamport. 1985. Distributed snapshots: Determining global states of distributed systems. ACM Transactions on Computer Systems 3, 1 (1985), 63--75. Google ScholarDigital Library
- Zizhong Chen. 2013. Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). 167--176. Google ScholarDigital Library
- J. T. Daly. 2006. A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems 22, 3 (2006), 303--312. Google ScholarCross Ref
- A. Das, A. Kumar, B. Veeravalli, C. Bolchini, and A. Miele. 2014. Combined DVFS and mapping exploration for lifetime and soft-error susceptibility improvement in MPSoCs. In Proceedings of the Conference on Design, Automation & Test in Europe (DATE). 61:1--61:6. Google ScholarDigital Library
- Anand Dixit and Alan Wood. 2011. The impact of new technology on soft error rates. In IEEE International Reliability Physics Symposium (IRPS). 5B.4.1--5B.4.7.Google ScholarCross Ref
- Daniel R. Dooly, Sally A. Goldman, and Stephen D. Scott. 2001. On-line analysis of the TCP acknowledgment delay problem. Journal of the ACM 48, 2 (2001), 243--273. Google ScholarDigital Library
- Nosayba El-Sayed, Ioan A. Stefanovici, George Amvrosiadis, Andy A. Hwang, and Bianca Schroeder. 2012. Temperature management in data centers: Why some (might) like it hot. SIGMETRICS Performance Evaluation Review 40, 1 (2012), 163--174. Google ScholarDigital Library
- James Elliott, Kishor Kharbas, David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. 2012. Combining partial redundancy and checkpointing for HPC. In Proceedings of the IEEE International Conference on Distributed Computing Systems (ICDCS). 615--626. Google ScholarDigital Library
- E. N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson. 2002. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Survey 34, 3 (2002), 375--408. Google ScholarDigital Library
- Alex Fabrikant, Ankur Luthra, Elitza Maneva, Christos H. Papadimitriou, and Scott Shenker. 2003. On a network creation game. In Proceedings of the 22nd Annual Symposium on Principles of Distributed Computing (PODC’03). 347--351. Google ScholarDigital Library
- Wu-Chun Feng. 2003. Making a case for efficient supercomputing. Queue 1, 7 (Oct. 2003), 54--64. Google ScholarDigital Library
- David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, Kurt Ferreira, and Ron Brightwell. 2012. Detection and correction of silent data corruption for large-scale high-performance computing. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC). 78. Google ScholarDigital Library
- Michael A. Heroux and Mark Hoemmen. 2011. Fault-Tolerant Iterative Methods via Selective Reliability. Research report SAND2011-3915 C. Sandia National Laboratories.Google Scholar
- Chung-hsing Hsu and Wu-chun Feng. 2005. A power-aware run-time system for high-performance computing. In Proceedings of the ACM/IEEE Supercomputing Conference (SC). 1--9. Google ScholarDigital Library
- Kuang-Hua Huang and Jacob A. Abraham. 1984. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers 33, 6 (1984), 518--528. Google ScholarDigital Library
- Andy A. Hwang, Ioan A. Stefanovici, and Bianca Schroeder. 2012. Cosmic rays don’t strike twice: Understanding the nature of DRAM errors and the implications for system design. SIGARCH Computer Architecture News 40, 1 (2012), 111--122. Google ScholarDigital Library
- Thomas Hérault and Yves Robert (Eds.). 2015. Fault-Tolerance Techniques for High-Performance Computing. Springer Verlag. Google ScholarDigital Library
- Mehdi Kargar, Aijun An, and Morteza Zihayat. 2012. Efficient bi-objective team formation in social networks. In Proceedings of the 2012 European Conference on Machine Learning and Knowledge Discovery in Databases—Volume Part II (ECML PKDD’12). 483--498.Google ScholarCross Ref
- Guoming Lu, Ziming Zheng, and Andrew A. Chien. 2013. When is multi-version checkpointing needed? In Proceedings of the 3rd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS). 49--56. Google ScholarDigital Library
- R. E. Lyons and W. Vanderkulk. 1962. The use of triple-modular redundancy to improve computer reliability. IBM Journal of Research and Development 6, 2 (1962), 200--209. Google ScholarDigital Library
- Adam Moody, Greg Bronevetsky, Kathryn Mohror, and Bronis R. de Supinski. 2010. Design, modeling, and evaluation of a scalable multi-level checkpointing system. In Proceedings of the ACM/IEEE SC Conference. 1--11. Google ScholarDigital Library
- Xiang Ni, Esteban Meneses, Nikhil Jain, and Laxmikant V. Kalé. 2013. ACR: Automatic checkpoint/restart for soft and hard error protection. In Prococeedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC’13). ACM. Google ScholarDigital Library
- Timothy J. O'Gorman. 1994. The effect of cosmic rays on the soft error rate of a DRAM at ground level. IEEE Transactions on Electron Devices 41, 4 (1994), 553--557.Google ScholarCross Ref
- Tatsuya Ozaki, Tadashi Dohi, Hiroyuki Okamura, and Naoto Kaio. 2006. Distribution-free checkpoint placement algorithms based on min-max principle. IEEE Transactions on Dependable and Secure Computing 3, 2 (2006), 130--140. Google ScholarDigital Library
- Michael K. Patterson. 2008. The effect of data center temperature on energy efficiency. In Proceedings of the 11th Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems. 1167--1174.Google ScholarCross Ref
- Nikzad Babaii Rizvandi, Albert Y. Zomaya, Young Choon Lee, Ali Javadzadeh Boloori, and Javid Taheri. 2012. Multiple frequency selection in DVFS-enabled processors to minimize energy consumption. In Energy-Efficient Distributed Computing Systems, A. Y. Zomaya and Y. C. Lee (Eds.). John Wiley & Sons, Inc., Hoboken, NJ.Google Scholar
- Piyush Sao and Richard Vuduc. 2013. Self-stabilizing iterative solvers. In Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA). Google ScholarDigital Library
- Osman Sarood, Esteban Meneses, and Laxmikant V. Kale. 2013. A ‘cool’ way of improving the reliability of HPC machines. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC). 58:1--58:12. Google ScholarDigital Library
- Manu Shantharam, Sowmyalatha Srinivasmurthy, and Padma Raghavan. 2012. Fault tolerant preconditioned conjugate gradient for sparse linear system solution. In Proceedings of the ACM International Conference on Supercomputing (ICS). 69--78. Google ScholarDigital Library
- Sam Toueg and Özalp Babaoglu. 1984. On the optimum checkpoint selection problem. SIAM Journal on Computing 13, 3 (1984), 630--649. Google ScholarDigital Library
- Frances Yao, Alan Demers, and Scott Shenker. 1995. A scheduling model for reduced CPU energy. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science (FOCS). 374. Google ScholarDigital Library
- John W. Young. 1974. A first order approximation to the optimum checkpoint interval. Communications of the ACM 17, 9 (1974), 530--531. Google ScholarDigital Library
- Baoxian Zhao, Hakan Aydin, and Dakai Zhu. 2008. Reliability-aware dynamic voltage scaling for energy-constrained real-time embedded systems. In Proceedings of the IEEE International Conference on Computer Design (ICCD). 633--639.Google ScholarCross Ref
- Dakai Zhu, R. Melhem, and D. Mosse. 2004. The effects of energy management on reliability in real-time embedded systems. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD). 35--40. Google ScholarDigital Library
- J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, T. J. O’Gorman, and J. M. Ross. 1996b. Accelerated testing for cosmic soft-error rate. IBM Journal of Research and Development 40, 1 (1996), 51--72. Google ScholarDigital Library
- J. F. Ziegler, M. E. Nelson, J. D. Shell, R. J. Peterson, C. J. Gelderloos, H. P. Muhlfeld, and C. J. Montrose. 1998. Cosmic ray soft error rates of 16-Mb DRAM memory chips. IEEE Journal of Solid-State Circuits 33, 2 (1998), 246--252.Google ScholarCross Ref
- J. F. Ziegler, H. W. Curtis, H. P. Muhlfeld, C. J. Montrose, and B. Chin. 1996a. IBM experiments in soft fails in computer electronics. IBM Journal of Research and Development 40, 1 (1996), 3--18. Google ScholarDigital Library
Index Terms
- Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors
Recommendations
Efficient checkpoint/verification patterns
Errors have become a critical problem for high-performance computing. Checkpointing protocols are often used for error recovery after fail-stop failures. However, silent errors cannot be ignored, and their peculiarity is that such errors are identified ...
Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors
High Performance Computing Systems. Performance Modeling, Benchmarking, and SimulationAbstractIn this paper, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to address both fail-stop and silent errors. The objective is to minimize either makespan or energy consumption. While DVFS is a ...
A generic approach to scheduling and checkpointing workflows
This work deals with scheduling and checkpointing strategies to execute scientific workflows on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to target fail-stop errors for arbitrary workflows. Most previous ...
Comments