Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors

Published: 20 July 2016

Abstract

In this article, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to cope with both fail-stop and silent errors. The objective is to minimize makespan and/or energy consumption. For divisible load applications, we use first-order approximations to find the optimal checkpointing period to minimize execution time, with an additional verification mechanism to detect silent errors before each checkpoint, hence extending the classical formula by Young and Daly for fail-stop errors only. We further extend the approach to include intermediate verifications, and to consider a bicriteria problem involving both time and energy (linear combination of execution time and energy consumption). Then, we focus on application workflows whose dependence graph is a linear chain of tasks. Here, we determine the optimal checkpointing and verification locations, with or without intermediate verifications, for the bicriteria problem. Rather than using a single speed during the whole execution, we further introduce a new execution scenario, which allows for changing the execution speed via Dynamic Voltage and Frequency Scaling (DVFS). In this latter scenario, we determine the optimal checkpointing and verification locations, as well as the optimal speed pairs for each task segment between any two consecutive checkpoints. Finally, we conduct an extensive set of simulations to support the theoretical study, and to assess the performance of each algorithm, showing that the best overall performance is achieved under the most flexible scenario using intermediate verifications and different speeds.
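As a rough illustration of the first-order reasoning the abstract refers to, the sketch below computes the classical Young/Daly period for fail-stop errors only, and then a period for the case where a verification precedes each checkpoint. The symbols `C`, `V`, `lambda_f`, and `lambda_s`, as well as all function names, are assumptions introduced here and are not taken from the article. The overhead model is also a simplifying assumption: a fail-stop error is taken to lose half a period on average, a silent error caught by the end-of-period verification loses the whole period, and recovery and downtime costs are ignored. It may differ from the exact model used by the authors.

```python
import math


def young_daly_period(C: float, lambda_f: float) -> float:
    """Classical first-order period for fail-stop errors only: sqrt(2C / lambda_f)."""
    return math.sqrt(2.0 * C / lambda_f)


def overhead_per_work_unit(W: float, C: float, V: float,
                           lambda_f: float, lambda_s: float) -> float:
    """Assumed first-order overhead per unit of work for a period of length W.

    (V + C) / W        : verification and checkpoint cost, amortized over the period
    lambda_f * W / 2   : a fail-stop error loses half a period on average
    lambda_s * W       : a silent error is only detected at the end of the period
    """
    return (V + C) / W + lambda_f * W / 2.0 + lambda_s * W


def period_with_verification(C: float, V: float,
                             lambda_f: float, lambda_s: float) -> float:
    """Period minimizing the assumed overhead above (set its derivative to zero)."""
    return math.sqrt((V + C) / (lambda_f / 2.0 + lambda_s))


if __name__ == "__main__":
    # Hypothetical platform parameters, in seconds and errors per second.
    C, V = 600.0, 60.0
    lambda_f, lambda_s = 1.0 / 86400.0, 1.0 / 43200.0
    print("fail-stop only:", young_daly_period(C, lambda_f))
    print("with silent errors:", period_with_verification(C, V, lambda_f, lambda_s))
```

With `V = 0` and `lambda_s = 0`, the second formula reduces to the first, which is a quick sanity check that the extension is consistent with the fail-stop-only case under these assumptions.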

References

[1]
Susanne Albers and Hiroshi Fujiwara. 2007. Energy-efficient algorithms for flow time minimization. ACM Transactions on Algorithms 3, 4, Article 49 (2007).
[2]
Ismail Assayad, Alain Girault, and Hamoudi Kalla. 2013. Tradeoff exploration between reliability, power consumption, and execution time for embedded systems. International Journal on Software Tools for Technology Transfer 15, 3 (2013), 229--245.
[3]
Guillaume Aupy, Anne Benoit, Thomas Hérault, Yves Robert, Frédéric Vivien, and Dounia Zaidouni. 2013. On the combination of silent error detection and checkpointing. In Proceedings of the 19th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC). 11--20.
[4]
Guillaume Aupy, Anne Benoit, and Yves Robert. 2012. Energy-aware scheduling under reliability and makespan constraints. In Proceedings of the International Conference on High Performance Computing (HiPC). 1--10.
[5]
Nikhil Bansal, Tracy Kimbrel, and Kirk Pruhs. 2007. Speed scaling to manage energy and temperature. Journal of the ACM 54, 1 (2007), 3:1--3:39.
[6]
Austin R. Benson, Sven Schmit, and Robert Schreiber. 2014. Silent error detection in numerical time-stepping schemes. The International Journal of High Performance Computing Applications
[7]
George Bosilca, Rémi Delmas, Jack Dongarra, and Julien Langou. 2009. Algorithm-based fault tolerance applied to high performance computing. Journal of Parallel and Distributed Computing 69, 4 (2009), 410--416.
[8]
Marin Bougeret, Henri Casanova, Mikael Rabie, Yves Robert, and Frédéric Vivien. 2011. Checkpointing strategies for parallel jobs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). 1--11.
[9]
Greg Bronevetsky and Bronis de Supinski. 2008. Soft error vulnerability of iterative linear algebra methods. In Proceedings of the International Conference on Supercomputing (ICS). 155--164.
[10]
K. Mani Chandy and Leslie Lamport. 1985. Distributed snapshots: Determining global states of distributed systems. ACM Transactions on Computer Systems 3, 1 (1985), 63--75.
[11]
Zizhong Chen. 2013. Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). 167--176.
[12]
J. T. Daly. 2006. A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems 22, 3 (2006), 303--312.
[13]
A. Das, A. Kumar, B. Veeravalli, C. Bolchini, and A. Miele. 2014. Combined DVFS and mapping exploration for lifetime and soft-error susceptibility improvement in MPSoCs. In Proceedings of the Conference on Design, Automation & Test in Europe (DATE). 61:1--61:6.
[14]
Anand Dixit and Alan Wood. 2011. The impact of new technology on soft error rates. In IEEE International Reliability Physics Symposium (IRPS). 5B.4.1--5B.4.7.
[15]
Daniel R. Dooly, Sally A. Goldman, and Stephen D. Scott. 2001. On-line analysis of the TCP acknowledgment delay problem. Journal of the ACM 48, 2 (2001), 243--273.
[16]
Nosayba El-Sayed, Ioan A. Stefanovici, George Amvrosiadis, Andy A. Hwang, and Bianca Schroeder. 2012. Temperature management in data centers: Why some (might) like it hot. SIGMETRICS Performance Evaluation Review 40, 1 (2012), 163--174.
[17]
James Elliott, Kishor Kharbas, David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. 2012. Combining partial redundancy and checkpointing for HPC. In Proceedings of the IEEE International Conference on Distributed Computing Systems (ICDCS). 615--626.
[18]
E. N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson. 2002. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys 34, 3 (2002), 375--408.
[19]
Alex Fabrikant, Ankur Luthra, Elitza Maneva, Christos H. Papadimitriou, and Scott Shenker. 2003. On a network creation game. In Proceedings of the 22nd Annual Symposium on Principles of Distributed Computing (PODC’03). 347--351.
[20]
Wu-Chun Feng. 2003. Making a case for efficient supercomputing. Queue 1, 7 (Oct. 2003), 54--64.
[21]
David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, Kurt Ferreira, and Ron Brightwell. 2012. Detection and correction of silent data corruption for large-scale high-performance computing. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC). 78.
[22]
Michael A. Heroux and Mark Hoemmen. 2011. Fault-Tolerant Iterative Methods via Selective Reliability. Research report SAND2011-3915 C. Sandia National Laboratories.
[23]
Chung-hsing Hsu and Wu-chun Feng. 2005. A power-aware run-time system for high-performance computing. In Proceedings of the ACM/IEEE Supercomputing Conference (SC). 1--9.
[24]
Kuang-Hua Huang and Jacob A. Abraham. 1984. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers 33, 6 (1984), 518--528.
[25]
Andy A. Hwang, Ioan A. Stefanovici, and Bianca Schroeder. 2012. Cosmic rays don’t strike twice: Understanding the nature of DRAM errors and the implications for system design. SIGARCH Computer Architecture News 40, 1 (2012), 111--122.
[26]
Thomas Hérault and Yves Robert (Eds.). 2015. Fault-Tolerance Techniques for High-Performance Computing. Springer Verlag.
[27]
Mehdi Kargar, Aijun An, and Morteza Zihayat. 2012. Efficient bi-objective team formation in social networks. In Proceedings of the 2012 European Conference on Machine Learning and Knowledge Discovery in Databases—Volume Part II (ECML PKDD’12). 483--498.
[28]
Guoming Lu, Ziming Zheng, and Andrew A. Chien. 2013. When is multi-version checkpointing needed? In Proceedings of the 3rd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS). 49--56.
[29]
R. E. Lyons and W. Vanderkulk. 1962. The use of triple-modular redundancy to improve computer reliability. IBM Journal of Research and Development 6, 2 (1962), 200--209.
[30]
Adam Moody, Greg Bronevetsky, Kathryn Mohror, and Bronis R. de Supinski. 2010. Design, modeling, and evaluation of a scalable multi-level checkpointing system. In Proceedings of the ACM/IEEE SC Conference. 1--11.
[31]
Xiang Ni, Esteban Meneses, Nikhil Jain, and Laxmikant V. Kalé. 2013. ACR: Automatic checkpoint/restart for soft and hard error protection. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC’13). ACM.
[32]
Timothy J. O'Gorman. 1994. The effect of cosmic rays on the soft error rate of a DRAM at ground level. IEEE Transactions on Electron Devices 41, 4 (1994), 553--557.
[33]
Tatsuya Ozaki, Tadashi Dohi, Hiroyuki Okamura, and Naoto Kaio. 2006. Distribution-free checkpoint placement algorithms based on min-max principle. IEEE Transactions on Dependable and Secure Computing 3, 2 (2006), 130--140.
[34]
Michael K. Patterson. 2008. The effect of data center temperature on energy efficiency. In Proceedings of the 11th Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems. 1167--1174.
[35]
Nikzad Babaii Rizvandi, Albert Y. Zomaya, Young Choon Lee, Ali Javadzadeh Boloori, and Javid Taheri. 2012. Multiple frequency selection in DVFS-enabled processors to minimize energy consumption. In Energy-Efficient Distributed Computing Systems, A. Y. Zomaya and Y. C. Lee (Eds.). John Wiley & Sons, Inc., Hoboken, NJ.
[36]
Piyush Sao and Richard Vuduc. 2013. Self-stabilizing iterative solvers. In Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA).
[37]
Osman Sarood, Esteban Meneses, and Laxmikant V. Kale. 2013. A ‘cool’ way of improving the reliability of HPC machines. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC). 58:1--58:12.
[38]
Manu Shantharam, Sowmyalatha Srinivasmurthy, and Padma Raghavan. 2012. Fault tolerant preconditioned conjugate gradient for sparse linear system solution. In Proceedings of the ACM International Conference on Supercomputing (ICS). 69--78.
[39]
Sam Toueg and Özalp Babaoglu. 1984. On the optimum checkpoint selection problem. SIAM Journal on Computing 13, 3 (1984), 630--649.
[40]
Frances Yao, Alan Demers, and Scott Shenker. 1995. A scheduling model for reduced CPU energy. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science (FOCS). 374.
[41]
John W. Young. 1974. A first order approximation to the optimum checkpoint interval. Communications of the ACM 17, 9 (1974), 530--531.
[42]
Baoxian Zhao, Hakan Aydin, and Dakai Zhu. 2008. Reliability-aware dynamic voltage scaling for energy-constrained real-time embedded systems. In Proceedings of the IEEE International Conference on Computer Design (ICCD). 633--639.
[43]
Dakai Zhu, R. Melhem, and D. Mosse. 2004. The effects of energy management on reliability in real-time embedded systems. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD). 35--40.
[44]
J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, T. J. O’Gorman, and J. M. Ross. 1996b. Accelerated testing for cosmic soft-error rate. IBM Journal of Research and Development 40, 1 (1996), 51--72.
[45]
J. F. Ziegler, M. E. Nelson, J. D. Shell, R. J. Peterson, C. J. Gelderloos, H. P. Muhlfeld, and C. J. Montrose. 1998. Cosmic ray soft error rates of 16-Mb DRAM memory chips. IEEE Journal of Solid-State Circuits 33, 2 (1998), 246--252.
[46]
J. F. Ziegler, H. W. Curtis, H. P. Muhlfeld, C. J. Montrose, and B. Chin. 1996a. IBM experiments in soft fails in computer electronics. IBM Journal of Research and Development 40, 1 (1996), 3--18.

Published In

ACM Transactions on Parallel Computing  Volume 3, Issue 2
August 2016
154 pages
ISSN:2329-4949
EISSN:2329-4957
DOI:10.1145/2974644
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 July 2016
Accepted: 01 February 2016
Revised: 01 January 2016
Received: 01 December 2014
Published in TOPC Volume 3, Issue 2

Author Tags

  1. HPC
  2. checkpoint
  3. fail-stop error
  4. failure
  5. resilience
  6. silent data corruption
  7. silent error
  8. verification

Qualifiers

  • Research-article
  • Research
  • Refereed

Bibliometrics & Citations

Citations

Cited By

  • (2024) Checkpointing Strategies to Tolerate Non-Memoryless Failures on HPC Platforms. ACM Transactions on Parallel Computing 11:1 (1-26). DOI: 10.1145/3624560. Online publication date: 11-Mar-2024
  • (2024) A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly? Future Generation Computer Systems 161 (315-328). DOI: 10.1016/j.future.2024.07.022. Online publication date: Dec-2024
  • (2022) Checkpointing à la Young/Daly: An Overview. Proceedings of the 2022 Fourteenth International Conference on Contemporary Computing (701-710). DOI: 10.1145/3549206.3549328. Online publication date: 4-Aug-2022
  • (2021) Probabilistic and Temporal Failure Detectors for Solving Distributed Problems. Journal of Parallel and Distributed Computing. DOI: 10.1016/j.jpdc.2021.07.017. Online publication date: Jul-2021
  • (2020) Coded QR Decomposition. 2020 IEEE International Symposium on Information Theory (ISIT) (191-196). DOI: 10.1109/ISIT44484.2020.9173985. Online publication date: 21-Jun-2020
  • (2019) A generic approach to scheduling and checkpointing workflows. The International Journal of High Performance Computing Applications (109434201986689). DOI: 10.1177/1094342019866891. Online publication date: 12-Aug-2019
  • (2018) A Generic Approach to Scheduling and Checkpointing Workflows. Proceedings of the 47th International Conference on Parallel Processing (1-10). DOI: 10.1145/3225058.3225145. Online publication date: 13-Aug-2018
  • (2018) Checkpointing Workflows for Fail-Stop Errors. IEEE Transactions on Computers (1-1). DOI: 10.1109/TC.2018.2801300. Online publication date: 2018
  • (2018) Multi-level checkpointing and silent error detection for linear workflows. Journal of Computational Science 28 (398-415). DOI: 10.1016/j.jocs.2017.03.024. Online publication date: Sep-2018
  • (2017) Checkpointing Workflows for Fail-Stop Errors. 2017 IEEE International Conference on Cluster Computing (CLUSTER) (487-497). DOI: 10.1109/CLUSTER.2017.14. Online publication date: Sep-2017
