Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors

Authors:

Aurélien Cavelan,

Hongyang Sun

ACM Transactions on Parallel Computing (TOPC), Volume 3, Issue 2

Article No.: 13, Pages 1 - 36

https://doi.org/10.1145/2897189

Published: 20 July 2016

Abstract

In this article, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to cope with both fail-stop and silent errors. The objective is to minimize makespan and/or energy consumption. For divisible load applications, we use first-order approximations to find the optimal checkpointing period to minimize execution time, with an additional verification mechanism to detect silent errors before each checkpoint, hence extending the classical formula by Young and Daly for fail-stop errors only. We further extend the approach to include intermediate verifications, and to consider a bicriteria problem involving both time and energy (linear combination of execution time and energy consumption). Then, we focus on application workflows whose dependence graph is a linear chain of tasks. Here, we determine the optimal checkpointing and verification locations, with or without intermediate verifications, for the bicriteria problem. Rather than using a single speed during the whole execution, we further introduce a new execution scenario, which allows for changing the execution speed via Dynamic Voltage and Frequency Scaling (DVFS). In this latter scenario, we determine the optimal checkpointing and verification locations, as well as the optimal speed pairs for each task segment between any two consecutive checkpoints. Finally, we conduct an extensive set of simulations to support the theoretical study, and to assess the performance of each algorithm, showing that the best overall performance is achieved under the most flexible scenario using intermediate verifications and different speeds.
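To make the first-order reasoning concrete, the following is a minimal sketch, not the article's own derivation. It assumes a simplified waste model in which a fail-stop error loses on average half a period of work, a silent error (detected only by the verification at the end of the period) loses the whole period, and recovery and downtime costs are ignored. The function and parameter names are hypothetical. With no silent errors and no verification cost, the expression reduces to Young's classical period sqrt(2 * C * MTBF).

```python
import math


def young_period(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order period for fail-stop errors only: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_waste(period: float, checkpoint_cost: float,
                      verification_cost: float, rate_fail_stop: float,
                      rate_silent: float) -> float:
    """Approximate fraction of time wasted per unit of useful work.

    Assumed model (illustrative only): each period pays V + C once,
    a fail-stop error loses period / 2 on average, and a silent error
    loses the full period because it is caught by the final verification.
    """
    resilience_overhead = (verification_cost + checkpoint_cost) / period
    expected_rework = (rate_fail_stop / 2.0 + rate_silent) * period
    return resilience_overhead + expected_rework


def first_order_period(checkpoint_cost: float, verification_cost: float,
                       rate_fail_stop: float, rate_silent: float) -> float:
    """Period that minimizes first_order_waste under the assumed model."""
    loss_rate = rate_fail_stop / 2.0 + rate_silent
    if loss_rate <= 0.0:
        raise ValueError("at least one error rate must be positive")
    return math.sqrt((verification_cost + checkpoint_cost) / loss_rate)


if __name__ == "__main__":
    # Hypothetical platform: 60 s checkpoint, 10 s verification,
    # one fail-stop error per day, two silent errors per day.
    day = 86400.0
    period = first_order_period(60.0, 10.0, 1.0 / day, 2.0 / day)
    print(f"Young (fail-stop only): {young_period(60.0, day):.0f} s")
    print(f"With silent errors:     {period:.0f} s")
```

Silent errors enter the denominator with twice the weight of fail-stop errors in this model, so adding them shortens the period even before the verification cost is counted.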

References

[1]

Susanne Albers and Hiroshi Fujiwara. 2007. Energy-efficient algorithms for flow time minimization. ACM Transactions on Algorithms 3, 4, Article 49 (2007).

[2]

Ismail Assayad, Alain Girault, and Hamoudi Kalla. 2013. Tradeoff exploration between reliability, power consumption, and execution time for embedded systems. International Journal on Software Tools for Technology Transfer 15, 3 (2013), 229--245.

[3]

Guillaume Aupy, Anne Benoit, Thomas Hérault, Yves Robert, Frédéric Vivien, and Dounia Zaidouni. 2013. On the combination of silent error detection and checkpointing. In Proceedings of the 19th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC). 11--20.

[4]

Guillaume Aupy, Anne Benoit, and Yves Robert. 2012. Energy-aware scheduling under reliability and makespan constraints. In Proceedings of the International Conference on High Performance Computing (HiPC). 1--10.

[5]

Nikhil Bansal, Tracy Kimbrel, and Kirk Pruhs. 2007. Speed scaling to manage energy and temperature. Journal of the ACM 54, 1 (2007), 3:1--3:39.

[6]

Austin R. Benson, Sven Schmit, and Robert Schreiber. 2014. Silent error detection in numerical time-stepping schemes. The International Journal of High Performance Computing Applications

[7]

George Bosilca, Rémi Delmas, Jack Dongarra, and Julien Langou. 2009. Algorithm-based fault tolerance applied to high performance computing. Journal of Parallel and Distributed Computing 69, 4 (2009), 410--416.

[8]

Marin Bougeret, Henri Casanova, Mikael Rabie, Yves Robert, and Frédéric Vivien. 2011. Checkpointing strategies for parallel jobs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). 1--11.

[9]

Greg Bronevetsky and Bronis de Supinski. 2008. Soft error vulnerability of iterative linear algebra methods. In Proceedings of the International Conference on Supercomputing (ICS). 155--164.

[10]

K. Mani Chandy and Leslie Lamport. 1985. Distributed snapshots: Determining global states of distributed systems. ACM Transactions on Computer Systems 3, 1 (1985), 63--75.

[11]

Zizhong Chen. 2013. Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). 167--176.

[12]

J. T. Daly. 2006. A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems 22, 3 (2006), 303--312.

[13]

A. Das, A. Kumar, B. Veeravalli, C. Bolchini, and A. Miele. 2014. Combined DVFS and mapping exploration for lifetime and soft-error susceptibility improvement in MPSoCs. In Proceedings of the Conference on Design, Automation & Test in Europe (DATE). 61:1--61:6.

[14]

Anand Dixit and Alan Wood. 2011. The impact of new technology on soft error rates. In IEEE International Reliability Physics Symposium (IRPS). 5B.4.1--5B.4.7.

[15]

Daniel R. Dooly, Sally A. Goldman, and Stephen D. Scott. 2001. On-line analysis of the TCP acknowledgment delay problem. Journal of the ACM 48, 2 (2001), 243--273.

[16]

Nosayba El-Sayed, Ioan A. Stefanovici, George Amvrosiadis, Andy A. Hwang, and Bianca Schroeder. 2012. Temperature management in data centers: Why some (might) like it hot. SIGMETRICS Performance Evaluation Review 40, 1 (2012), 163--174.

[17]

James Elliott, Kishor Kharbas, David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. 2012. Combining partial redundancy and checkpointing for HPC. In Proceedings of the IEEE International Conference on Distributed Computing Systems (ICDCS). 615--626.

[18]

E. N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson. 2002. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys 34, 3 (2002), 375--408.

[19]

Alex Fabrikant, Ankur Luthra, Elitza Maneva, Christos H. Papadimitriou, and Scott Shenker. 2003. On a network creation game. In Proceedings of the 22nd Annual Symposium on Principles of Distributed Computing (PODC’03). 347--351.

[20]

Wu-Chun Feng. 2003. Making a case for efficient supercomputing. Queue 1, 7 (Oct. 2003), 54--64.

[21]

David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, Kurt Ferreira, and Ron Brightwell. 2012. Detection and correction of silent data corruption for large-scale high-performance computing. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC). 78.

[22]

Michael A. Heroux and Mark Hoemmen. 2011. Fault-Tolerant Iterative Methods via Selective Reliability. Research report SAND2011-3915 C. Sandia National Laboratories.

[23]

Chung-hsing Hsu and Wu-chun Feng. 2005. A power-aware run-time system for high-performance computing. In Proceedings of the ACM/IEEE Supercomputing Conference (SC). 1--9.

[24]

Kuang-Hua Huang and Jacob A. Abraham. 1984. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers 33, 6 (1984), 518--528.

[25]

Andy A. Hwang, Ioan A. Stefanovici, and Bianca Schroeder. 2012. Cosmic rays don’t strike twice: Understanding the nature of DRAM errors and the implications for system design. SIGARCH Computer Architecture News 40, 1 (2012), 111--122.

[26]

Thomas Hérault and Yves Robert (Eds.). 2015. Fault-Tolerance Techniques for High-Performance Computing. Springer Verlag.

[27]

Mehdi Kargar, Aijun An, and Morteza Zihayat. 2012. Efficient bi-objective team formation in social networks. In Proceedings of the 2012 European Conference on Machine Learning and Knowledge Discovery in Databases—Volume Part II (ECML PKDD’12). 483--498.

[28]

Guoming Lu, Ziming Zheng, and Andrew A. Chien. 2013. When is multi-version checkpointing needed? In Proceedings of the 3rd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS). 49--56.

[29]

R. E. Lyons and W. Vanderkulk. 1962. The use of triple-modular redundancy to improve computer reliability. IBM Journal of Research and Development 6, 2 (1962), 200--209.

[30]

Adam Moody, Greg Bronevetsky, Kathryn Mohror, and Bronis R. de Supinski. 2010. Design, modeling, and evaluation of a scalable multi-level checkpointing system. In Proceedings of the ACM/IEEE SC Conference. 1--11.

[31]

Xiang Ni, Esteban Meneses, Nikhil Jain, and Laxmikant V. Kalé. 2013. ACR: Automatic checkpoint/restart for soft and hard error protection. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC’13). ACM.

[32]

Timothy J. O'Gorman. 1994. The effect of cosmic rays on the soft error rate of a DRAM at ground level. IEEE Transactions on Electron Devices 41, 4 (1994), 553--557.

[33]

Tatsuya Ozaki, Tadashi Dohi, Hiroyuki Okamura, and Naoto Kaio. 2006. Distribution-free checkpoint placement algorithms based on min-max principle. IEEE Transactions on Dependable and Secure Computing 3, 2 (2006), 130--140.

[34]

Michael K. Patterson. 2008. The effect of data center temperature on energy efficiency. In Proceedings of the 11th Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems. 1167--1174.

[35]

Nikzad Babaii Rizvandi, Albert Y. Zomaya, Young Choon Lee, Ali Javadzadeh Boloori, and Javid Taheri. 2012. Multiple frequency selection in DVFS-enabled processors to minimize energy consumption. In Energy-Efficient Distributed Computing Systems, A. Y. Zomaya and Y. C. Lee (Eds.). John Wiley & Sons, Inc., Hoboken, NJ.

[36]

Piyush Sao and Richard Vuduc. 2013. Self-stabilizing iterative solvers. In Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA).

[37]

Osman Sarood, Esteban Meneses, and Laxmikant V. Kale. 2013. A ‘cool’ way of improving the reliability of HPC machines. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC). 58:1--58:12.

[38]

Manu Shantharam, Sowmyalatha Srinivasmurthy, and Padma Raghavan. 2012. Fault tolerant preconditioned conjugate gradient for sparse linear system solution. In Proceedings of the ACM International Conference on Supercomputing (ICS). 69--78.

[39]

Sam Toueg and Özalp Babaoglu. 1984. On the optimum checkpoint selection problem. SIAM Journal on Computing 13, 3 (1984), 630--649.

[40]

Frances Yao, Alan Demers, and Scott Shenker. 1995. A scheduling model for reduced CPU energy. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science (FOCS). 374.

[41]

John W. Young. 1974. A first order approximation to the optimum checkpoint interval. Communications of the ACM 17, 9 (1974), 530--531.

[42]

Baoxian Zhao, Hakan Aydin, and Dakai Zhu. 2008. Reliability-aware dynamic voltage scaling for energy-constrained real-time embedded systems. In Proceedings of the IEEE International Conference on Computer Design (ICCD). 633--639.

[43]

Dakai Zhu, R. Melhem, and D. Mosse. 2004. The effects of energy management on reliability in real-time embedded systems. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD). 35--40.

[44]

J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, T. J. O’Gorman, and J. M. Ross. 1996b. Accelerated testing for cosmic soft-error rate. IBM Journal of Research and Development 40, 1 (1996), 51--72.

[45]

J. F. Ziegler, M. E. Nelson, J. D. Shell, R. J. Peterson, C. J. Gelderloos, H. P. Muhlfeld, and C. J. Montrose. 1998. Cosmic ray soft error rates of 16-Mb DRAM memory chips. IEEE Journal of Solid-State Circuits 33, 2 (1998), 246--252.

[46]

J. F. Ziegler, H. W. Curtis, H. P. Muhlfeld, C. J. Montrose, and B. Chin. 1996a. IBM experiments in soft fails in computer electronics. IBM Journal of Research and Development 40, 1 (1996), 3--18.

Information & Contributors

Information

Published In

ACM Transactions on Parallel Computing Volume 3, Issue 2

August 2016

154 pages

ISSN:2329-4949

EISSN:2329-4957

DOI:10.1145/2974644

Editor:
Phillip B. Gibbons
Carnegie Mellon University, Pittsburgh, USA

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 July 2016

Accepted: 01 February 2016

Revised: 01 January 2016

Received: 01 December 2014

Published in TOPC Volume 3, Issue 2

Qualifiers

Research-article
Research
Refereed

Funding Sources

Agence Nationale de la Recherche

Bibliometrics

Total Citations: 10
Total Downloads: 148
Downloads (Last 12 months): 8
Downloads (Last 6 weeks): 1

Reflects downloads up to 06 Jan 2025

Citations

Cited By

Benoit A, Perotin L, Robert Y, Vivien F (2024). Checkpointing Strategies to Tolerate Non-Memoryless Failures on HPC Platforms. ACM Transactions on Parallel Computing 11:1 (1-26). Online publication date: 11-Mar-2024
https://dl.acm.org/doi/10.1145/3624560
Bautista-Gomez L, Benoit A, Di S, Herault T, Robert Y, Sun H (2024). A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly? Future Generation Computer Systems 161 (315-328). Online publication date: Dec-2024
https://doi.org/10.1016/j.future.2024.07.022
Benoit A, Du Y, Herault T, Marchal L, Pallez G, Perotin L, Robert Y, Sun H, Vivien F, Sahni S, Saxena V, Iyengar S (2022). Checkpointing à la Young/Daly: An Overview. Proceedings of the 2022 Fourteenth International Conference on Contemporary Computing (701-710). Online publication date: 4-Aug-2022
https://dl.acm.org/doi/10.1145/3549206.3549328
Guerraoui R, Kozhaya D, Pignolet Y (2021). Probabilistic and Temporal Failure Detectors for Solving Distributed Problems. Journal of Parallel and Distributed Computing. Online publication date: Jul-2021
https://doi.org/10.1016/j.jpdc.2021.07.017
Nguyen Q, Jeong H, Grover P (2020). Coded QR Decomposition. 2020 IEEE International Symposium on Information Theory (ISIT) (191-196). Online publication date: 21-Jun-2020
https://dl.acm.org/doi/10.1109/ISIT44484.2020.9173985
Han L, Le Fèvre V, Canon L, Robert Y, Vivien F (2019). A generic approach to scheduling and checkpointing workflows. The International Journal of High Performance Computing Applications (109434201986689). Online publication date: 12-Aug-2019
https://doi.org/10.1177/1094342019866891
Han L, Le Fèvre V, Canon L, Robert Y, Vivien F (2018). A Generic Approach to Scheduling and Checkpointing Workflows. Proceedings of the 47th International Conference on Parallel Processing (1-10). Online publication date: 13-Aug-2018
https://dl.acm.org/doi/10.1145/3225058.3225145
Han L, Canon L, Casanova H, Robert Y, Vivien F (2018). Checkpointing Workflows for Fail-Stop Errors. IEEE Transactions on Computers (1-1). Online publication date: 2018
https://doi.org/10.1109/TC.2018.2801300
Benoit A, Cavelan A, Robert Y, Sun H (2018). Multi-level checkpointing and silent error detection for linear workflows. Journal of Computational Science 28 (398-415). Online publication date: Sep-2018
https://doi.org/10.1016/j.jocs.2017.03.024
Han L, Canon L, Casanova H, Robert Y, Vivien F (2017). Checkpointing Workflows for Fail-Stop Errors. 2017 IEEE International Conference on Cluster Computing (CLUSTER) (487-497). Online publication date: Sep-2017
https://doi.org/10.1109/CLUSTER.2017.14
