skip to main content
research-article

Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors

Authors Info & Claims
Published:20 July 2016Publication History
Skip Abstract Section

Abstract

In this article, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to cope with both fail-stop and silent errors. The objective is to minimize makespan and/or energy consumption. For divisible load applications, we use first-order approximations to find the optimal checkpointing period to minimize execution time, with an additional verification mechanism to detect silent errors before each checkpoint, hence extending the classical formula by Young and Daly for fail-stop errors only. We further extend the approach to include intermediate verifications, and to consider a bicriteria problem involving both time and energy (linear combination of execution time and energy consumption). Then, we focus on application workflows whose dependence graph is a linear chain of tasks. Here, we determine the optimal checkpointing and verification locations, with or without intermediate verifications, for the bicriteria problem. Rather than using a single speed during the whole execution, we further introduce a new execution scenario, which allows for changing the execution speed via Dynamic Voltage and Frequency Scaling (DVFS). In this latter scenario, we determine the optimal checkpointing and verification locations, as well as the optimal speed pairs for each task segment between any two consecutive checkpoints. Finally, we conduct an extensive set of simulations to support the theoretical study, and to assess the performance of each algorithm, showing that the best overall performance is achieved under the most flexible scenario using intermediate verifications and different speeds.

References

  1. Susanne Albers and Hiroshi Fujiwara. 2007. Energy-efficient algorithms for flow time minimization. ACM Transactions on Algorithms 3, 4, Article 49 (2007). Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Ismail Assayad, Alain Girault, and Hamoudi Kalla. 2013. Tradeoff exploration between reliability, power consumption, and execution time for embedded systems. International Journal on Software Tools for Technology Transfer 15, 3 (2013), 229--245.Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Guillaume Aupy, Anne Benoit, Thomas Hérault, Yves Robert, Frédéric Vivien, and Dounia Zaidouni. 2013. On the combination of silent error detection and checkpointing. In Proceedings of the 19th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC). 11--20. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Guillaume Aupy, Anne Benoit, and Yves Robert. 2012. Energy-aware scheduling under reliability and makespan constraints. In Proceedings of the International Conference on High Performance Computing (HiPC). 1--10.Google ScholarGoogle ScholarCross RefCross Ref
  5. Nikhil Bansal, Tracy Kimbrel, and Kirk Pruhs. 2007. Speed scaling to manage energy and temperature. Journal of the ACM 54, 1 (2007), 3:1--3:39. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Austin R. Benson, Sven Schmit, and Robert Schreiber. 2014. Silent error detection in numerical time-stepping schemes. The International Journal of High Performance Computing Applications DOI:10.1177/1094342014532297. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. George Bosilca, Rémi Delmas, Jack Dongarra, and Julien Langou. 2009. Algorithm-based fault tolerance applied to high performance computing. Journal of Parallel and Distributed Computing 69, 4 (2009), 410--416. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Marin Bougeret, Henri Casanova, Mikael Rabie, Yves Robert, and Frédéric Vivien. 2011. Checkpointing strategies for parallel jobs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). 1--11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Greg Bronevetsky and Bronis de Supinski. 2008. Soft error vulnerability of iterative linear algebra methods. In Proceedings of the International Conference on Supercomputing (ICS). 155--164. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. K. Mani Chandy and Leslie Lamport. 1985. Distributed snapshots: Determining global states of distributed systems. ACM Transactions on Computer Systems 3, 1 (1985), 63--75. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Zizhong Chen. 2013. Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). 167--176. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. J. T. Daly. 2006. A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems 22, 3 (2006), 303--312. Google ScholarGoogle ScholarCross RefCross Ref
  13. A. Das, A. Kumar, B. Veeravalli, C. Bolchini, and A. Miele. 2014. Combined DVFS and mapping exploration for lifetime and soft-error susceptibility improvement in MPSoCs. In Proceedings of the Conference on Design, Automation & Test in Europe (DATE). 61:1--61:6. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Anand Dixit and Alan Wood. 2011. The impact of new technology on soft error rates. In IEEE International Reliability Physics Symposium (IRPS). 5B.4.1--5B.4.7.Google ScholarGoogle ScholarCross RefCross Ref
  15. Daniel R. Dooly, Sally A. Goldman, and Stephen D. Scott. 2001. On-line analysis of the TCP acknowledgment delay problem. Journal of the ACM 48, 2 (2001), 243--273. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Nosayba El-Sayed, Ioan A. Stefanovici, George Amvrosiadis, Andy A. Hwang, and Bianca Schroeder. 2012. Temperature management in data centers: Why some (might) like it hot. SIGMETRICS Performance Evaluation Review 40, 1 (2012), 163--174. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. James Elliott, Kishor Kharbas, David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. 2012. Combining partial redundancy and checkpointing for HPC. In Proceedings of the IEEE International Conference on Distributed Computing Systems (ICDCS). 615--626. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. E. N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson. 2002. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Survey 34, 3 (2002), 375--408. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Alex Fabrikant, Ankur Luthra, Elitza Maneva, Christos H. Papadimitriou, and Scott Shenker. 2003. On a network creation game. In Proceedings of the 22nd Annual Symposium on Principles of Distributed Computing (PODC’03). 347--351. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Wu-Chun Feng. 2003. Making a case for efficient supercomputing. Queue 1, 7 (Oct. 2003), 54--64. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, Kurt Ferreira, and Ron Brightwell. 2012. Detection and correction of silent data corruption for large-scale high-performance computing. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC). 78. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Michael A. Heroux and Mark Hoemmen. 2011. Fault-Tolerant Iterative Methods via Selective Reliability. Research report SAND2011-3915 C. Sandia National Laboratories.Google ScholarGoogle Scholar
  23. Chung-hsing Hsu and Wu-chun Feng. 2005. A power-aware run-time system for high-performance computing. In Proceedings of the ACM/IEEE Supercomputing Conference (SC). 1--9. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Kuang-Hua Huang and Jacob A. Abraham. 1984. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers 33, 6 (1984), 518--528. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Andy A. Hwang, Ioan A. Stefanovici, and Bianca Schroeder. 2012. Cosmic rays don’t strike twice: Understanding the nature of DRAM errors and the implications for system design. SIGARCH Computer Architecture News 40, 1 (2012), 111--122. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Thomas Hérault and Yves Robert (Eds.). 2015. Fault-Tolerance Techniques for High-Performance Computing. Springer Verlag. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Mehdi Kargar, Aijun An, and Morteza Zihayat. 2012. Efficient bi-objective team formation in social networks. In Proceedings of the 2012 European Conference on Machine Learning and Knowledge Discovery in Databases—Volume Part II (ECML PKDD’12). 483--498.Google ScholarGoogle ScholarCross RefCross Ref
  28. Guoming Lu, Ziming Zheng, and Andrew A. Chien. 2013. When is multi-version checkpointing needed? In Proceedings of the 3rd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS). 49--56. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. R. E. Lyons and W. Vanderkulk. 1962. The use of triple-modular redundancy to improve computer reliability. IBM Journal of Research and Development 6, 2 (1962), 200--209. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Adam Moody, Greg Bronevetsky, Kathryn Mohror, and Bronis R. de Supinski. 2010. Design, modeling, and evaluation of a scalable multi-level checkpointing system. In Proceedings of the ACM/IEEE SC Conference. 1--11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Xiang Ni, Esteban Meneses, Nikhil Jain, and Laxmikant V. Kalé. 2013. ACR: Automatic checkpoint/restart for soft and hard error protection. In Prococeedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC’13). ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Timothy J. O'Gorman. 1994. The effect of cosmic rays on the soft error rate of a DRAM at ground level. IEEE Transactions on Electron Devices 41, 4 (1994), 553--557.Google ScholarGoogle ScholarCross RefCross Ref
  33. Tatsuya Ozaki, Tadashi Dohi, Hiroyuki Okamura, and Naoto Kaio. 2006. Distribution-free checkpoint placement algorithms based on min-max principle. IEEE Transactions on Dependable and Secure Computing 3, 2 (2006), 130--140. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Michael K. Patterson. 2008. The effect of data center temperature on energy efficiency. In Proceedings of the 11th Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems. 1167--1174.Google ScholarGoogle ScholarCross RefCross Ref
  35. Nikzad Babaii Rizvandi, Albert Y. Zomaya, Young Choon Lee, Ali Javadzadeh Boloori, and Javid Taheri. 2012. Multiple frequency selection in DVFS-enabled processors to minimize energy consumption. In Energy-Efficient Distributed Computing Systems, A. Y. Zomaya and Y. C. Lee (Eds.). John Wiley & Sons, Inc., Hoboken, NJ.Google ScholarGoogle Scholar
  36. Piyush Sao and Richard Vuduc. 2013. Self-stabilizing iterative solvers. In Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA). Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Osman Sarood, Esteban Meneses, and Laxmikant V. Kale. 2013. A ‘cool’ way of improving the reliability of HPC machines. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC). 58:1--58:12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Manu Shantharam, Sowmyalatha Srinivasmurthy, and Padma Raghavan. 2012. Fault tolerant preconditioned conjugate gradient for sparse linear system solution. In Proceedings of the ACM International Conference on Supercomputing (ICS). 69--78. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Sam Toueg and Özalp Babaoglu. 1984. On the optimum checkpoint selection problem. SIAM Journal on Computing 13, 3 (1984), 630--649. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Frances Yao, Alan Demers, and Scott Shenker. 1995. A scheduling model for reduced CPU energy. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science (FOCS). 374. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. John W. Young. 1974. A first order approximation to the optimum checkpoint interval. Communications of the ACM 17, 9 (1974), 530--531. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Baoxian Zhao, Hakan Aydin, and Dakai Zhu. 2008. Reliability-aware dynamic voltage scaling for energy-constrained real-time embedded systems. In Proceedings of the IEEE International Conference on Computer Design (ICCD). 633--639.Google ScholarGoogle ScholarCross RefCross Ref
  43. Dakai Zhu, R. Melhem, and D. Mosse. 2004. The effects of energy management on reliability in real-time embedded systems. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD). 35--40. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, T. J. O’Gorman, and J. M. Ross. 1996b. Accelerated testing for cosmic soft-error rate. IBM Journal of Research and Development 40, 1 (1996), 51--72. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. J. F. Ziegler, M. E. Nelson, J. D. Shell, R. J. Peterson, C. J. Gelderloos, H. P. Muhlfeld, and C. J. Montrose. 1998. Cosmic ray soft error rates of 16-Mb DRAM memory chips. IEEE Journal of Solid-State Circuits 33, 2 (1998), 246--252.Google ScholarGoogle ScholarCross RefCross Ref
  46. J. F. Ziegler, H. W. Curtis, H. P. Muhlfeld, C. J. Montrose, and B. Chin. 1996a. IBM experiments in soft fails in computer electronics. IBM Journal of Research and Development 40, 1 (1996), 3--18. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image ACM Transactions on Parallel Computing
          ACM Transactions on Parallel Computing  Volume 3, Issue 2
          August 2016
          154 pages
          ISSN:2329-4949
          EISSN:2329-4957
          DOI:10.1145/2974644
          Issue’s Table of Contents

          Copyright © 2016 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 20 July 2016
          • Accepted: 1 February 2016
          • Revised: 1 January 2016
          • Received: 1 December 2014
          Published in topc Volume 3, Issue 2

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article
          • Research
          • Refereed

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader