DOI: 10.1145/3086157.3086163
research-article

Towards New Metrics for High-Performance Computing Resilience

Published: 26 June 2017

ABSTRACT

Ensuring the reliability of applications is becoming an increasingly important challenge as high-performance computing (HPC) systems experience an ever-growing number of faults, errors and failures. While the HPC community has made substantial progress in developing various resilience solutions, it continues to rely on platform-based metrics to quantify application resiliency improvements. The resilience of an HPC application is concerned with the reliability of the application outcome as well as the fault handling efficiency. To understand the scope of impact, effective coverage and performance efficiency of existing and emerging resilience solutions, there is a need for new metrics. In this paper, we develop new ways to quantify resilience that consider both the reliability and the performance characteristics of the solutions from the perspective of HPC applications. As HPC systems continue to evolve in terms of scale and complexity, it is expected that applications will experience various types of faults, errors and failures, which will require applications to apply multiple resilience solutions across the system stack. The proposed metrics are intended to be useful for understanding the combined impact of these solutions on an application's ability to produce correct results and to evaluate their overall impact on an application's performance in the presence of various modes of faults.


              Published in

              FTXS '17: Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale
              June 2017
              46 pages
              ISBN: 9781450350013
              DOI: 10.1145/3086157

              Copyright © 2017 ACM

              Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

              Publisher

              Association for Computing Machinery

              New York, NY, United States


              Acceptance Rates

              Overall Acceptance Rate: 16 of 25 submissions, 64%
