ABSTRACT
Ensuring the reliability of applications is becoming an increasingly important challenge as high-performance computing (HPC) systems experience an ever-growing number of faults, errors and failures. While the HPC community has made substantial progress in developing various resilience solutions, it continues to rely on platform-based metrics to quantify application resiliency improvements. The resilience of an HPC application is concerned with the reliability of the application outcome as well as the fault handling efficiency. To understand the scope of impact, effective coverage and performance efficiency of existing and emerging resilience solutions, there is a need for new metrics. In this paper, we develop new ways to quantify resilience that consider both the reliability and the performance characteristics of the solutions from the perspective of HPC applications. As HPC systems continue to evolve in terms of scale and complexity, it is expected that applications will experience various types of faults, errors and failures, which will require applications to apply multiple resilience solutions across the system stack. The proposed metrics are intended to be useful for understanding the combined impact of these solutions on an application's ability to produce correct results and to evaluate their overall impact on an application's performance in the presence of various modes of faults.
Towards New Metrics for High-Performance Computing Resilience