DOI: 10.1145/2897937.2897996
research-article
Public Access

CLEAR: Cross-Layer Exploration for Architecting Resilience - Combining hardware and software techniques to tolerate soft errors in processor cores

Published: 05 June 2016

Abstract

We present a first-of-its-kind framework that overcomes a major challenge in the design of digital systems that are resilient to reliability failures: achieving desired resilience targets at minimal costs (energy, power, execution time, area) by combining resilience techniques across various layers of the system stack (circuit, logic, architecture, software, algorithm). This approach is referred to as cross-layer resilience. In this paper, we focus on radiation-induced soft errors in processor cores. We address both single-event upsets (SEUs) and single-event multiple upsets (SEMUs) in terrestrial environments. Our framework automatically and systematically explores the large space of comprehensive resilience techniques and their combinations across various layers of the system stack (798 cross-layer combinations in this paper), derives cost-effective solutions that achieve resilience targets at minimal costs, and provides guidelines for the design of new resilience techniques. We demonstrate the practicality and effectiveness of our framework using two diverse designs: a simple, in-order processor core and a complex, out-of-order processor core. Our results demonstrate that a carefully optimized combination of circuit-level hardening, logic-level parity checking, and micro-architectural recovery provides a highly cost-effective soft error resilience solution for general-purpose processor cores. For example, a 50× improvement in silent data corruption rate is achieved at only 2.1% energy cost for an out-of-order core (6.1% for an in-order core) with no speed impact. However, selective circuit-level hardening alone, guided by a thorough analysis of the effects of soft errors on application benchmarks, also provides a cost-effective soft error resilience solution (with ~1% additional energy cost for a 50× improvement in silent data corruption rate).
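The exploration step described in the abstract (enumerate combinations of techniques, keep those that meet a resilience target, pick the cheapest) can be illustrated with a toy sketch. This is not the CLEAR framework itself: the technique names, the improvement factors, the energy costs, and the simplifying assumption that improvements multiply while costs add are all hypothetical placeholders chosen for illustration, not values or models from the paper.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Technique:
    """One resilience technique. All numbers used below are placeholders."""
    name: str
    layer: str
    sdc_improvement: float  # assumed multiplicative improvement in SDC rate
    energy_cost_pct: float  # assumed additive energy cost, in percent


# Hypothetical catalogue: these values are NOT taken from the paper.
CATALOGUE = [
    Technique("hardening", "circuit", 10.0, 1.5),
    Technique("parity", "logic", 8.0, 1.0),
    Technique("software_checks", "software", 4.0, 20.0),
]


def cheapest_combination(
    techniques: Sequence[Technique], target: float
) -> Optional[Tuple[Tuple[Technique, ...], float]]:
    """Brute-force search for the lowest-cost subset meeting the target.

    Assumes (naively) that improvements of combined techniques multiply
    and that their energy costs add. Returns None if no subset qualifies.
    """
    best = None
    for size in range(1, len(techniques) + 1):
        for combo in combinations(techniques, size):
            improvement = 1.0
            for t in combo:
                improvement *= t.sdc_improvement
            cost = sum(t.energy_cost_pct for t in combo)
            if improvement >= target and (best is None or cost < best[1]):
                best = (combo, cost)
    return best


if __name__ == "__main__":
    result = cheapest_combination(CATALOGUE, target=50.0)
    if result is not None:
        combo, cost = result
        print([t.name for t in combo], cost)
```

Exhaustive enumeration is workable here only because the toy catalogue is tiny. A real exploration would have to measure the resilience and cost of each combination on the actual design rather than compose per-technique numbers, since techniques applied together can overlap in what they protect.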

        Published In

        DAC '16: Proceedings of the 53rd Annual Design Automation Conference
        June 2016
        1048 pages
        ISBN:9781450342360
        DOI:10.1145/2897937
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. cross-layer resilience
        2. soft errors

        Qualifiers

        • Research-article

        Conference

        DAC '16

        Acceptance Rates

        Overall Acceptance Rate 1,770 of 5,499 submissions, 32%

        Cited By

        • (2024) Cross-layer Bayesian Network for UAV Health Monitoring. 2024 2nd International Conference on Unmanned Vehicle Systems-Oman (UVS), pp. 1-7. DOI: 10.1109/UVS59630.2024.10467174. Online publication date: 12-Feb-2024
        • (2024) Generic Soft Error Data and Control Flow Error Detection by Instruction Duplication. IEEE Transactions on Dependable and Secure Computing, 21:1, pp. 78-92. DOI: 10.1109/TDSC.2023.3245842. Online publication date: Jan-2024
        • (2024) Silent Data Corruption in Robot Operating System: A Case for End-to-End System-Level Fault Analysis Using Autonomous UAVs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 43:4, pp. 1037-1050. DOI: 10.1109/TCAD.2023.3332293. Online publication date: Apr-2024
        • (2024) Fault Tolerant Architectures. Handbook of Computer Architecture, pp. 277-320. DOI: 10.1007/978-981-97-9314-3_11. Online publication date: 21-Dec-2024
        • (2023) Assessing Convolutional Neural Networks Reliability through Statistical Fault Injections. 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1-6. DOI: 10.23919/DATE56975.2023.10136998. Online publication date: Apr-2023
        • (2023) Understanding and Mitigating Hardware Failures in Deep Learning Training Systems. Proceedings of the 50th Annual International Symposium on Computer Architecture, pp. 1-16. DOI: 10.1145/3579371.3589105. Online publication date: 17-Jun-2023
        • (2023) Resiliency in Digital Processing Systems. 2023 IEEE 33rd International Conference on Microelectronics (MIEL), pp. 1-8. DOI: 10.1109/MIEL58498.2023.10315819. Online publication date: 16-Oct-2023
        • (2023) Holistic IJTAG-based External and Internal Fault Monitoring in UAVs. 2023 IEEE 24th Latin American Test Symposium (LATS), pp. 1-6. DOI: 10.1109/LATS58125.2023.10154489. Online publication date: 21-Mar-2023
        • (2023) Characterizing Runtime Performance Variation in Error Detection by Duplicating Instructions. 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), pp. 730-741. DOI: 10.1109/ISSRE59848.2023.00043. Online publication date: 9-Oct-2023
        • (2023) Multicore soft error reliability assessment and evaluation of Compiler optimization flag Effects on ARMv7 and ARMv8. 2023 IEEE 20th India Council International Conference (INDICON), pp. 555-560. DOI: 10.1109/INDICON59947.2023.10440741. Online publication date: 14-Dec-2023
