Open access

Software-controlled fault tolerance

Published: 01 December 2005

Abstract

Traditional fault-tolerance techniques typically use resources inefficiently because they cannot adapt to the changing reliability and performance demands of a system. This paper proposes software-controlled fault tolerance, a concept that allows designers and users to tailor the trade-off between performance and reliability to each situation. Several software-controllable fault-detection techniques are then presented: SWIFT, a software-only technique, and CRAFT, a suite of hybrid hardware/software techniques. Finally, the paper introduces PROFiT, a technique that adjusts the level of protection and performance at fine granularities through software control. When coupled with software-controllable techniques like SWIFT and CRAFT, PROFiT offers attractive and novel reliability options.
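
The abstract does not describe how any of these techniques is implemented, so the following is only a rough sketch of the general idea behind software-only detection with a software-controlled protection level: compute a value redundantly and compare the copies before the result is used. It is not the SWIFT, CRAFT, or PROFiT implementation, which work on compiled instructions rather than function calls. The names `checked_op`, `make_flaky_add`, and `FaultDetected`, and the `protect` flag, are hypothetical and chosen for illustration.

```python
from typing import Callable


class FaultDetected(Exception):
    """Raised when two redundant computations of a value disagree."""


def checked_op(op: Callable[[int, int], int], a: int, b: int,
               protect: bool = True) -> int:
    """Run op(a, b); if protect is set, recompute it and compare the copies.

    The protect flag stands in for a software-chosen protection level
    (an assumption for this sketch): unprotected code runs faster but
    lets a corrupted result through silently.
    """
    result = op(a, b)
    if protect:
        shadow = op(a, b)  # redundant copy of the computation
        if result != shadow:
            raise FaultDetected(f"mismatch: {result} != {shadow}")
    return result


def make_flaky_add(fault_on_call: int) -> Callable[[int, int], int]:
    """Return an adder that flips bit 3 of its result on one chosen call.

    This simulates a transient fault; fault_on_call=0 never injects one.
    """
    calls = 0

    def add(a: int, b: int) -> int:
        nonlocal calls
        calls += 1
        value = a + b
        if calls == fault_on_call:
            value ^= 1 << 3  # simulated single-bit flip
        return value

    return add


if __name__ == "__main__":
    print(checked_op(make_flaky_add(0), 2, 3))                 # fault-free run
    print(checked_op(make_flaky_add(1), 2, 3, protect=False))  # silent corruption
    try:
        checked_op(make_flaky_add(1), 2, 3, protect=True)
    except FaultDetected as err:
        print("detected:", err)
```

In this toy model the unprotected call returns a corrupted value without any indication, while the protected call pays for a second computation and raises on the mismatch. That cost-versus-coverage choice is the kind of trade-off the abstract says should be left to software.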

Published In

ACM Transactions on Architecture and Code Optimization  Volume 2, Issue 4
December 2005
116 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/1113841
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Software-controlled fault tolerance
  2. fault detection
  3. reliability

Qualifiers

  • Article
