ABSTRACT
Higher transistor counts, lower voltage levels, and reduced noise margin increase the susceptibility of multicore processors to transient faults. Redundant hardware modules can detect such errors, but software transient fault detection techniques are more appealing for their low cost and flexibility. Recent software proposals double register pressure or memory usage, or are too slow in the absence of hardware extensions, preventing widespread acceptance. This paper presents DAFT, a fast, safe, and memory efficient transient fault detection framework for commodity multicore systems. DAFT replicates computation across multiple cores and schedules fault detection off the critical path. Where possible, values are speculated to be correct and only communicated to the redundant thread at essential program points. DAFT is implemented in the LLVM compiler framework and evaluated using SPEC CPU2000 and SPEC CPU2006 benchmarks on a commodity multicore system. Results demonstrate DAFT's high performance and broad fault coverage. Speculation allows DAFT to reduce the performance overhead of software redundant multithreading from an average of 200% to 38% with no degradation of fault coverage.
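The general shape of software redundant multithreading can be illustrated with a small sketch. This is not DAFT's implementation, which is a compiler transformation in LLVM; it is a hand-written toy under stated assumptions. The function name `run_redundant`, the sum-of-squares workload, the message tags, and the `inject_fault` switch are all hypothetical. A leading thread does the real work and forwards values over a one-way channel without waiting for a reply, while a trailing thread recomputes the result and compares it only at the point where the value would leave the computation. Python threads do not run on separate cores in the standard interpreter, so the sketch shows the structure only, not the performance behavior.

```python
import queue
import threading


def run_redundant(data, inject_fault=False):
    """Compute a sum of squares on two threads; return (result, fault_detected)."""
    channel = queue.Queue()  # one-way channel: leading thread -> trailing thread
    result = []
    mismatches = []

    def leading():
        acc = 0
        for i, x in enumerate(data):
            channel.put(("value", x))  # forward the input, never wait for a reply
            acc += x * x
            if inject_fault and i == 0:
                acc ^= 1  # simulated single-bit transient fault (assumption)
        channel.put(("check", acc))  # the one point where a comparison is requested
        result.append(acc)

    def trailing():
        acc = 0
        while True:
            kind, payload = channel.get()
            if kind == "value":
                acc += payload * payload  # redundant recomputation
            else:
                if acc != payload:  # detection happens off the leading thread's path
                    mismatches.append((payload, acc))
                return

    threads = [threading.Thread(target=leading), threading.Thread(target=trailing)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0], bool(mismatches)
```

In this toy, the only communication is from the leading thread to the trailing thread, so the checker can never stall the main computation. A mismatch is reported after the fact rather than prevented.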
Index Terms
- DAFT: decoupled acyclic fault tolerance