DOI: 10.1145/1854273.1854289
research-article

DAFT: decoupled acyclic fault tolerance

Published: 11 September 2010

ABSTRACT

Higher transistor counts, lower voltage levels, and reduced noise margin increase the susceptibility of multicore processors to transient faults. Redundant hardware modules can detect such errors, but software transient fault detection techniques are more appealing for their low cost and flexibility. Recent software proposals double register pressure or memory usage, or are too slow in the absence of hardware extensions, preventing widespread acceptance. This paper presents DAFT, a fast, safe, and memory-efficient transient fault detection framework for commodity multicore systems. DAFT replicates computation across multiple cores and schedules fault detection off the critical path. Where possible, values are speculated to be correct and only communicated to the redundant thread at essential program points. DAFT is implemented in the LLVM compiler framework and evaluated using SPEC CPU2000 and SPEC CPU2006 benchmarks on a commodity multicore system. Results demonstrate DAFT's high performance and broad fault coverage. Speculation allows DAFT to reduce the performance overhead of software redundant multithreading from an average of 200% to 38% with no degradation of fault coverage.
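
The general idea of software redundant multithreading with one-way communication can be illustrated with a toy example. The following is a minimal sketch under stated assumptions, not DAFT's LLVM implementation: the function names `leading`, `trailing`, and `run_redundant`, the running-sum workload, and the `flip_at` fault-injection parameter are all hypothetical and chosen only for illustration. The leading thread computes and forwards values without ever waiting for a reply, and the trailing thread recomputes the same values and compares them off the leading thread's critical path. Python threads do not execute in parallel on separate cores, so this shows the structure only, not the performance.

```python
import queue
import threading


def leading(data, out, flip_at=None):
    """Original computation: running sum of squares, forwarded one way."""
    total = 0
    for i, x in enumerate(data):
        total += x * x
        if i == flip_at:
            total ^= 1 << 3  # hypothetical injected fault: flip one bit
        out.put(total)  # never waits for an answer from the checker
    return total


def trailing(data, inp):
    """Redundant computation: recompute and compare with forwarded values.

    Returns the index of the first mismatch, or None if all values agree.
    """
    total = 0
    for i, x in enumerate(data):
        total += x * x
        if inp.get() != total:
            return i
    return None


def run_redundant(data, flip_at=None):
    """Run both copies and report where (if anywhere) they diverged."""
    channel = queue.Queue()  # unbounded, so the leading thread never blocks
    result = {}

    def check():
        result["fault_at"] = trailing(data, channel)

    checker = threading.Thread(target=check)
    checker.start()
    leading(data, channel, flip_at)
    checker.join()
    return result["fault_at"]


if __name__ == "__main__":
    print(run_redundant([1, 2, 3, 4]))             # no fault injected
    print(run_redundant([1, 2, 3, 4], flip_at=2))  # fault at index 2
```

Because data flows only from the leading thread to the trailing thread, the original computation is not delayed by the comparison; a mismatch is reported by the checker after the fact.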

    Published in

      PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques
      September 2010
      596 pages
      ISBN: 9781450301787
      DOI: 10.1145/1854273

      Copyright © 2010 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 121 of 471 submissions, 26%
