DOI: 10.1145/1629395.1629432
research-article

Towards scalable reliability frameworks for error prone CMPs

Published: 11 October 2009

ABSTRACT

As technology scales and the energy of computation continually approaches thermal equilibrium [1,2], parameter variations and noise levels will lead to larger error rates at various levels of the computation stack. Error rates will be especially high for post-CMOS and nanoelectronic systems, as well as for probabilistic [3] and stochastic [4] architectures. N-modular redundancy (NMR) at the core level has been proposed as a way to attain system reliability goals for multicore architectures. While core-level DMR and TMR have been shown to be effective when errors are rare, a large amount of core-level redundancy will be required to attain system reliability goals in the face of high error rates. This makes voting latency and bandwidth significant performance bottlenecks for such systems. In this paper, we present a scalable NMR framework for error-prone chip multiprocessors (CMPs). The framework supports in-network fault tolerance, where voting logic is integrated into routers to allow for truly distributed voting. The in-network fault tolerance router exploits the expected redundancy in vote messages to reduce some of the blocking overhead incurred at the leader, and also provides a mechanism to trade off network bandwidth against latency. Our framework also supports proactive checkpoint deallocation, which allows cores participating in voting to continue execution instead of waiting for notification from the voting logic. Finally, the framework supports dynamic constitution, which allows an arbitrary core on the chip to be part of an NMR group. This allows bypassing faulty cores as well as scheduling for performance. Our experiments show significant performance and bandwidth benefits from these optimizations.
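The abstract does not give the router or voter design, so the following is only a minimal sketch of the general idea, under stated assumptions: cores in an NMR group send a vote carrying a fingerprint of their state, an intermediate router merges identical votes into one counted message, and a leader accepts a value only with a strict majority. The `Vote` fields, `combine_at_router`, `majority`, and `nmr_success_probability` are hypothetical names introduced for illustration, and the reliability estimate assumes independent core failures and a fault-free voter.

```python
from collections import Counter
from dataclasses import dataclass
from math import comb


def nmr_success_probability(n: int, p_core_ok: float) -> float:
    """Probability that a strict majority of n cores is correct.

    Assumes independent cores and a perfect voter (illustrative model only).
    """
    need = n // 2 + 1
    return sum(
        comb(n, k) * p_core_ok**k * (1 - p_core_ok) ** (n - k)
        for k in range(need, n + 1)
    )


@dataclass(frozen=True)
class Vote:
    group: int      # NMR group id (hypothetical field)
    epoch: int      # checkpoint interval being voted on (hypothetical field)
    digest: int     # fingerprint of the core's state or output
    count: int = 1  # number of cores this message stands for


def combine_at_router(votes):
    """Merge votes with the same (group, epoch, digest) into one message.

    Identical votes are the common case when errors are rare relative to
    the group size, so fewer messages travel on toward the leader.
    """
    merged = Counter()
    for v in votes:
        merged[(v.group, v.epoch, v.digest)] += v.count
    return [Vote(g, e, d, c) for (g, e, d), c in merged.items()]


def majority(votes, n):
    """Return the digest backed by more than n/2 cores, or None."""
    tally = Counter()
    for v in votes:
        tally[v.digest] += v.count
    if not tally:
        return None
    digest, count = tally.most_common(1)[0]
    return digest if count > n // 2 else None
```

For example, five votes from one group in which four cores agree collapse into two messages at the router, and the leader reaches the same decision from the merged messages as it would from the original five. How long a router should wait to collect votes before forwarding is the kind of bandwidth-versus-latency knob the abstract mentions; this sketch does not model it.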

References

  1. International Technology Roadmap for Semiconductors 2005, http://public.itrs.net.
  2. C. Constantinescu, "Trends and challenges in VLSI circuit reliability," Micro, IEEE, vol. 23, no. 4, pp. 14–19, July-Aug. 2003.
  3. L. N. Chakrapani, P. Korkmaz, B. E. S. Akgul, and K. V. Palem, "Probabilistic system-on-a-chip architectures," ACM Trans. Des. Autom. Electron. Syst., vol. 12, no. 3, pp. 1–28, 2007.
  4. Stochastic Processors (or processors that do not always compute correctly by design), NSF Workshop on Science of Power Management. [Online]. Available: http://scipm.cs.vt.edu/Slides/2.RakeshKumar.pdf
  5. P. Shivakumar, M. Kistler, S. Keckler, D. Burger, and L. Alvisi, "Modeling the effect of technology trends on the soft error rate of combinational logic," 2002, pp. 389–398.
  6. S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin, "A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor," in MICRO 36: Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture. Washington, DC, USA: IEEE Computer Society, 2003, p. 29.
  7. N. J. Wang, J. Quek, T. M. Rafacz, and S. J. Patel, "Characterizing the effects of transient faults on a high-performance processor pipeline," in DSN '04: Proceedings of the 2004 International Conference on Dependable Systems and Networks. Washington, DC, USA: IEEE Computer Society, 2004, p. 61.
  8. A. Ionescu, "New functionality and ultra low power: key opportunities for post-CMOS era," April 2008, pp. 72–73.
  9. K. Tsukagoshi, N. Yoneya, S. Uryu, Y. Aoyagi, A. Kanda, Y. Ootuka, and B. W. Alphenaar, "Carbon nanotube devices for nanoelectronics," Physica B: Condensed Matter, vol. 323, no. 1–4, pp. 107–114, 2002, Proceedings of the Tsukuba Symposium on Carbon Nanotube in Commemoration of the 10th Anniversary of its Discovery.
  10. A. van Roosmalen and G. Zhang, "Reliability challenges in the nanoelectronics era," Microelectronics and Reliability, vol. 46, no. 9–11, pp. 1403–1414, 2006, Proceedings of the 17th European Symposium on Reliability of Electron Devices, Failure Physics and Analysis. Wuppertal, Germany, 3rd-6th October 2006.
  11. Predictive Technology Model, Arizona State University, School of Engineering. [Online]. Available: http://www.eas.asu.edu/~ptm/
  12. B. C. Paul, S. Fujita, M. Okajima, and T. Lee, "Modeling and analysis of circuit performance of ballistic CNFET," in DAC '06: Proceedings of the 43rd annual conference on Design automation. New York, NY, USA: ACM, 2006, pp. 717–722.
  13. M. L. Shooman, Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design. New York, NY, USA: John Wiley & Sons, Inc., 2002.
  14. D. Siewiorek, V. Kini, H. Mashburn, S. McConnel, and M. Tsao, "A case study of C.mmp, Cm*, and C.vmp: Part I - Experiences with fault tolerance in multiprocessor systems," Proceedings of the IEEE, vol. 66, no. 10, pp. 1178–1199, Oct. 1978.
  15. D. Avresky, S. Geoghegan, and Y. Varoglu, "Evaluation of software-implemented fault-tolerance (SIFT) approach in gracefully degradable multi-computer systems," Reliability, IEEE Transactions on, vol. 55, no. 3, pp. 451–457, Sept. 2006.
  16. A. L. Hopkins, Jr., T. B. Smith, III, and J. Lala, "FTMP - A highly reliable fault-tolerant multiprocessor for aircraft," Proceedings of the IEEE, vol. 66, no. 10, pp. 1221–1239, Oct. 1978.
  17. T. M. Austin, "DIVA: A dynamic approach to microprocessor verification," Journal of Instruction-Level Parallelism, vol. 2, p. 2000, 2000.
  18. D. Jewett, "Integrity S2: a fault-tolerant UNIX platform," Fault-Tolerant Computing, 1991. FTCS-21. Digest of Papers., Twenty-First International Symposium, pp. 512–519, Jun 1991.
  19. P. N. Sanda, J. W. Kellington, P. Kudva, R. Kalla, R. B. McBeth, J. Ackaret, R. Lockwood, J. Schumann, and C. R. Jones, "Soft-error resilience of the IBM POWER6 processor," IBM Journal of Research and Development, vol. 52, no. 3, pp. 275–284, 2008.
  20. J. C. Smolens, B. T. Gold, B. Falsafi, and J. C. Hoe, "Reunion: Complexity-effective multicore redundancy," Microarchitecture, IEEE/ACM International Symposium on, vol. 0, pp. 223–234, 2006.
  21. C. LaFrieda, E. Ipek, J. F. Martinez, and R. Manohar, "Utilizing dynamically coupled cores to form a resilient chip multiprocessor," in DSN '07: Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. Washington, DC, USA: IEEE Computer Society, 2007, pp. 317–326.
  22. A. Golander, S. Weiss, and R. Ronen, "DDMR: Dynamic and scalable dual modular redundancy with short validation intervals," Computer Architecture Letters, vol. 7, no. 2, pp. 65–68, Feb. 2008.
  23. D. Sanchez, J. L. Aragón, and J. M. Garcia, "Evaluating dynamic core coupling in a scalable tiled-CMP architecture," in Proc. of the 7th Int. Workshop on Duplicating, Deconstructing, and Debunking (WDDD), in conjunction with ISCA'08, Jun 2008.
  24. A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir, "The NYU Ultracomputer - designing a MIMD, shared-memory parallel machine," in ISCA '98: 25 years of the international symposia on Computer architecture (selected papers). New York, NY, USA: ACM, 1998, pp. 239–254.
  25. A. Shye, T. Moseley, V. J. Reddi, J. Blomstedt, and D. A. Connors, "Using process-level redundancy to exploit multiple cores for transient fault tolerance," in DSN '07: Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. Washington, DC, USA: IEEE Computer Society, 2007, pp. 297–306.
  26. The M5 Simulator System, University of Michigan, http://www.m5sim.org/wiki/index.php/mainpage

        • Published in

          CASES '09: Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systems
          October 2009
          298 pages
          ISBN:9781605586267
          DOI:10.1145/1629395

          Copyright © 2009 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Acceptance Rates

          Overall Acceptance Rate: 52 of 230 submissions, 23%
