ABSTRACT
As technology scales and the energy of computation continually approaches thermal equilibrium [1,2], parameter variations and noise levels will lead to larger error rates at various levels of the computation stack. The error rates will be especially high for post-CMOS and nanoelectronic systems as well as for probabilistic [3] and stochastic architectures [4]. N-modular redundancy (NMR) at the core level has been proposed as a way to attain system reliability goals for multicore architectures. While core-level DMR and TMR have been shown to be effective when errors are rare, a large amount of core-level redundancy will be required to attain system reliability goals in the face of high error rates. This makes voting latency and bandwidth significant performance bottlenecks for such systems. In this paper, we present a scalable NMR framework for error-prone chip multiprocessors (CMPs). The framework supports in-network fault tolerance, in which voting logic is integrated into routers to allow for truly distributed voting. The in-network fault tolerance router exploits the expected redundancy in vote messages to reduce some of the blocking overhead incurred at the leader, and also provides a mechanism to trade off network bandwidth against latency. Our framework also supports proactive checkpoint deallocation, which allows cores participating in voting to continue execution instead of waiting for notification from the voting logic. Finally, the framework supports dynamic constitution, which allows an arbitrary core on the chip to be part of an NMR group. This allows bypassing faulty cores as well as scheduling for performance. Our experiments show significant performance and bandwidth benefits from these optimizations.
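To make the redundancy argument concrete, the following minimal sketch shows plain majority voting over replica results and the textbook probability that a strict majority of N replicas is correct. It is not the framework's in-network voting logic: the function names are hypothetical, and the model assumes independent replica errors and a fault-free voter.

```python
from collections import Counter
from math import comb


def majority_vote(results):
    """Return the value reported by a strict majority of replicas, else None."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) // 2 else None


def nmr_success_probability(n, p_error):
    """Probability that a strict majority of n replicas is correct.

    Assumes each replica errs independently with probability p_error
    and that the voter itself never fails (illustrative assumptions).
    """
    p_ok = 1.0 - p_error
    needed = n // 2 + 1  # smallest strict majority
    return sum(
        comb(n, k) * p_ok**k * p_error ** (n - k)
        for k in range(needed, n + 1)
    )


if __name__ == "__main__":
    print(majority_vote([7, 7, 3]))  # two of three replicas agree on 7
    for n in (1, 3, 5, 7):
        print(n, nmr_success_probability(n, 0.1))
```

Under these assumptions, a per-replica error probability of 0.1 gives TMR a success probability of 0.972, and larger odd N is needed to push that figure higher, which is the regime where voting latency and bandwidth start to matter.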
- International Technology Roadmap for Semiconductors 2005, http://public.itrs.net.
- C. Constantinescu, "Trends and challenges in VLSI circuit reliability," Micro, IEEE, vol. 23, no. 4, pp. 14--19, July-Aug. 2003.
- L. N. Chakrapani, P. Korkmaz, B. E. S. Akgul, and K. V. Palem, "Probabilistic system-on-a-chip architectures," ACM Trans. Des. Autom. Electron. Syst., vol. 12, no. 3, pp. 1--28, 2007.
- Stochastic Processors (or processors that do not always compute correctly by design), NSF Workshop on Science of Power Management. [Online]. Available: http://scipm.cs.vt.edu/Slides/2.RakeshKumar.pdf
- P. Shivakumar, M. Kistler, S. Keckler, D. Burger, and L. Alvisi, "Modeling the effect of technology trends on the soft error rate of combinational logic," 2002, pp. 389--398.
- S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin, "A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor," in MICRO 36: Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture. Washington, DC, USA: IEEE Computer Society, 2003, p. 29.
- N. J. Wang, J. Quek, T. M. Rafacz, and S. J. Patel, "Characterizing the effects of transient faults on a high-performance processor pipeline," in DSN '04: Proceedings of the 2004 International Conference on Dependable Systems and Networks. Washington, DC, USA: IEEE Computer Society, 2004, p. 61.
- A. Ionescu, "New functionality and ultra low power: key opportunities for post-CMOS era," April 2008, pp. 72--73.
- K. Tsukagoshi, N. Yoneya, S. Uryu, Y. Aoyagi, A. Kanda, Y. Ootuka, and B. W. Alphenaar, "Carbon nanotube devices for nanoelectronics," Physica B: Condensed Matter, vol. 323, no. 1--4, pp. 107--114, 2002, proceedings of the Tsukuba Symposium on Carbon Nanotube in Commemoration of the 10th Anniversary of its Discovery.
- A. van Roosmalen and G. Zhang, "Reliability challenges in the nanoelectronics era," Microelectronics and Reliability, vol. 46, no. 9--11, pp. 1403--1414, 2006, proceedings of the 17th European Symposium on Reliability of Electron Devices, Failure Physics and Analysis. Wuppertal, Germany 3rd-6th October 2006.
- Predictive Technology Model, Arizona State University, School of Engineering. [Online]. Available: http://www.eas.asu.edu/~ptm/
- B. C. Paul, S. Fujita, M. Okajima, and T. Lee, "Modeling and analysis of circuit performance of ballistic CNFET," in DAC '06: Proceedings of the 43rd annual conference on Design automation. New York, NY, USA: ACM, 2006, pp. 717--722.
- M. L. Shooman, Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design. New York, NY, USA: John Wiley & Sons, Inc., 2002.
- D. Siewiorek, V. Kini, H. Mashburn, S. McConnel, and M. Tsao, "A case study of C.mmp, Cm*, and C.vmp: Part I - Experiences with fault tolerance in multiprocessor systems," Proceedings of the IEEE, vol. 66, no. 10, pp. 1178--1199, Oct. 1978.
- D. Avresky, S. Geoghegan, and Y. Varoglu, "Evaluation of software-implemented fault-tolerance (SIFT) approach in gracefully degradable multi-computer systems," Reliability, IEEE Transactions on, vol. 55, no. 3, pp. 451--457, Sept. 2006.
- A. L. Hopkins, Jr., T. B. Smith, III, and J. Lala, "FTMP - a highly reliable fault-tolerant multiprocessor for aircraft," Proceedings of the IEEE, vol. 66, no. 10, pp. 1221--1239, Oct. 1978.
- T. M. Austin, "DIVA: A dynamic approach to microprocessor verification," Journal of Instruction-Level Parallelism, vol. 2, p. 2000, 2000.
- D. Jewett, "Integrity S2: a fault-tolerant Unix platform," Fault-Tolerant Computing, 1991. FTCS-21. Digest of Papers., Twenty-First International Symposium, pp. 512--519, Jun 1991.
- P. N. Sanda, J. W. Kellington, P. Kudva, R. Kalla, R. B. McBeth, J. Ackaret, R. Lockwood, J. Schumann, and C. R. Jones, "Soft-error resilience of the IBM POWER6 processor," IBM Journal of Research and Development, vol. 52, no. 3, pp. 275--284, 2008.
- J. C. Smolens, B. T. Gold, B. Falsafi, and J. C. Hoe, "Reunion: Complexity-effective multicore redundancy," Microarchitecture, IEEE/ACM International Symposium on, vol. 0, pp. 223--234, 2006.
- C. LaFrieda, E. Ipek, J. F. Martinez, and R. Manohar, "Utilizing dynamically coupled cores to form a resilient chip multiprocessor," in DSN '07: Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. Washington, DC, USA: IEEE Computer Society, 2007, pp. 317--326.
- A. Golander, S. Weiss, and R. Ronen, "DDMR: Dynamic and scalable dual modular redundancy with short validation intervals," Computer Architecture Letters, vol. 7, no. 2, pp. 65--68, Feb. 2008.
- D. Sanchez, J. L. Aragón, and J. M. Garcia, "Evaluating dynamic core coupling in a scalable tiled-CMP architecture," in Proc. of the 7th Int. Workshop on Duplicating, Deconstructing, and Debunking (WDDD), in conjunction with ISCA'08, Jun 2008.
- A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir, "The NYU Ultracomputer - designing a MIMD, shared-memory parallel machine," in ISCA '98: 25 years of the international symposia on Computer architecture (selected papers). New York, NY, USA: ACM, 1998, pp. 239--254.
- A. Shye, T. Moseley, V. J. Reddi, J. Blomstedt, and D. A. Connors, "Using process-level redundancy to exploit multiple cores for transient fault tolerance," in DSN '07: Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. Washington, DC, USA: IEEE Computer Society, 2007, pp. 297--306.
- The M5 Simulator System, University of Michigan, http://www.m5sim.org/wiki/index.php/mainpage
Index Terms
- Towards scalable reliability frameworks for error prone CMPs