ABSTRACT
As circuit geometries continue to shrink while supply voltages remain relatively constant, circuit wearout becomes a growing concern. We propose that the relative reliability of a processor's circuits be exposed to the operating system and managed by a credit-based wearout monitor. This wearout monitor receives dynamic updates on circuit reliability from stability detector circuits that are small enough to be widely deployed. We find that by combining the wearout monitor with stability detectors, we can efficiently and accurately manage the reliability of a processor and recoup performance that would otherwise be lost when processors are over-provisioned to meet an expected lifetime. We simulate a 16-core DSP equipped with a wearout monitor and stability detectors running a mix of four different media algorithms. With the wearout monitor and stability detectors, reducing average performance by only 5% increases the lifetime of the processor by 46%.
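The abstract does not spell out the monitor's accounting rules, so the following is only a minimal Python sketch of how a credit-based wearout monitor could work, under assumptions of our own. Every name here (`WearoutMonitor`, `report_stability`, `end_epoch`, `may_run_full_speed`, `best_core`) is hypothetical. We assume that each core earns a fixed wear budget per epoch, sized so that a core running at full utilization with no observed degradation exactly meets its target lifetime. We also assume that the budget spent grows linearly with utilization and is scaled up by the fraction of stability checks that flagged late transitions. A core whose credit balance goes negative is treated as needing throttling, and the operating system can steer work toward the core with the most credit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CoreCredits:
    """Hypothetical per-core wear account kept by a credit-based monitor."""
    core_id: int
    credits: float            # remaining wear budget (arbitrary units)
    earn_rate: float          # credits accrued per epoch
    degradation: float = 0.0  # fraction of recent stability checks that failed


class WearoutMonitor:
    """Sketch of a credit-based wearout monitor (assumed model, not the paper's)."""

    def __init__(self, num_cores: int, lifetime_epochs: int, initial_credits: float = 0.0):
        # Assumption: a nominal wear budget of 1.0 per core, spread evenly
        # over the target lifetime, so each epoch earns 1 / lifetime_epochs.
        rate = 1.0 / lifetime_epochs
        self.cores = [CoreCredits(i, initial_credits, rate) for i in range(num_cores)]

    def report_stability(self, core_id: int, late_transitions: int, samples: int) -> None:
        # A stability detector flags a signal that is still toggling when it
        # should already be stable; the failing fraction is our wear indicator.
        self.cores[core_id].degradation = late_transitions / samples if samples else 0.0

    def end_epoch(self, activity: List[float]) -> None:
        # activity[i] in [0, 1] is the utilization of core i during the epoch.
        if len(activity) != len(self.cores):
            raise ValueError("need one activity value per core")
        for core, a in zip(self.cores, activity):
            core.credits += core.earn_rate
            # Assumed stress model: spending scales with utilization and is
            # inflated by the degradation the stability detectors observed.
            core.credits -= a * core.earn_rate * (1.0 + core.degradation)

    def may_run_full_speed(self, core_id: int) -> bool:
        # A non-negative balance means the core is on track for its lifetime.
        return self.cores[core_id].credits >= 0.0

    def best_core(self) -> int:
        # What an OS scheduler might query: the core with the most spare credit.
        return max(self.cores, key=lambda c: c.credits).core_id
```

For example, with two cores and a 100-epoch target lifetime, if core 1's detectors report 10 late transitions out of 100 samples and both cores then run at full utilization for one epoch, core 0 breaks even while core 1 goes into deficit. The scheduler would prefer core 0, and an idle epoch on core 1 restores its balance. The linear stress model is chosen only for readability; a real monitor would calibrate spending against device wearout models.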
Index Terms
- Credit-based dynamic reliability management using online wearout detection