ABSTRACT
Most server-grade memory systems provide Chipkill-Correct error protection at the cost of power and/or performance overhead. In this paper, we present low-overhead schemes that improve the reliability of commodity DRAM systems while delivering lower power and higher IPC than Chipkill-Correct solutions. Specifically, we propose two erasure and error correction (E-ECC) schemes for x8 memory systems that have 12.5% storage overhead and require no change to the existing memory architecture. Both schemes have superior error performance due to the use of a strong ECC code, namely RS(36,32) over GF(2^8). Scheme 1 activates 18 chips per access and provides stronger reliability than Chipkill-Correct solutions. If the location of the faulty chip is known, Scheme 1 can correct an additional random error in a second chip. Scheme 2 trades reliability for higher energy efficiency by activating only 9 chips per access. It cannot correct random errors due to a chip failure but can detect them with 99.9986% probability; once a chip is marked faulty due to persistent errors, it can correct all errors due to that chip. Synthesis results at a 28 nm node show that the RS(36,32) code has a very low decoding latency that can be well hidden in commodity memory systems and therefore has minimal effect on DRAM access latency. Evaluations based on SPEC CPU 2006 sequential and multi-programmed workloads show that, compared to Chipkill-Correct, the proposed Schemes 1 and 2 improve IPC by an average of 3.2% (maximum of 13.8%) and 4.8% (maximum of 31.8%) and reduce power consumption by an average of 16.2% (maximum of 25%) and 26.8% (maximum of 36%), respectively.
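The capabilities listed in the abstract are consistent with the standard Reed-Solomon bound, under which a code with n - k check symbols can correct e symbol errors and s erasures whenever 2e + s <= n - k. The sketch below checks that arithmetic for RS(36,32). It is only an illustration: the helper names are hypothetical, and it assumes (this is not stated in the abstract) that the 36 codeword symbols are spread evenly over the active chips, giving 2 symbols per chip for Scheme 1 and 4 for Scheme 2, and that the "additional random error" in Scheme 1 affects a single symbol.

```python
# Illustrative sketch only; not the authors' implementation.
N, K = 36, 32        # RS(36,32) over GF(2^8), as named in the abstract
PARITY = N - K       # 4 check symbols


def storage_overhead(n=N, k=K):
    """Fraction of extra storage relative to the data symbols."""
    return (n - k) / k


def symbols_per_chip(active_chips, n=N):
    # ASSUMPTION: codeword symbols are divided evenly among the active chips.
    return n // active_chips


def correctable(errors, erasures, parity=PARITY):
    """Standard RS bound: 2 * errors + erasures <= n - k."""
    return 2 * errors + erasures <= parity


s1 = symbols_per_chip(18)   # Scheme 1: 18 chips per access
s2 = symbols_per_chip(9)    # Scheme 2: 9 chips per access

# Scheme 1: a failed chip at an unknown location corrupts s1 symbols (errors).
scheme1_unknown_chip = correctable(errors=s1, erasures=0)
# Scheme 1: a known faulty chip (erasures) plus one random symbol error (assumed).
scheme1_known_plus_one = correctable(errors=1, erasures=s1)
# Scheme 2: a failed chip at an unknown location corrupts s2 symbols (errors).
scheme2_unknown_chip = correctable(errors=s2, erasures=0)
# Scheme 2: a chip already marked faulty is treated as s2 erasures.
scheme2_known_chip = correctable(errors=0, erasures=s2)

print(storage_overhead(), scheme1_unknown_chip, scheme1_known_plus_one,
      scheme2_unknown_chip, scheme2_known_chip)
```

Under these assumptions the bound reproduces the qualitative behaviour described above: Scheme 1 handles a chip failure with or without location knowledge, while Scheme 2 can correct a chip failure only after the chip has been marked faulty. The 99.9986% detection probability is a result from the paper and is not derived by this sketch.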
Index Terms
- E-ECC: Low Power Erasure and Error Correction Schemes for Increasing Reliability of Commodity DRAM Systems