DOI: 10.1145/2818950.2818961
research-article

E-ECC: Low Power Erasure and Error Correction Schemes for Increasing Reliability of Commodity DRAM Systems

Published: 05 October 2015

ABSTRACT

Most server-grade memory systems provide Chipkill-Correct error protection at the expense of power and/or performance overhead. In this paper we present low-overhead schemes for improving the reliability of commodity DRAM systems with better power and IPC performance than Chipkill-Correct solutions. Specifically, we propose two erasure and error correction (E-ECC) schemes for x8 memory systems that have 12.5% storage overhead and do not require any change to the existing memory architecture. Both schemes have superior error performance due to the use of a strong ECC code, namely RS(36,32) over GF(2^8). Scheme 1 activates 18 chips per access and provides stronger reliability than Chipkill-Correct solutions. If the location of the faulty chip is known, Scheme 1 can correct an additional random error in a second chip. Scheme 2 trades off reliability for higher energy efficiency by activating only 9 chips per access. It cannot correct random errors due to a chip failure but can detect them with 99.9986% probability, and once a chip is marked faulty due to persistent errors, it can correct all errors due to that chip. Synthesis results in a 28nm node show that the RS(36,32) code has a very low decoding latency that can be well hidden in commodity memory systems and therefore has minimal effect on the DRAM access latency. Evaluations based on SPEC CPU 2006 sequential and multi-programmed workloads show that, compared to Chipkill-Correct, the proposed Schemes 1 and 2 improve IPC by an average of 3.2% (maximum of 13.8%) and 4.8% (maximum of 31.8%) and reduce power consumption by an average of 16.2% (maximum of 25%) and 26.8% (maximum of 36%), respectively.
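The capabilities claimed for the two schemes can be checked against the standard Reed-Solomon bound: a code with n - k check symbols can correct e errors at unknown locations together with s erasures (errors at known locations) whenever 2e + s <= n - k. The sketch below applies that bound to RS(36,32). It assumes that the 36 codeword symbols are spread evenly across the activated chips (2 symbols per chip with 18 chips, 4 with 9 chips) and that a failed chip may corrupt all of its symbols; this mapping is an assumption made for illustration and is not stated in the abstract. All function names are hypothetical.

```python
from fractions import Fraction

N, K = 36, 32        # RS(36,32): 36 symbols per codeword, 32 of them data
PARITY = N - K       # 4 check symbols


def storage_overhead(n: int = N, k: int = K) -> Fraction:
    """Check symbols as a fraction of data symbols."""
    return Fraction(n - k, k)


def correctable(errors: int, erasures: int, parity: int = PARITY) -> bool:
    """Standard RS bound: 2 * errors + erasures <= number of check symbols."""
    return 2 * errors + erasures <= parity


def symbols_per_chip(chips: int, n: int = N) -> int:
    """ASSUMPTION: codeword symbols are divided evenly among the chips."""
    if n % chips != 0:
        raise ValueError("symbols do not divide evenly among chips")
    return n // chips


def chip_failure_report(chips: int) -> dict:
    """What the bound allows when one whole chip fails (worst case)."""
    s = symbols_per_chip(chips)
    return {
        "symbols_per_chip": s,
        # Faulty chip not yet identified: its symbols count as errors.
        "unknown_location": correctable(errors=s, erasures=0),
        # Faulty chip identified: its symbols count as erasures.
        "known_location": correctable(errors=0, erasures=s),
        # Identified faulty chip plus one random error in another chip.
        "known_plus_one_error": correctable(errors=1, erasures=s),
    }


if __name__ == "__main__":
    print("storage overhead:", float(storage_overhead()))
    for chips in (18, 9):
        print(chips, "chips:", chip_failure_report(chips))
```

Under these assumptions the bound agrees with the abstract: with 18 chips a chip failure is correctable even before the chip is identified, and one further random error is correctable once it is; with 9 chips a chip failure exceeds the error-correction capability until the chip is marked faulty, after which its 4 symbols can be treated as erasures.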


                  Published in

                  MEMSYS '15: Proceedings of the 2015 International Symposium on Memory Systems
                  October 2015
                  278 pages
                  ISBN:9781450336048
                  DOI:10.1145/2818950

                  Copyright © 2015 ACM

                  Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                  Publisher

                  Association for Computing Machinery

                  New York, NY, United States

                  Qualifiers

                  • research-article
                  • Research
                  • Refereed limited
