E-ECC: Low Power Erasure and Error Correction Schemes for Increasing Reliability of Commodity DRAM Systems

Authors:
Hsing-Min Chen, Arizona State University
Akhil Arunkumar, Arizona State University
Carole-Jean Wu, Arizona State University
Trevor Mudge, University of Michigan
Chaitali Chakrabarti, Arizona State University

MEMSYS '15: Proceedings of the 2015 International Symposium on Memory Systems, October 2015, Pages 60–70
https://doi.org/10.1145/2818950.2818961

Published: 05 October 2015

ABSTRACT

Most server-grade memory systems provide Chipkill-Correct error protection at the expense of power and/or performance overhead. In this paper we present low-overhead schemes that improve the reliability of commodity DRAM systems while achieving lower power and higher IPC than Chipkill-Correct solutions. Specifically, we propose two erasure and error correction (E-ECC) schemes for x8 memory systems that have 12.5% storage overhead and require no change to the existing memory architecture. Both schemes have superior error performance due to the use of a strong ECC code, namely RS(36,32) over GF(2^8). Scheme 1 activates 18 chips per access and provides stronger reliability than Chipkill-Correct solutions. If the location of the faulty chip is known, Scheme 1 can also correct an additional random error in a second chip. Scheme 2 trades reliability for higher energy efficiency by activating only 9 chips per access. It cannot correct random errors caused by a chip failure but can detect them with 99.9986% probability; once a chip is marked faulty due to persistent errors, it can correct all errors due to that chip. Synthesis results in a 28 nm node show that the RS(36,32) code has a very low decoding latency that can be well hidden in commodity memory systems and therefore has minimal effect on the DRAM access latency. Evaluations based on SPEC CPU 2006 sequential and multi-programmed workloads show that, compared to Chipkill-Correct, the proposed Schemes 1 and 2 improve IPC by an average of 3.2% (maximum of 13.8%) and 4.8% (maximum of 31.8%) and reduce power consumption by an average of 16.2% (maximum of 25%) and 26.8% (maximum of 36%), respectively.
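
The correction claims in the abstract can be checked against the standard Reed-Solomon bound: a code with n - k check symbols corrects t errors at unknown locations and e erasures (errors at known locations) whenever 2t + e <= n - k. The sketch below applies that bound to RS(36,32). The mapping of codeword symbols to chips (36 symbols spread evenly over the active chips) is an assumption made for illustration and is not stated in the abstract; the helper names are hypothetical.

```python
# Sketch: error/erasure capability of RS(36,32) over GF(2^8).
# ASSUMPTION (not from the abstract): the 36 codeword symbols are spread
# evenly across the chips activated per access.

N, K = 36, 32        # codeword length and data length, in 8-bit symbols
CHECK = N - K        # 4 check symbols; minimum distance is CHECK + 1 = 5


def storage_overhead(n: int = N, k: int = K) -> float:
    """Check symbols as a fraction of data symbols."""
    return (n - k) / k


def correctable(errors: int, erasures: int, check: int = CHECK) -> bool:
    """Standard RS bound: 2 * errors + erasures <= n - k."""
    return 2 * errors + erasures <= check


def symbols_per_chip(chips: int, n: int = N) -> int:
    """Assumed even split of codeword symbols over the active chips."""
    return n // chips


s1 = symbols_per_chip(18)   # Scheme 1: 18 chips -> 2 symbols per chip
s2 = symbols_per_chip(9)    # Scheme 2: 9 chips  -> 4 symbols per chip

print(storage_overhead())                    # 0.125, i.e. 12.5%
print(correctable(errors=s1, erasures=0))    # Scheme 1, faulty chip unknown
print(correctable(errors=1, erasures=s1))    # Scheme 1, known chip + 1 random error
print(correctable(errors=s2, erasures=0))    # Scheme 2, faulty chip unknown
print(correctable(errors=0, erasures=s2))    # Scheme 2, chip marked faulty
```

Under the assumed mapping, the bound agrees with the abstract: Scheme 1 tolerates a whole-chip failure, and once that chip's location is known it has budget left for one random symbol error elsewhere; Scheme 2 exceeds the budget for an unlocated chip failure (so it can only detect it) but stays within it once the chip is treated as an erasure.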

Published in

MEMSYS '15: Proceedings of the 2015 International Symposium on Memory Systems
October 2015
278 pages
ISBN: 9781450336048
DOI: 10.1145/2818950

Copyright © 2015 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Chipkill-Correct
DRAM Memory System
DRAM errors
Erasure and Error Correction
Reliability
Qualifiers
- research-article
- Research
- Refereed limited