ABSTRACT
It has long been known that hardware components are not perfect and that soft errors in the form of single bit flips occur continually. So far, these single bit flips have mainly been addressed in hardware using general-purpose protection techniques. However, recent studies have shown that future hardware components will become less reliable overall and that multi-bit flips will occur regularly rather than exceptionally. Additionally, hardware aging effects will lead to error models that change at run-time. Scaling hardware-based protection techniques to cover changing multi-bit flips is possible, but it introduces large performance, chip area, and power overheads, which will become unaffordable in the future. To tackle this problem, an emerging research direction is to employ protection techniques in higher software layers such as compilers or applications, where the available knowledge can be used to specialize and adapt protection techniques efficiently. In this paper, we therefore propose AHEAD, a novel adaptable and on-the-fly hardware error detection approach for database systems. AHEAD provides configurable end-to-end error detection and reduces the storage and computation overhead compared to other techniques at this level. Our approach uses an arithmetic error coding technique that allows query processing to operate entirely on hardened data. This in turn enables on-the-fly detection, during query processing, of (i) errors that modify data stored in memory or transferred over an interconnect and (ii) errors induced during computations. Our exhaustive evaluation shows the benefits of the AHEAD approach.
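To make the idea of processing hardened data more concrete, the following is a minimal sketch that assumes an AN-style code, one well-known family of arithmetic error codes. The abstract does not name the concrete code or its parameters, so the constant `A`, the function names, and the toy column are hypothetical and chosen only for illustration. Each value is multiplied by `A`, a valid code word is always a multiple of `A`, and sums of code words remain valid code words, so an aggregation can run on encoded data and check it along the way.

```python
# Illustrative sketch only: an AN-style arithmetic code. The constant A and all
# function names are assumptions, not taken from the paper.

A = 641  # hypothetical odd constant; any odd A > 1 detects every single bit flip


def encode(value: int) -> int:
    """Harden a value by multiplying it with the constant A."""
    return value * A


def is_valid(code_word: int) -> bool:
    """A code word is valid only if it is a multiple of A."""
    return code_word % A == 0


def decode(code_word: int) -> int:
    """Check the code word and recover the original value."""
    if not is_valid(code_word):
        raise ValueError("corrupted code word detected")
    return code_word // A


def hardened_sum(code_words):
    """Sum encoded values directly, checking each input on the fly.

    Because (a * A) + (b * A) == (a + b) * A, the result is again a code word.
    """
    total = 0
    for code_word in code_words:
        if not is_valid(code_word):
            raise ValueError("corrupted input detected during aggregation")
        total += code_word
    return total


def flip_bit(word: int, bit: int) -> int:
    """Simulate a single bit flip at the given bit position."""
    return word ^ (1 << bit)


if __name__ == "__main__":
    column = [3, 14, 15]
    hardened = [encode(v) for v in column]
    print(decode(hardened_sum(hardened)))  # sum computed on hardened data

    corrupted = flip_bit(hardened[1], 5)
    print(is_valid(corrupted))  # the flipped word is no longer a multiple of A
```

A single bit flip changes a code word by plus or minus a power of two, which an odd `A` greater than 1 never divides, so the modulo check catches it. Covering multi-bit flips depends on the choice of `A` and the data width, which this sketch does not address.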
Index Terms
- AHEAD: Adaptable Data Hardening for On-the-Fly Hardware Error Detection during Database Query Processing