DOI: 10.1145/3126908.3126960
research-article

Experimental and analytical study of Xeon Phi reliability

Published: 12 November 2017

Abstract

We present an in-depth analysis of transient fault effects on HPC applications running on Intel Xeon Phi processors, based on radiation experiments and high-level fault injection. Besides measuring realistic error rates for the Xeon Phi, we quantify Silent Data Corruptions (SDCs) by correlating the distribution of corrupted elements in the output to the application's characteristics. We evaluate the benefits of imprecise computing for reducing the programs' error rate. For example, for HotSpot, a 0.5% tolerance in the output value reduces the error rate by 85%.
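The tolerance-based evaluation can be illustrated with a minimal sketch. It is not the authors' tooling: the function names, the relative-error metric, and the sample values below are assumptions chosen for illustration. The idea is that a run counts as an SDC only if at least one output element differs from the fault-free ("golden") output by more than the accepted tolerance.

```python
def relative_error(expected, observed):
    """Relative difference between two values (absolute when expected is 0)."""
    if expected == 0.0:
        return abs(observed)
    return abs(observed - expected) / abs(expected)


def count_corrupted(golden, output, tolerance=0.0):
    """Count output elements whose relative error exceeds `tolerance`."""
    if len(golden) != len(output):
        raise ValueError("golden and output must have the same length")
    return sum(
        1 for g, o in zip(golden, output) if relative_error(g, o) > tolerance
    )


def is_sdc(golden, output, tolerance=0.0):
    """A run is an SDC if any element is corrupted beyond the tolerance."""
    return count_corrupted(golden, output, tolerance) > 0


# Hypothetical data: one element is off by 0.2%, another by 1%.
golden = [100.0, 200.0, 300.0]
output = [100.2, 200.0, 303.0]

strict = count_corrupted(golden, output)                    # exact comparison
relaxed = count_corrupted(golden, output, tolerance=0.005)  # 0.5% tolerance
print(strict, relaxed)
```

With an exact comparison both deviating elements count as corrupted; with a 0.5% tolerance only the 1% deviation does. A run whose deviations all stay below the tolerance is no longer classified as an SDC, which is how accepting imprecise output can lower the measured error rate.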
We inject faults following different fault models to analyze the sensitivity of given applications. We show that portions of an application can be graded by criticality. For example, faults occurring in the middle of LUD execution, or in the Sort and Tree portions of CLAMR, are more critical than faults in the remaining portions. Mitigation techniques can then be relaxed or hardened according to the criticality of each portion.
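As a rough illustration of grading execution portions by criticality, the sketch below injects single bit flips into a toy kernel and reports the fraction of runs whose output differs from the golden output, per execution phase. Everything here is an assumption made for illustration: the single-bit-flip fault model, the `running_sum` kernel (a stand-in, not LUD or CLAMR), the three-way phase split, and all function names. It does not reproduce the injector used in the paper.

```python
import random
import struct


def flip_bit(value, bit):
    """Flip one bit (0 = least significant, 63 = sign) of a float64 value."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in [0, 63]")
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def running_sum(data, inject_at=None, bit=0):
    """Toy kernel: sum `data`, optionally corrupting the accumulator
    with a single bit flip just before step `inject_at`."""
    acc = 0.0
    for step, x in enumerate(data):
        if step == inject_at:
            acc = flip_bit(acc, bit)
        acc += x
    return acc


def mismatch_rate_by_phase(data, trials=200, seed=0):
    """Fraction of injected runs whose result differs from the golden
    result, for faults injected in the early, middle, and late thirds."""
    if len(data) < 3:
        raise ValueError("need at least three steps to form three phases")
    rng = random.Random(seed)
    golden = running_sum(data)
    third = len(data) // 3
    phases = {
        "early": range(0, third),
        "middle": range(third, 2 * third),
        "late": range(2 * third, len(data)),
    }
    rates = {}
    for name, steps in phases.items():
        mismatches = 0
        for _ in range(trials):
            result = running_sum(
                data, inject_at=rng.choice(steps), bit=rng.randrange(64)
            )
            # NaN compares unequal to everything, so it counts as a mismatch.
            if result != golden:
                mismatches += 1
        rates[name] = mismatches / trials
    return rates


if __name__ == "__main__":
    print(mismatch_rate_by_phase([1.0] * 30))
```

In a real campaign the per-phase rates would come from the application under study; phases with higher rates are candidates for stronger hardening, while the others could use lighter mitigation.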



      Published In

      SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
      November 2017
      801 pages
      ISBN:9781450351140
      DOI:10.1145/3126908
      • General Chair: Bernd Mohr
      • Program Chair: Padma Raghavan
      © 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

      Sponsors

      In-Cooperation

      • IEEE CS

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. fault injection
      2. fault tolerance
      3. parallel architectures
      4. radiation experiments
      5. reliability

      Qualifiers

      • Research-article

      Acceptance Rates

      SC '17 Paper Acceptance Rate 61 of 327 submissions, 19%;
      Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (last 12 months): 24
      • Downloads (last 6 weeks): 2
      Reflects downloads up to 05 Mar 2025

      Citations

      Cited By

      • (2025) TurboFFT: Co-Designed High-Performance and Fault-Tolerant Fast Fourier Transform on GPUs. Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 70-84. DOI: 10.1145/3710848.3710853. Online publication date: 28-Feb-2025
      • (2024) Assessing the Impact of Compiler Optimizations on GPUs Reliability. ACM Transactions on Architecture and Code Optimization 21:2, 1-22. DOI: 10.1145/3638249. Online publication date: 12-Jan-2024
      • (2024) FT K-Means: A High-Performance K-Means on GPU with Fault Tolerance. 2024 IEEE International Conference on Cluster Computing (CLUSTER), 322-334. DOI: 10.1109/CLUSTER59578.2024.00035. Online publication date: 24-Sep-2024
      • (2024) Comparative analysis of soft-error sensitivity in LU decomposition algorithms on diverse GPUs. The Journal of Supercomputing. DOI: 10.1007/s11227-024-05925-0. Online publication date: 22-Feb-2024
      • (2023) A Systematic Literature Review on Hardware Reliability Assessment Methods for Deep Neural Networks. ACM Computing Surveys 56:6, 1-39. DOI: 10.1145/3638242. Online publication date: 23-Dec-2023
      • (2023) Predicting GPU Failures With High Precision Under Deep Learning Workloads. Proceedings of the 16th ACM International Conference on Systems and Storage, 124-135. DOI: 10.1145/3579370.3594777. Online publication date: 5-Jun-2023
      • (2023) Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs. Proceedings of the 37th International Conference on Supercomputing, 360-372. DOI: 10.1145/3577193.3593715. Online publication date: 21-Jun-2023
      • (2023) FT-BLAS: A Fault Tolerant High Performance BLAS Implementation on x86 CPUs. IEEE Transactions on Parallel and Distributed Systems 34:12, 3207-3223. DOI: 10.1109/TPDS.2023.3316011. Online publication date: Dec-2023
      • (2023) Characterizing a Neutron-Induced Fault Model for Deep Neural Networks. IEEE Transactions on Nuclear Science 70:4, 370-380. DOI: 10.1109/TNS.2022.3224538. Online publication date: Apr-2023
      • (2023) Testability and Dependability of AI Hardware: Survey, Trends, Challenges, and Perspectives. IEEE Design & Test 40:2, 8-58. DOI: 10.1109/MDAT.2023.3241116. Online publication date: Apr-2023
