Evaluating and Accelerating High-Fidelity Error Injection for HPC

We address two concerns in analyzing how applications behave in the presence of hardware errors: (1) when it is important to model with high fidelity how hardware faults lead to erroneous values (instruction-level errors), as opposed to using simple bit-flipping models, and (2) how to enable fast high-fidelity error-injection campaigns, in particular when error detectors are employed. We present and verify a new nested Monte Carlo methodology for evaluating high-fidelity gate-level fault models and error-detector coverage; it is orders of magnitude faster than current approaches. Using that methodology, we demonstrate that, without detectors, simple error models suffice for evaluating errors in 9 HPC benchmarks.
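
For readers unfamiliar with the term, the "simple bit-flipping model" mentioned above can be illustrated with a minimal sketch: one bit of a value's binary encoding is inverted to stand in for an instruction-level error. The sketch assumes the corrupted value is an IEEE 754 double and that the bit position is drawn uniformly at random. The function names `flip_bit` and `inject_random_bit_flip` are hypothetical and are not taken from the paper, and the sketch does not reproduce the gate-level fault models or the nested Monte Carlo methodology the abstract describes.

```python
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Invert one bit (0 = least significant) of value's 64-bit IEEE 754 encoding."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range [0, 64)")
    # Reinterpret the double as an unsigned 64-bit integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    # XOR with a one-hot mask inverts exactly the selected bit.
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return corrupted


def inject_random_bit_flip(value: float, rng: random.Random) -> float:
    """Illustrative single-bit-flip error model: uniformly random bit position (an assumption)."""
    return flip_bit(value, rng.randrange(64))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so an injection run is repeatable
    print(inject_random_bit_flip(3.14159, rng))
```

Flipping the same bit twice restores the original value, which makes the helper easy to sanity-check before using it in a larger injection campaign.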

Authors: 
Chun-Kai Chang (The University of Texas at Austin)
Sangkug Lym (The University of Texas at Austin)
Nicholas Kelly (The University of Texas at Austin)
Mattan Erez (The University of Texas at Austin)
Publication Date: 
Sunday, November 11, 2018