Evaluating and Accelerating High-Fidelity Error Injection for HPC

We address two important concerns in analyzing how applications behave in the presence of hardware errors: (1) when it is important to model with high fidelity how hardware faults lead to erroneous values (instruction-level errors), as opposed to using simple bit-flipping models, and (2) how to enable fast high-fidelity error injection campaigns, in particular when error detectors are employed. We present and verify a new nested Monte Carlo methodology for evaluating high-fidelity gate-level fault models and error-detector coverage; it is orders of magnitude faster than current approaches. Using that methodology, we demonstrate that, without detectors, simple error models suffice for evaluating errors in nine HPC benchmarks.
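For readers unfamiliar with the term, a "simple bit-flipping model" usually means corrupting an instruction's result by inverting one randomly chosen bit. The Python sketch below illustrates that idea for a double-precision value. It is not the injector used in the paper: the function names `flip_bit` and `inject_random_bit_flip` are hypothetical, and the choice of a uniformly random bit position is an assumption made for illustration.

```python
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 binary64 encoding inverted.

    Bit 0 is the least significant mantissa bit; bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit index must be in [0, 64)")
    # Reinterpret the double as a 64-bit unsigned integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    # Invert the selected bit and reinterpret the result as a double.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def inject_random_bit_flip(value: float, rng: random.Random) -> float:
    """Corrupt `value` by flipping one bit chosen uniformly at random (assumed model)."""
    return flip_bit(value, rng.randrange(64))
```

Passing a seeded `random.Random` instance makes an injection run repeatable. A high-fidelity gate-level model, by contrast, derives the erroneous value from a fault inside the hardware unit that computes it, so the resulting error need not be a single inverted bit.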

Authors

Chun-Kai Chang (The University of Texas at Austin)

Sangkug Lym (The University of Texas at Austin)

Nicholas Kelly (The University of Texas at Austin)

Michael B. Sullivan

Mattan Erez (The University of Texas at Austin)

Publication Date

Sunday, November 11, 2018

Published in

The International Conference on High Performance Computing, Networking, Storage…

Research Area

High Performance Computing

Resilience and Safety

External Links

IEEE Digital Library

Uploaded Files

Published manuscript (2.35 MB)

Copyright

This material is posted here with permission of the IEEE. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to pubs-permissions@ieee.org.