Towards Analytically Evaluating the Error Resilience of GPU Programs

General-purpose Graphics Processing Units (GPUs) have become popular in many reliability-conscious domains, including high-performance computing, machine learning, and business analytics. Fault injection is generally used to determine the reliability profile of a program in the presence of soft errors, but it is highly resource- and time-intensive. TRIDENT, an analytical model, was developed to predict the silent data corruption (SDC) probabilities of CPU programs using a three-level modeling technique. However, it is not clear how accurately such analytical modeling predicts the SDC probabilities of GPU programs, which are highly parallel and have a very different programming model from CPU programs. In this paper, we adopt the original TRIDENT methodology to model error propagation in CUDA-based GPU applications and compare the accuracy of its SDC predictions against fault injection experiments. We find a discrepancy between the predicted results and the fault injection results, due to the different memory behavior of GPU programs. We also observe that the large number of threads in GPU applications increases the amount of information to be profiled, which complicates profiling in TRIDENT and results in significant slowdown. We analyze the results, investigate the bottlenecks of TRIDENT in GPU applications, and propose potential solutions to mitigate these problems.