Demystifying GPU Reliability: Comparing and Combining Beam Experiments, Fault Simulation, and Profiling

Graphics Processing Units (GPUs) have moved from being dedicated devices for multimedia and gaming applications to general-purpose accelerators employed in High-Performance Computing (HPC) and safety-critical applications such as autonomous vehicles. This market shift led to a burst in the GPU’s computing capabilities and efficiency, significant improvements in the programming frameworks and performance evaluation tools, and a concern about their hardware reliability.

In this paper, we compare and combine high-energy neutron beam experiments that account for more than 13 million years of natural terrestrial exposure, extensive architectural-level fault simulations that required more than 350 GPU hours (using SASSIFI and NVBitFI), and detailed application-level profiling. Our main goal is to answer one of the fundamental open questions in GPU reliability evaluation: whether fault simulation provides representative results that can be used to predict the failure rates of workloads running on GPUs. We show that, in most cases, the fault simulation-based prediction for silent data corruptions is sufficiently close (differing by less than 5x) to the experimentally measured rates. We also analyze the reliability of some of the main GPU functional units (including mixed-precision and tensor cores). We find that the way GPU resources are instantiated plays a critical role in the overall system reliability and that faults outside the functional units generate most detectable errors.

Authors: 
Fernando Fernandes dos Santos (UFRGS, Brazil)
Pedro Martins Basso (UFRGS, Brazil)
Luigi Carro (UFRGS, Brazil)
Paolo Rech (Politecnico di Torino, Italy)
Publication Date: 
Monday, May 17, 2021