 # Towards Precision-Aware Fault Tolerance Approaches for Mixed-Precision Applications


Graphics Processing Units (GPUs), the dominant accelerators in HPC systems, are susceptible to transient hardware faults. A new generation of GPUs features mixed-precision architectures, such as NVIDIA Tensor Cores, to accelerate matrix multiplications. Although these architectures are widely adopted, how they behave under transient hardware faults remains unclear. In this study, we conduct large-scale fault injection experiments on GEMM kernels implemented with different floating-point data types on the V100 and A100 Tensor Cores and show that GEMMs with different formats exhibit distinct error resilience characteristics. We plan to explore this space in the future by building precision-aware floating-point fault tolerance techniques for applications, such as DNNs, that exercise low-precision computations.
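To make the idea of injecting a fault into a GEMM result concrete, the following is a minimal software-only sketch. It assumes a single-bit-flip fault model applied to one output element, which the abstract does not specify, and it runs on the CPU with NumPy rather than on Tensor Cores. The helper names `flip_bit` and `gemm_with_fault` are hypothetical and are not taken from the paper or its tooling.

```python
import numpy as np

def flip_bit(value, bit):
    """Return `value` with one bit of its binary encoding flipped.

    Illustrative fault model only (assumption): one transient bit flip.
    """
    arr = np.asarray(value).reshape(1)
    # Pick the unsigned integer type with the same width as the float type.
    uint_type = {2: np.uint16, 4: np.uint32, 8: np.uint64}[arr.dtype.itemsize]
    bits = arr.view(uint_type)               # reinterpret the raw bits
    flipped = bits ^ uint_type(1 << bit)     # toggle the chosen bit
    return flipped.view(arr.dtype)[0]

def gemm_with_fault(a, b, row, col, bit, dtype):
    """Compute C = A @ B, store it as `dtype`, and corrupt C[row, col]."""
    # Accumulate in FP32, then round the result to the target format.
    c = (a.astype(np.float32) @ b.astype(np.float32)).astype(dtype)
    faulty = c.copy()
    faulty[row, col] = flip_bit(c[row, col], bit)
    return c, faulty

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((4, 4))
    b = rng.standard_normal((4, 4))
    for dtype in (np.float16, np.float32):
        top_bit = np.dtype(dtype).itemsize * 8 - 2  # highest exponent bit
        golden, faulty = gemm_with_fault(a, b, 0, 0, top_bit, dtype)
        print(np.dtype(dtype).name, golden[0, 0], "->", faulty[0, 0])
```

The same bit position can have very different consequences depending on the format, because FP16 and FP32 divide their bits differently between exponent and mantissa. That difference is one reason a precision-aware view of resilience is plausible, although this sketch does not reproduce the study's experiments or results.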



 ## Authors



Bo Fang (Pacific Northwest National Laboratory)

[Siva Hari](/person/siva-hari)

Timothy Tsai (NVIDIA)

Xinyi Li (University of Utah)

Ganesh Gopalakrishnan (University of Utah)

Ignacio Laguna (Lawrence Livermore National Laboratory)

Kevin Barker (Pacific Northwest National Laboratory)

Ang Li (Pacific Northwest National Laboratory)

 

 

 ## Publication Date



Sunday, November 13, 2022

 

 ## Published in



[Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS)](https://ieeexplore.ieee.org/document/10024043)

 

 ## Research Area



[Computer Architecture](/research-area/computer-architecture)

[High Performance Computing](/research-area/high-performance-computing)

[Resilience and Safety](/research-area/resilience)

 

 

 ## External Links



[IEEE Digital Library](https://ieeexplore.ieee.org/document/10024043)

 

 

 ## Uploaded Files



[Published manuscript](https://d1qx31qr3h6wln.cloudfront.net/publications/FTXS_2022_FT_Mixed.pdf "Open file in new window") (491.47 KB)

 

 

 ## Copyright



This material is posted here with permission of the IEEE. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to <pubs-permissions@ieee.org>.