Reduced Precision DWC: An Efficient Hardening Strategy for Mixed-Precision Architectures

Duplication with Comparison (DWC) is an effective software-level solution to improve the reliability of computing devices. However, it introduces performance and energy consumption overheads that can be unsuitable for high-performance computing or real-time safety-critical applications. In this article, we present Reduced-Precision Duplication with Comparison (RP-DWC) as a means to lower the overhead of DWC by executing the redundant copy in reduced precision. RP-DWC is particularly suitable for modern mixed-precision architectures, such as NVIDIA GPUs, that feature dedicated functional units for computing with programmable accuracy. We discuss the benefits and challenges associated with RP-DWC and show that the intrinsic difference between the mixed-precision copies allows for detecting most, but not all, errors. The undetected faults are those whose effect falls within the difference between the two precisions; they therefore have a much smaller impact on the application output and might be tolerated. We investigate the impact of RP-DWC on fault detection, performance, and energy consumption on Volta GPUs. Through fault injection and beam experiments, using three microbenchmarks and four real applications, we show that RP-DWC achieves excellent fault coverage (up to 86 percent) with minimal overheads (as low as 0.1 percent time overhead and 24 percent energy consumption overhead).
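The core idea can be illustrated with a minimal CPU-side sketch. This is not the authors' GPU implementation: the function names `rp_dwc` and `demo_kernel`, the `rel_tol` threshold, and the use of NumPy `float64` and `float32` as stand-ins for the full- and reduced-precision copies are all assumptions made for illustration.

```python
import numpy as np

def rp_dwc(kernel, x, rel_tol=1e-3):
    """Hypothetical helper: run `kernel` in full precision and again in
    reduced precision, then compare the two results.

    Returns the full-precision output and a flag that is True when the
    copies disagree by more than the tolerance.
    """
    full = kernel(x.astype(np.float64))     # primary copy
    reduced = kernel(x.astype(np.float32))  # cheaper redundant copy
    # The copies never match bit for bit, so the comparison needs a
    # threshold (assumed here). Any error smaller than the threshold is
    # indistinguishable from ordinary rounding and goes undetected.
    scale = np.maximum(np.abs(full), 1.0)
    mismatch = np.abs(full - reduced.astype(np.float64)) > rel_tol * scale
    return full, bool(mismatch.any())

def demo_kernel(v, fault=0.0):
    """Hypothetical kernel; `fault` corrupts only the float64 copy."""
    out = v * 2 + 1
    if out.dtype == np.float64:
        out[3] += fault  # emulate a transient error in one output element
    return out
```

With such a scheme, a large corruption of the full-precision copy is flagged, while a corruption smaller than the precision gap passes the comparison, which matches the trade-off described in the abstract.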

Fernando F. dos Santos (Universidade Federal do Rio Grande do Sul (UFRGS))
Marcelo Brandalero (Brandenburg University of Technology Cottbus-Senftenberg (B-TU))
Pedro M. Basso (Universidade Federal do Rio Grande do Sul (UFRGS))
Michael Hubner (Brandenburg University of Technology Cottbus-Senftenberg (B-TU))
Luigi Carro (Universidade Federal do Rio Grande do Sul (UFRGS))
Paolo Rech (Politecnico di Torino)