Optimizing Selective Protection for CNN Resilience

As CNNs are extensively employed in high-performance and safety-critical applications that demand high reliability, it is important to ensure that they are resilient to transient hardware errors. Traditional full-redundancy solutions provide high error coverage, but the associated overheads are often prohibitively high for resource-constrained systems. In this work, we propose software-directed selective protection techniques that target the most vulnerable work in a CNN, providing a low-cost solution. We propose and evaluate two domain-specific selective protection techniques for CNNs that target different granularities. First, we develop a feature-map level resilience technique (FLR), which identifies and statically protects the most vulnerable feature maps in a CNN. Second, we develop an inference level resilience technique (ILR), which selectively reruns vulnerable inferences by analyzing their output. Finally, we show that the combination of both techniques (FILR) is highly efficient, achieving nearly full error coverage (99.78% on average) for quantized inferences via selective protection. Our tunable approach enables developers to evaluate CNN resilience to hardware errors before deployment, using the number of additional MAC operations as the overhead metric for quicker trade-off analysis. For example, targeting 100% error coverage on ResNet50 with FILR requires 20.8% additional MACs, while measurements on a Jetson Xavier GPU show 4.6% runtime overhead.
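The coverage-versus-MAC trade-off described above can be illustrated with a small sketch. It is not the paper's algorithm: the greedy ranking, the `FeatureMap` fields, the `select_for_protection` helper, and the assumption that protecting a feature map costs one extra copy of its MACs (duplication) are all hypothetical choices made for illustration, and the numbers are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureMap:
    name: str
    vulnerability: int  # hypothetical: output mismatches traced to this feature map
    macs: int           # MACs needed to compute this feature map once


def select_for_protection(fmaps, target_coverage):
    """Greedily pick feature maps until the target share of vulnerability is covered.

    Assumption: protection duplicates the feature map's computation, so the
    overhead of protecting it equals its own MAC count.
    Returns (chosen feature maps, achieved coverage, MAC overhead fraction).
    """
    total_vuln = sum(f.vulnerability for f in fmaps)
    total_macs = sum(f.macs for f in fmaps)
    # Rank by vulnerability removed per MAC spent, most cost-effective first.
    ranked = sorted(fmaps, key=lambda f: f.vulnerability / f.macs, reverse=True)

    chosen, covered, extra_macs = [], 0, 0
    for fmap in ranked:
        if covered >= target_coverage * total_vuln:
            break
        chosen.append(fmap)
        covered += fmap.vulnerability
        extra_macs += fmap.macs
    return chosen, covered / total_vuln, extra_macs / total_macs


# Invented profile for four feature maps, for illustration only.
PROFILE = [
    FeatureMap("a", vulnerability=50, macs=100),
    FeatureMap("b", vulnerability=30, macs=100),
    FeatureMap("c", vulnerability=15, macs=200),
    FeatureMap("d", vulnerability=5, macs=600),
]

if __name__ == "__main__":
    for target in (0.75, 0.95, 1.0):
        chosen, coverage, overhead = select_for_protection(PROFILE, target)
        names = [f.name for f in chosen]
        print(f"target={target:.2f} protect={names} "
              f"coverage={coverage:.2f} extra MACs={overhead:.0%}")
```

Sweeping the target in this way shows why counting MACs is a convenient proxy: the cost of each coverage level can be estimated from a profile alone, without running the protected network on hardware.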

Authors

Abdulrahman Mahmoud (Harvard University)

Siva Hari

Christopher W. Fletcher (University of Illinois at Urbana-Champaign)

Sarita V. Adve (University of Illinois at Urbana-Champaign)

Charbel Sakr

Naresh Shanbhag (University of Illinois at Urbana-Champaign)

Pavlo Molchanov

Michael B. Sullivan

Timothy Tsai (NVIDIA)

Steve Keckler

Publication Date

Monday, October 25, 2021

Published in

International Symposium on Software Reliability Engineering (ISSRE)

Research Area

Artificial Intelligence and Machine Learning

Computer Architecture

Resilience and Safety

External Links

IEEE Digital Library

Uploaded Files

Published manuscript (1.69 MB)

Copyright

This material is posted here with permission of the IEEE. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to pubs-permissions@ieee.org.