As CNNs are increasingly deployed in high-performance computing and safety-critical applications, ensuring that they are resilient to transient hardware errors is important. Full duplication provides high reliability, but its overheads are prohibitive for resource-constrained systems. Fine-grained resilience evaluation and protection can provide a low-cost alternative, but traditional evaluation methods can be too slow: they rely on error injections and essentially discard the information from experiments that do not corrupt the outcome. In this work, we replace the binary view of errors with a new continuous, domain-specific metric based on cross-entropy loss to quantify corruptions, allowing error analysis to converge faster. This enables us to scale to large networks. We study the effectiveness of this method under different error models and compare it with heuristics that aim to predict vulnerability quickly. We show that selective, fine-grained protection of the most vulnerable components of a CNN provides a significantly lower-overhead solution than full duplication. Lastly, we present a framework called HarDNN that packages these solutions for easy application.
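To illustrate the contrast between the binary and continuous views of an error, the following minimal sketch compares a top-1 mismatch check with a loss-based measure for a single injection experiment. It is an assumption-laden illustration, not the HarDNN implementation: the function names (`binary_mismatch`, `delta_loss`) are hypothetical, and the continuous measure is assumed here to be the absolute difference in cross-entropy loss between the error-free and error-injected outputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy loss of one sample against its true class index."""
    return -math.log(softmax(logits)[label])


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


def binary_mismatch(golden_logits, faulty_logits):
    """Binary view: 1 if the injected error changed the top-1 class, else 0."""
    return int(argmax(golden_logits) != argmax(faulty_logits))


def delta_loss(golden_logits, faulty_logits, label):
    """Continuous view (assumed form): absolute change in cross-entropy loss."""
    return abs(cross_entropy(faulty_logits, label)
               - cross_entropy(golden_logits, label))


# Hypothetical logits for one image whose true class is 0.
golden = [4.0, 1.0, 0.5]   # error-free run
faulty = [2.0, 1.8, 0.5]   # run with an injected error; top-1 is unchanged
label = 0

print("mismatch:", binary_mismatch(golden, faulty))
print("delta loss:", delta_loss(golden, faulty, label))
```

In this example the predicted class does not change, so the binary view records nothing, while the loss-based measure still registers that the injected error moved the output toward a misclassification. Under the assumption above, this is the kind of information that a mismatch-only analysis would discard.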