Resilience and Safety
Associated Publications
GPU-Trident: Efficient Modeling of Error Propagation in GPU Programs
Generating and Characterizing Scenarios for Safety Testing of Autonomous Vehicles
AV-FUZZER: Finding Safety Violations in Autonomous Driving Systems
PyTorchFI: A Runtime Perturbation Tool for DNNs
Making Convolutions Resilient via Algorithm-Based Error Detection Techniques
Estimating Silent Data Corruption Rates Using a Two-Level Model
Feature Map Vulnerability Evaluation in CNNs
GPU Snapshot: Checkpoint Offloading for GPU-Dense Systems
On the Trend of Resilience for GPU-Dense Systems
ML-based Fault Injection for Autonomous Vehicles: A Case for Bayesian Fault Injection
Optimizing Software-Directed Instruction Replication for GPU Error Detection
Evaluating and Accelerating High-Fidelity Error Injection for HPC
Kayotee: A Fault Injection-based System to Assess the Safety and Reliability of Autonomous Vehicles to Faults and Errors
SwapCodes: Error Codes for Hardware-Software Cooperative GPU Pipeline Error Detection
CRUM: Checkpoint-Restart Support for CUDA's Unified Memory
Modeling Soft Error Propagation in Programs
Hamartia: A Fast and Accurate Error Injection Framework
Understanding Error Propagation in Deep Learning Neural Network (DNN) Accelerators and Applications
SASSIFI: An Architecture-level Fault Injection Tool for GPU Application Resilience Evaluation
Approxilyzer: Towards A Systematic Framework for Instruction-Level Approximate Computing and its Application to Hardware Resiliency