Evaluating and Accelerating High-Fidelity Error Injection for HPC

We address two important concerns in the analysis of application behavior in the presence of hardware errors: (1) when it is important to model with high fidelity how hardware faults lead to erroneous values (instruction-level errors), as opposed to using simple bit-flipping models, and (2) how to enable fast high-fidelity error injection campaigns, in particular when error detectors are employed.

Hamartia: A Fast and Accurate Error Injection Framework

The single bit-flip has been the most popular error model in fault-injection resilience studies. We use RTL gate-level fault injection to show that this model fails to cover many realistic hardware faults: single-event transients in combinational logic and single-event upsets in pipeline latches can both lead to complex multi-bit errors at the architecture level. RTL simulation, although accurate, is too slow to evaluate application-level resilience.
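To make the contrast concrete, the two error models can be sketched as follows. The function names and the 64-bit word width are illustrative, not Hamartia's API; the point is only that a single-event fault in real hardware can surface as several corrupted bits, which the single bit-flip model never produces.

```python
import random

def flip_single_bit(value, width=64):
    """Classic single bit-flip model: corrupt exactly one bit."""
    bit = random.randrange(width)
    return value ^ (1 << bit)

def flip_multi_bit(value, width=64, max_bits=4):
    """Multi-bit error, e.g. a single-event transient in combinational
    logic that fans out into several pipeline latches."""
    n = random.randint(2, max_bits)
    for bit in random.sample(range(width), n):  # distinct bit positions
        value ^= (1 << bit)
    return value

original = 0xDEADBEEF
corrupted_simple = flip_single_bit(original)  # differs in exactly 1 bit
corrupted_multi = flip_multi_bit(original)    # differs in 2-4 bits
```

An injection campaign built only on `flip_single_bit` will never exercise the multi-bit outcomes that the gate-level experiments show to be common.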

CRUM: Checkpoint-Restart Support for CUDA's Unified Memory

Unified Virtual Memory (UVM) was recently introduced on NVIDIA GPUs. Through software and hardware support, UVM provides a coherent shared memory across the entire heterogeneous node, migrating data as appropriate. The older CUDA programming style is akin to that of older large-memory UNIX applications, which directly loaded and unloaded memory segments. Newer CUDA programs have started taking advantage of UVM for the same reason of superior programmability that long ago led UNIX applications to assume the presence of virtual memory.

SwapCodes: Error Codes for Hardware-Software Cooperative GPU Pipeline Error Detection

Intra-thread instruction duplication offers straightforward and effective pipeline error detection for data-intensive processors. However, software-enforced instruction duplication uses explicit checking instructions, roughly doubles program register usage, and doubles the number of arithmetic operations per thread, potentially leading to severe slowdowns. This paper investigates SwapCodes, a family of software-hardware cooperative mechanisms to accelerate intra-thread duplication in GPUs.
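The software-only baseline that SwapCodes aims to accelerate can be sketched as follows. This is a conceptual Python illustration, not the paper's GPU mechanism: real instruction duplication is inserted by the compiler at the instruction level, whereas here a hypothetical `duplicated` wrapper makes the costs visible: every operation runs twice, and an explicit check compares the two results.

```python
def duplicated(op):
    """Intra-thread duplication in miniature: execute the operation
    twice and compare; a mismatch signals a transient pipeline error
    before the corrupted value propagates."""
    def checked(*args):
        primary = op(*args)
        shadow = op(*args)  # doubles arithmetic work and register pressure
        if primary != shadow:  # explicit checking "instruction"
            raise RuntimeError("pipeline error detected")
        return primary
    return checked

@duplicated
def fma(x, y, z):
    return x * y + z

result = fma(2, 3, 4)  # -> 10, computed and verified twice
```

The overheads named in the abstract map directly onto this sketch: the shadow computation doubles the arithmetic, holding both copies doubles register usage, and the comparison is the explicit checking instruction that SwapCodes' hardware-software cooperation seeks to eliminate.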

Umar Iqbal

I am a Senior Research Scientist at NVIDIA Research and part of the Machine Learning and Perception group headed by Jan Kautz. Prior to that, I completed my Ph.D. in Computer Science (2014-2018) at the University of Bonn, Germany, under the supervision of Prof. Juergen Gall.

Arash Vahdat

Arash Vahdat is a Research Director, leading the fundamental generative AI research (GenAIR) team at NVIDIA Research. Before joining NVIDIA, he was a research scientist at D-Wave Systems, working on generative learning and its applications in label-efficient training. Before D-Wave, Arash was a research faculty member at Simon Fraser University (SFU), where he led deep learning-based video analysis research and taught master's courses on machine learning for big data.

Zi Yan

My research interests focus on Computer Architecture and Operating Systems, especially Virtual Memory. Virtual memory is a useful middle layer that provides good programmability and performance, but it now needs to catch up with new heterogeneous memory systems.

Jeff Smith

Jeff Smith joined NVIDIA in 2016, working on computer vision and deep learning tools for autonomous aerial vehicles. He joined NVIDIA Research in 2018, where he develops tools and systems for deep learning in the fields of computer perception and robotics.

Metaoptimization on a Distributed System for Deep Reinforcement Learning

Training intelligent agents through reinforcement learning is a notoriously unstable procedure. Massive parallelization on GPUs and distributed systems has been exploited to generate large amounts of training experience and thereby reduce instability, but the success of training remains strongly influenced by the choice of hyperparameters.
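A minimal sketch of metaoptimization over hyperparameters is shown below, assuming random search over the learning rate. All names are hypothetical, and the noisy toy objective merely stands in for a full RL training run that would execute on distributed GPU workers; the paper's actual distributed metaoptimizer is not reproduced here.

```python
import math
import random

def train_run(lr, seed):
    """Stand-in for one RL training run; returns a noisy score.
    A real run would train an agent on a GPU/distributed worker."""
    noise = random.Random(seed).gauss(0, 0.1)  # mimics RL instability
    return -abs(math.log10(lr) + 3) + noise    # toy optimum near lr = 1e-3

def metaoptimize(n_trials=20, seed=0):
    """Outer loop: sample learning rates log-uniformly, keep the best.
    Trials are independent, so they parallelize trivially across workers."""
    rng = random.Random(seed)
    best_lr, best_score = None, float("-inf")
    for t in range(n_trials):
        lr = 10 ** rng.uniform(-5, -1)
        score = train_run(lr, seed=t)
        if score > best_score:
            best_lr, best_score = lr, score
    return best_lr, best_score

best_lr, best_score = metaoptimize()
```

Because each trial is an independent training run, the outer loop maps naturally onto the massive parallelism described above, which is what makes metaoptimization on a distributed system attractive despite its cost.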