Locality-Centric Data and Threadblock Management for Massive GPUs

Recent work has shown that building GPUs with hundreds of SMs in a single monolithic chip will not be practical due to slowing growth in transistor density, low chip yields, and photoreticle limitations. To maintain performance scalability, proposals exist to aggregate discrete GPUs into a larger virtual GPU and decompose a single GPU into multiple-chip-modules with increased aggregate die area. These approaches introduce non-uniform memory access (NUMA) effects and lead to decreased performance and energy-efficiency if not managed appropriately.

Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs

As communication protocols evolve, datacenter network utilization increases. As a result, congestion is more frequent, causing higher latency and packet loss. Combined with the increasing complexity of workloads, manual design of congestion control (CC) algorithms becomes extremely difficult. This calls for the development of AI approaches to replace the human effort. Unfortunately, it is currently not possible to deploy AI models on network devices due to their limited computational capabilities.

Never Worse, Mostly Better: Stable Policy Improvement in Deep Reinforcement Learning

In recent years, there has been significant progress in applying deep reinforcement learning (RL) for solving challenging problems across a wide variety of domains. Nevertheless, convergence of various methods has been shown to suffer from inconsistencies, due to algorithmic instability and variance, as well as stochasticity in the benchmark environments. Particularly, despite the fact that the agent's performance may be improving on average, it may abruptly deteriorate at late stages of training.

Towards Precision-Aware Fault Tolerance Approaches for Mixed-Precision Applications

Graphics Processing Units (GPUs), the dominantly adopted accelerators in HPC systems, are susceptible to a transient hardware fault. A new generation of GPUs features mixed-precision architectures such as NVIDIA Tensor Cores to accelerate matrix multiplications. While widely adapted, how they would behave under transient hardware faults remain unclear.

Zhuyi: Perception Processing Rate Estimation for Safety in Autonomous Vehicles

The processing requirement of autonomous vehicles (AVs) for high-accuracy perception in complex scenarios can exceed the resources offered by the in-vehicle computer, degrading safety and comfort. This paper proposes a sensor frame processing rate (FPR) estimation model, Zhuyi, that quantifies the minimum safe FPR continuously in a driving scenario. Zhuyi can be employed post-deployment as an online safety check and to prioritize work.

DUO: Exposing On-chip Redundancy to Rank-Level ECC for High Reliability

DRAM row and column sparing cannot efficiently tolerate the increasing inherent fault rate caused by continued process scaling. In-DRAM ECC (IECC), an appealing alternative to sparing, can resolve inherent faults without significant changes to DRAM, but it is inefficient for highly-reliable systems where rank-level ECC (RECC) is already used against operational faults.