DUCATI: High-performance Address Translation by Extending TLB Reach of GPU-accelerated Systems

Conventional on-chip TLB hierarchies are unable to fully cover the growing application working-set sizes. To make matters worse, Last-Level TLB (LLT) misses require multiple accesses to the page table even with the use of page walk caches. Consequently, LLT misses incur long address translation latency and hurt performance. This paper proposes two low-overhead hardware mechanisms for reducing the frequency and penalty of on-die LLT misses. The first, Unified CAche and TLB (UCAT), enables the conventional on-die

On the Trend of Resilience for GPU-Dense Systems

Emerging high-performance computing (HPC) systems show a tendency toward heterogeneous nodes that are dense with accelerators such as GPUs, which offer higher computational power at lower energy and cost than homogeneous CPU-only nodes. While an accelerator-rich machine reduces the total number of compute nodes required to achieve a performance target, each node becomes more susceptible to failures, both of the individual accelerators and of the intra-node resources shared among many accelerators. Such failures must be recovered by end-to-end resilience schemes such as checkpoint-restart.

Reduced Precision DWC: An Efficient Hardening Strategy for Mixed-Precision Architectures

Duplication with Comparison (DWC) is an effective software-level solution to improve the reliability of computing devices. However, it introduces performance and energy consumption overheads that can make it unsuitable for high-performance computing or real-time safety-critical applications. In this article, we present Reduced-Precision Duplication with Comparison (RP-DWC) as a means to lower the overhead of DWC by executing the redundant copy in reduced precision.
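The core idea can be sketched in a few lines of Python: run the primary computation at full (double) precision, run the duplicate through an emulated single-precision datapath, and flag disagreements larger than the expected precision loss. This is an illustrative sketch only; the helper names and the tolerance are assumptions, not the paper's implementation.

```python
import struct

def to_fp32(x: float) -> float:
    """Round-trip a Python float (IEEE-754 double) through single
    precision to emulate a reduced-precision datapath."""
    return struct.unpack('f', struct.pack('f', x))[0]

def rp_dwc(f, x, rel_tol=1e-5):
    """Duplication with Comparison, with the duplicate executed in
    reduced precision: a mismatch larger than the tolerance expected
    from precision loss is flagged as a suspected fault.
    (Hypothetical sketch, not the authors' code.)"""
    full = f(x)                       # primary computation, full precision
    reduced = to_fp32(f(to_fp32(x)))  # redundant copy, emulated fp32
    mismatch = abs(full - reduced) > rel_tol * max(abs(full), 1.0)
    return full, mismatch
```

The tolerance is the crux of the technique: set it too tight and ordinary rounding differences raise false alarms; too loose and small corruptions slip through.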

Adaptive Memory-Side Last-Level GPU Caching

Emerging GPU applications exhibit increasingly high computational demands, which have led GPU manufacturers to build GPUs with an increasingly large number of streaming multiprocessors (SMs). Providing data to the SMs at high bandwidth puts significant pressure on the memory hierarchy and the Network-on-Chip (NoC). Current GPUs typically partition the memory-side last-level cache (LLC) into equally sized slices that are shared by all SMs.
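The baseline organization described above can be illustrated with a simple line-interleaved address-to-slice mapping, a common scheme for equally sized shared slices. The function and its parameters are illustrative; real GPUs may hash additional address bits, and the line size here is an assumption.

```python
def llc_slice(addr: int, num_slices: int, line_bytes: int = 128) -> int:
    """Map a physical address to one of the equally sized memory-side
    LLC slices by interleaving at cache-line granularity (baseline
    sketch; not a specific GPU's hash function)."""
    return (addr // line_bytes) % num_slices
```

Under this mapping, consecutive cache lines land on consecutive slices, spreading bandwidth demand across the NoC, which is exactly the pressure point the abstract highlights.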

Rama Akkiraju

Rama Akkiraju is a vice president of AI for IT at NVIDIA. Prior to this, Rama held various leadership roles at IBM in a career spanning more than two decades. Most recently, she was an IBM Fellow, Master Inventor, and IBM Academy member, and served as CTO of the IBM Watson AI Operations product, which uses AI to optimize IT operations management.

DiGamma: Domain-aware Genetic Algorithm for HW-Mapping Co-optimization for DNN Accelerators

The design of DNN accelerators includes two key parts: HW resource configuration and mapping strategy. Intensive research has been conducted to optimize each of them independently. Unfortunately, optimizing both together is extremely challenging due to the vast, cross-coupled search space. To address this, in this paper we propose a HW-Mapping co-optimization framework: an efficient encoding of the immense design space constructed by HW and Mapping, and a domain-aware genetic algorithm, named DiGamma, with specialized operators for improving search efficiency.
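The search strategy can be pictured as an elitist genetic loop over a joint genome encoding both hardware knobs (e.g., a PE count and a buffer size) and mapping choices. This is a generic sketch under stated assumptions: DiGamma's actual encoding and specialized operators are more sophisticated, and the genome and fitness below are hypothetical.

```python
import random

def genetic_search(fitness, init_pop, mutate, crossover,
                   generations=50, elite=2):
    """Generic elitist genetic loop over a joint (HW config, mapping)
    genome. Illustrative only; not DiGamma's operators."""
    pop = list(init_pop)
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)  # best genomes first
        next_pop = pop[:elite]               # carry elites over unchanged
        while len(next_pop) < len(pop):
            # breed children from the current top genomes
            a, b = random.sample(pop[:max(4, elite)], 2)
            next_pop.append(mutate(crossover(a, b)))
        pop = next_pop
    return max(pop, key=fitness)
```

Because HW and mapping genes sit in the same genome, crossover and mutation explore the cross-coupled space jointly rather than fixing one side while optimizing the other.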

Self-Adaptive Reconfigurable Arrays (SARA): Learning Flexible GEMM Accelerator Configuration and Mapping-space using ML

This work demonstrates a scalable reconfigurable accelerator (RA) architecture designed to extract maximum performance and energy efficiency for GEMM workloads. We also present a self-adaptive (SA) unit that runs a learned model in hardware for one-shot configuration optimization, offloading the software stack and thus easing deployment of the proposed design.

Suraksha: A Quantitative AV Safety Evaluation Framework to Analyze Safety Implications of Perception Design Choices

This paper proposes Suraksha, an automated AV safety evaluation framework that quantifies and analyzes the sensitivity of AV safety to different design parameters. It employs a set of driving scenarios generated at a user-specified difficulty level. It enables the exploration of requirement trade-offs, either in existing AV implementations to find opportunities for improvement, or during the development process to explore the component-level requirements for an optimal and safe AV architecture.

Generating and Characterizing Scenarios for Safety Testing of Autonomous Vehicles

Extracting interesting scenarios from real-world data, as well as generating failure cases, is important for the development and testing of autonomous systems. We propose efficient mechanisms to both characterize and generate testing scenarios using a state-of-the-art driving simulator. For any scenario, our method generates a set of possible driving paths and identifies all the safe driving trajectories that can be taken starting at different times, in order to compute metrics that quantify the complexity of the scenario.
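One way such a complexity metric could be reduced to code: given, for each candidate start time, whether at least one safe trajectory exists from that point, score the scenario by the fraction of start times that leave the ego vehicle with no safe escape. This is a hypothetical metric in the spirit of the abstract, not the authors' definition.

```python
def scenario_complexity(safe_escape_by_start):
    """Hypothetical scenario-complexity score: the fraction of candidate
    start times for which no safe driving trajectory exists. 0.0 means a
    safe escape is always available; 1.0 means the scenario is inescapable."""
    if not safe_escape_by_start:
        return 0.0
    return sum(1 for ok in safe_escape_by_start if not ok) / len(safe_escape_by_start)
```

A scenario that starts benign and becomes inescapable would show a rising score as the start time is swept forward, which is the kind of characterization the abstract describes.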