GPU Domain Specialization via Composable On-Package Architecture

As GPUs scale their low-precision matrix math throughput to boost deep learning (DL) performance, they upset the balance between math throughput and memory system capabilities. We demonstrate that a converged GPU design that tries to address the diverging architectural requirements of FP32 (or larger)-based HPC workloads and FP16 (or smaller)-based DL workloads results in configurations that are sub-optimal for both application domains.
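
To see why the balance shifts, a back-of-the-envelope roofline calculation helps: the arithmetic intensity a kernel needs to stay compute-bound is the ratio of peak math throughput to memory bandwidth, and that ratio grows much faster for low-precision tensor math than for FP32. The sketch below uses assumed, illustrative peak numbers, not figures from the paper:

```cuda
// Back-of-the-envelope roofline arithmetic (illustrative numbers only):
// the minimum arithmetic intensity (FLOP/byte) a kernel needs to stay
// compute-bound is peak_flops / peak_bandwidth.
#include <cstdio>

int main() {
    // Hypothetical peak throughputs for one GPU generation (assumptions).
    const double fp32_tflops = 20.0;   // assumed FP32 FMA peak, TFLOP/s
    const double fp16_tflops = 320.0;  // assumed FP16 tensor-core peak, TFLOP/s
    const double dram_tbps   = 2.0;    // assumed DRAM bandwidth, TB/s

    // A kernel is memory-bound unless it performs at least this many
    // FLOPs per byte moved from DRAM.
    printf("FP32 break-even intensity: %5.1f FLOP/byte\n",
           fp32_tflops / dram_tbps);   // 10 FLOP/byte
    printf("FP16 break-even intensity: %5.1f FLOP/byte\n",
           fp16_tflops / dram_tbps);   // 160 FLOP/byte
    return 0;
}
```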

Need for Speed: Experiences Building a Trustworthy System-Level GPU Simulator

The demands of high-performance computing (HPC) and machine learning (ML) workloads have resulted in the rapid architectural evolution of GPUs over the last decade. The growing memory footprint and diversity of data types in these workloads have required GPUs to embrace micro-architectural heterogeneity and increased memory-system sophistication to scale performance. Effective simulation of new architectural features early in the design cycle enables quick and effective exploration of design trade-offs across this increasingly diverse set of workloads.

Full-Stack Memory Model Verification with TriCheck

Memory consistency models (MCMs) govern inter-module interactions in a shared memory system and are defined at various layers of the hardware-software stack. TriCheck is the first tool for full-stack MCM verification. Using TriCheck, we uncovered under-specifications in the draft RISC-V instruction set architecture (ISA) and identified flaws in previously “proven-correct” C11 compiler mappings.
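
The kind of test case involved can be illustrated with the classic message-passing idiom. The sketch below (an illustrative example, not a litmus test from the paper) uses C++11 atomics, which share their semantics with C11: if a compiler's mapping of release/acquire operations onto a target ISA's fences is flawed, the asserted outcome can fail even though the language-level model forbids it.

```cuda
// Message-passing litmus test in the style of the C11/C++11 tests that
// full-stack MCM tools compile down to a target ISA and check. If the
// compiler's mapping of release/acquire to ISA fences is wrong, the
// consumer could see flag == 1 but data == 0, which C11 forbids.
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> data{0};
std::atomic<int> flag{0};

void producer() {
    data.store(42, std::memory_order_relaxed);
    flag.store(1, std::memory_order_release);   // publishes data
}

void consumer() {
    if (flag.load(std::memory_order_acquire) == 1)       // synchronizes-with the release
        assert(data.load(std::memory_order_relaxed) == 42);  // must hold under C11
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join(); t2.join();
    return 0;
}
```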

PipeProof: Automated Memory Consistency Proofs for Microarchitectural Specifications

Memory consistency models (MCMs) specify rules that constrain the values that can be returned by load instructions in parallel programs. To ensure that parallel programs run correctly, verification of hardware MCM implementations would ideally be complete, i.e., verified as correct across all possible executions of all possible programs. However, no existing automated approach is capable of such complete verification.
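
As a concrete example of such rules, consider the classic store-buffering (SB) litmus test: sequential consistency forbids both loads from returning 0, while weaker models allow it. Complete verification means establishing the forbidden/allowed status of outcomes like this not just for one test but across all programs. The sketch below (illustrative, not from the paper) uses deliberately relaxed atomics, under which the weak outcome is permitted:

```cuda
// Store-buffering (SB) litmus test: the canonical example of an MCM
// constraining load values. Under sequential consistency, r0 == 0 && r1 == 0
// is forbidden; under weaker orderings (modeled here with relaxed atomics)
// it is allowed.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r0, r1;

void t0() { x.store(1, std::memory_order_relaxed); r0 = y.load(std::memory_order_relaxed); }
void t1() { y.store(1, std::memory_order_relaxed); r1 = x.load(std::memory_order_relaxed); }

int main() {
    // A single trial; a real litmus harness loops many times and tallies
    // how often each outcome appears.
    std::thread a(t0), b(t1);
    a.join(); b.join();
    printf("r0=%d r1=%d  (r0=0, r1=0 is forbidden under SC)\n", r0, r1);
    return 0;
}
```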

CheckMate: Automated Synthesis of Hardware Exploits and Security Litmus Tests

Recent research has uncovered a broad class of security vulnerabilities in which confidential data is leaked through programmer-observable microarchitectural state. In this paper, we present CheckMate, a rigorous approach and automated tool for determining if a microarchitecture is susceptible to specified classes of security exploits, and for synthesizing proof-of-concept exploit code when it is.
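
As an illustration of the exploit classes involved, the sketch below shows the well-known Spectre-v1 bounds-check-bypass gadget (a textbook example, not code synthesized by CheckMate): a mistrained branch lets a speculative out-of-bounds load encode secret data into cache state that a timing side channel can later observe.

```cuda
// Classic Spectre-v1 bounds-check-bypass gadget, shown as an example of
// the kind of program a "security litmus test" captures (illustrative
// sketch only).
#include <cstddef>
#include <cstdint>

uint8_t array1[16];
size_t  array1_size = 16;
uint8_t array2[256 * 64];  // probe array: one cache line per byte value

uint8_t victim(size_t x) {
    // The branch predictor can be trained so execution proceeds
    // speculatively even when x is out of bounds...
    if (x < array1_size) {
        // ...and this access then reads secret memory and leaves its value
        // encoded in which cache line of array2 was touched, observable
        // later via a timing side channel (e.g., Flush+Reload).
        return array2[array1[x] * 64];
    }
    return 0;
}
```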

RTLCheck: Verifying Memory Consistency in RTL Designs

Paramount to the viability of a parallel architecture is the correct implementation of its memory consistency model (MCM). Although tools exist for verifying consistency models at several design levels, a problematic verification gap exists between checking an abstract microarchitectural specification of a consistency model and verifying that the actual processor RTL implements it correctly.

A Formal Analysis of the NVIDIA PTX Memory Consistency Model

This paper presents the first formal analysis of the official memory consistency model for the NVIDIA PTX virtual ISA. Like other GPU memory models, the PTX memory model is weakly ordered but provides scoped synchronization primitives that enable GPU program threads to communicate through memory. However, unlike some competing GPU memory models, PTX does not require data race freedom, and this results in PTX using a fundamentally different (and more complicated) set of rules in its memory model.
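
In CUDA C++, PTX's scoped synchronization surfaces through libcu++'s scoped atomics, where the thread scope names the set of threads the ordering applies to (block, device, or system). The following minimal producer/consumer sketch, assuming a recent CUDA toolkit in which cuda::atomic_ref is available, shows device-scope release/acquire communication:

```cuda
#include <cstdio>
#include <cuda/atomic>

__device__ int payload;
__device__ int flag = 0;  // plain storage, accessed through a scoped atomic_ref

__global__ void producer() {
    payload = 42;
    cuda::atomic_ref<int, cuda::thread_scope_device> f(flag);
    // Device-scope release: any thread on this GPU that later acquires the
    // flag is guaranteed to observe the payload write.
    f.store(1, cuda::std::memory_order_release);
}

__global__ void consumer(int* out) {
    cuda::atomic_ref<int, cuda::thread_scope_device> f(flag);
    while (f.load(cuda::std::memory_order_acquire) != 1) { }  // spin until published
    *out = payload;  // ordered after the acquire, so it must read 42
}

int main() {
    int* out;
    cudaMalloc(&out, sizeof(int));
    producer<<<1, 1>>>();
    consumer<<<1, 1>>>(out);  // same stream, so it runs after producer completes
    int host = 0;
    cudaMemcpy(&host, out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(out);
    printf("payload observed: %d\n", host);
    return 0;
}
```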

The Architectural Implications of Distributed Reinforcement Learning on CPU-GPU Systems

With deep reinforcement learning (RL) methods achieving results that exceed human capabilities in games, robotics, and simulated environments, continued scaling of RL training is crucial to its deployment in solving complex real-world problems. However, improving the performance scalability and power efficiency of RL training through understanding the architectural implications of CPU-GPU systems remains an open problem.

Snatch: Opportunistically Reassigning Power Allocation between Processor and Memory in 3D Stacks

The pin count largely determines the cost of a chip package, which is often comparable to the cost of a die. In 3D processor-memory designs, power and ground (P/G) pins can account for the majority of the pins. This is because packages include separate pins for the disjoint processor and memory power delivery networks (PDNs). Supporting separate PDNs and P/G pins for processor and memory is inefficient, as each set has to be provisioned for the worst-case power delivery requirements.
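
The inefficiency can be seen with a toy calculation (the numbers below are made up for illustration): because compute-bound and memory-bound phases rarely peak at the same time, provisioning for the combined worst case requires less power delivery than provisioning each PDN for its individual worst case.

```cuda
// Toy illustration (made-up numbers) of why separate worst-case PDN
// provisioning over-allocates: the combined worst case across phases is
// lower than the sum of the per-domain worst cases.
#include <algorithm>
#include <cstdio>

int main() {
    // Assumed power draw in watts: {processor, memory}.
    const double compute_phase[2] = {90.0, 30.0};  // compute-bound phase
    const double memory_phase[2]  = {40.0, 70.0};  // memory-bound phase

    // Separate PDNs: each provisioned for its own worst case.
    const double separate = std::max(compute_phase[0], memory_phase[0])
                          + std::max(compute_phase[1], memory_phase[1]);  // 90 + 70
    // Shared provisioning: sized for the worst simultaneous total.
    const double combined = std::max(compute_phase[0] + compute_phase[1],
                                     memory_phase[0] + memory_phase[1]);  // max(120, 110)

    printf("separate worst-case provisioning: %.0f W\n", separate);  // 160 W
    printf("combined worst-case provisioning: %.0f W\n", combined);  // 120 W
    return 0;
}
```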

Characterizing and Mitigating Soft Errors in GPU DRAM

While graphics processing units (GPUs) are used in high-reliability systems, wide GPU dynamic random-access memory (DRAM) interfaces make error protection difficult, as wide-device correction through error checking and correcting (ECC) is expensive and impractical. This challenge is compounded by worsening relative rates of multibit DRAM errors and increasing GPU memory capacities. This work uses high-energy neutron beam tests to inform the design and evaluation of GPU DRAM error-protection mechanisms.
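
For context on the overhead at stake, the standard SEC-DED (single-error-correct, double-error-detect) construction used in DIMM-style ECC needs r + 1 check bits for d data bits, where 2^r >= d + r + 1; the textbook sketch below (not the paper's mechanism) computes that overhead for several word widths. Because SEC-DED corrects only a single flipped bit per word, a failing wide DRAM device that corrupts many bits of one word defeats it, which is why wide-device correction requires far costlier symbol-based codes.

```cuda
#include <cstdio>

// Smallest r with 2^r >= d + r + 1 (Hamming bound), plus one overall
// parity bit to extend single-error correction to double-error detection.
int secded_check_bits(int data_bits) {
    int r = 0;
    while ((1 << r) < data_bits + r + 1) ++r;
    return r + 1;
}

int main() {
    const int widths[] = {8, 16, 32, 64, 128, 256};
    for (int d : widths) {
        int c = secded_check_bits(d);
        // e.g., 64 data bits -> 8 check bits: the classic 72-bit ECC word.
        printf("%3d data bits -> %2d check bits (%4.1f%% overhead)\n",
               d, c, 100.0 * c / d);
    }
    return 0;
}
```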