Bit-Plane Compression: Transforming Data for Better Compression in Many-core Architectures

As key applications become more data-intensive and the computational throughput of processors increases, the amount of data to be transferred in modern memory subsystems grows. Increasing physical bandwidth to keep pace with this growing demand, however, is challenging due to strict area and energy limitations. This paper presents a novel and lightweight compression algorithm, Bit-Plane Compression (BPC), to increase the effective memory bandwidth.
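
The abstract does not describe the transformation itself; as a rough, hypothetical sketch of what a bit-plane style transform does (the function name and block layout below are ours, not the paper's), one can delta-encode the words of a cache block and transpose the deltas into bit-planes, so that similar neighboring values collapse into long runs of all-zero planes. The published BPC pipeline adds further stages (e.g., XOR of adjacent planes and pattern encoding) that are omitted here.

```cpp
#include <cstdint>
#include <vector>

// Illustrative core of a bit-plane transform (simplified; not the full BPC encoder).
// Treat a 128-byte cache block as 32 x 32-bit words.
constexpr int kWords = 32;

std::vector<uint64_t> bit_plane_transform(const uint32_t (&block)[kWords]) {
    // 1) Delta-encode: keep the first word as a base, store signed differences.
    //    64-bit deltas comfortably hold the 33-bit signed difference range.
    int64_t delta[kWords - 1];
    for (int i = 1; i < kWords; ++i)
        delta[i - 1] = static_cast<int64_t>(block[i]) - static_cast<int64_t>(block[i - 1]);

    // 2) Bit-plane transpose: plane p collects bit p of every delta.
    //    Similar values yield deltas whose high-order planes are all zero,
    //    which a simple zero-plane / run-length encoder can then squeeze out.
    std::vector<uint64_t> planes(33, 0);   // 33 planes cover the signed delta range
    for (int p = 0; p < 33; ++p)
        for (int i = 0; i < kWords - 1; ++i)
            planes[p] |= ((static_cast<uint64_t>(delta[i]) >> p) & 1ull) << i;
    return planes;
}
```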

Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training

Deploying deep learning (DL) models across multiple compute devices to train large and complex models continues to grow in importance because of the demand for faster and more frequent training. Data parallelism (DP) is the most widely used parallelization strategy, but as the number of devices in data-parallel training grows, so does the communication overhead between devices. Additionally, a larger aggregate batch size per step leads to a loss of statistical efficiency, i.e., more epochs are required to converge to a desired accuracy.
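
As a toy illustration of both effects (our own sketch, not code from the paper), one synchronous data-parallel step computes gradients independently on each device's local mini-batch and then averages them across devices; the averaging is the inter-device communication, and the effective batch consumed per optimizer step is the per-device batch multiplied by the number of devices.

```cpp
#include <cstddef>
#include <vector>

// Toy model of one synchronous data-parallel step (illustrative only).
// Every "device" holds gradients for the same model; averaging them plays the
// role of the all-reduce that real DP training performs each step.
std::vector<float> data_parallel_step(const std::vector<std::vector<float>>& per_device_grads,
                                      std::size_t per_device_batch) {
    const std::size_t num_devices = per_device_grads.size();
    const std::size_t model_size  = per_device_grads[0].size();

    // The aggregate batch behind one optimizer step grows linearly with devices,
    // e.g. 64 devices x 32 samples/device = 2048 samples per step.
    const std::size_t aggregate_batch = num_devices * per_device_batch;
    (void)aggregate_batch;

    // Gradient exchange: every device contributes model_size values, so the
    // communication work per step grows as more devices participate.
    std::vector<float> averaged(model_size, 0.0f);
    for (const auto& grads : per_device_grads)
        for (std::size_t i = 0; i < model_size; ++i)
            averaged[i] += grads[i];
    for (float& v : averaged) v /= static_cast<float>(num_devices);
    return averaged;
}
```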

BYOC: A "Bring Your Own Core" Framework for Heterogeneous-ISA Research

Heterogeneous architectures and heterogeneous-ISA designs are growing areas of computer architecture and system software research. Unfortunately, this line of research is significantly hindered by the lack of experimental systems and modifiable hardware frameworks. This work proposes BYOC, a "Bring Your Own Core" framework that is specifically designed to enable heterogeneous-ISA and heterogeneous system research.

GPU Domain Specialization via Composable On-Package Architecture

As GPUs scale their low-precision matrix math throughput to boost deep learning (DL) performance, they upset the balance between math throughput and memory system capabilities. We demonstrate that a converged GPU design, which tries to address the diverging architectural requirements of FP32 (or larger)-based HPC and FP16 (or smaller)-based DL workloads, results in configurations that are sub-optimal for both application domains.

Need for Speed: Experiences Building a Trustworthy System-Level GPU Simulator

The demands of high-performance computing (HPC) and machine learning (ML) workloads have resulted in the rapid architectural evolution of GPUs over the last decade. The growing memory footprint and diversity of data types in these workloads have required GPUs to embrace micro-architectural heterogeneity and increased memory system sophistication to scale performance. Effective simulation of new architectural features early in the design cycle enables quick and effective exploration of design trade-offs across this increasingly diverse set of workloads.

Full-Stack Memory Model Verification with TriCheck

Memory consistency models (MCMs) govern inter-module interactions in a shared memory system and are defined at various layers of the hardware-software stack. TriCheck is the first tool for full-stack MCM verification. Using TriCheck, we uncovered under-specifications in the draft RISC-V instruction set architecture (ISA) and identified flaws in previously “proven-correct” C11 compiler mappings.

PipeProof: Automated Memory Consistency Proofs for Microarchitectural Specifications

Memory consistency models (MCMs) specify rules which constrain the values that can be returned by load instructions in parallel programs. To ensure that parallel programs run correctly, verification of hardware MCM implementations would ideally be complete, i.e., the implementation would be verified as correct across all possible executions of all possible programs. However, no existing automated approach is capable of such complete verification.
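
As a concrete example of such a rule (our illustration, not taken from the paper), the classic "message passing" litmus test below asks whether a single execution may ever observe r1 == 1 together with r2 == 0: sequential consistency forbids that outcome, while weaker models, and the relaxed C++ orderings used here, may permit it.

```cpp
#include <atomic>
#include <thread>

// Message-passing (MP) litmus test: an MCM must state whether the outcome
// r1 == 1 && r2 == 0 is ever allowed.  It is forbidden under sequential
// consistency; the relaxed orderings used here make it permissible in C++.
std::atomic<int> data{0}, flag{0};
int r1 = 0, r2 = 0;

void producer() {
    data.store(1, std::memory_order_relaxed);   // write the message
    flag.store(1, std::memory_order_relaxed);   // then signal that it is ready
}

void consumer() {
    r1 = flag.load(std::memory_order_relaxed);  // saw the signal?
    r2 = data.load(std::memory_order_relaxed);  // ...then is the message visible?
}

int main() {
    std::thread t0(producer), t1(consumer);
    t0.join();
    t1.join();
    // An MCM constrains which (r1, r2) pairs a run of this program may produce.
}
```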

CheckMate: Automated Synthesis of Hardware Exploits and Security Litmus Tests

Recent research has uncovered a broad class of security vulnerabilities in which confidential data is leaked through programmer-observable microarchitectural state. In this paper, we present CheckMate, a rigorous approach and automated tool for determining if a microarchitecture is susceptible to specified classes of security exploits, and for synthesizing proof-of-concept exploit code when it is.
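
For context (this illustration is ours, not an example from the paper), the best-known exploits in this class are transient-execution attacks such as Spectre; the widely published Spectre-v1 bounds-check-bypass gadget below shows how a secret byte can be encoded into programmer-observable cache state, using the conventional array names and sizes from public write-ups.

```cpp
#include <cstddef>
#include <cstdint>

// Classic Spectre-v1 (bounds-check-bypass) gadget, reproduced only to
// illustrate the exploit class: under branch misprediction the bounds check
// is bypassed transiently, array1[x] is read out of bounds, and its value is
// encoded into the cache by the dependent access to array2.  A later
// cache-timing probe of array2 recovers the secret byte.
constexpr std::size_t kArray1Size = 16;
uint8_t array1[kArray1Size];
uint8_t array2[256 * 512];
volatile uint8_t sink;

void victim(std::size_t x) {
    if (x < kArray1Size) {                  // architecturally enforced bounds check...
        sink = array2[array1[x] * 512];     // ...ignored transiently under misprediction
    }
}
```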

RTLCheck: Verifying Memory Consistency in RTL Designs

Paramount to the viability of a parallel architecture is the correct implementation of its memory consistency model (MCM). Although tools exist for verifying consistency models at several design levels, a problematic verification gap exists between checking an abstract microarchitectural specification of a consistency model and verifying that the actual processor RTL implements it correctly.

A Formal Analysis of the NVIDIA PTX Memory Consistency Model

This paper presents the first formal analysis of the official memory consistency model for the NVIDIA PTX virtual ISA. Like other GPU memory models, the PTX memory model is weakly ordered but provides scoped synchronization primitives that enable GPU program threads to communicate through memory. However, unlike some competing GPU memory models, PTX does not require data race freedom, which results in a fundamentally different (and more complicated) set of rules in its memory model.
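
To make "scoped synchronization primitives" concrete, the sketch below (our illustration, not an example from the paper) uses libcu++'s cuda::atomic with a device-scope release/acquire pair, which lowers to the scoped PTX memory operations the model formalizes; the exact spellings (e.g., cuda::std::memory_order_release) are as we understand current libcu++ and may differ across versions.

```cpp
#include <cuda/atomic>

// Scoped message passing between two GPU threads that may live in different
// thread blocks: the release store and acquire load are ordered at device
// ("gpu") scope, so a consumer that observes flag == 1 must also observe
// payload == 42.  Narrower (block) or wider (system) scopes are expressed
// with other cuda::thread_scope_* parameters.
__device__ cuda::atomic<int, cuda::thread_scope_device> flag{0};
__device__ int payload;

__device__ void producer() {
    payload = 42;
    flag.store(1, cuda::std::memory_order_release);             // release, device scope
}

__device__ void consumer(int* out) {
    while (flag.load(cuda::std::memory_order_acquire) != 1) {}  // acquire, device scope
    *out = payload;  // guaranteed to read 42 once the flag has been observed
}

__global__ void mp_litmus(int* out) {
    if (blockIdx.x == 0) producer();
    else                 consumer(out);
}
```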