Bit-Plane Compression: Transforming Data for Better Compression in Many-core Architectures

As key applications become more data-intensive and the computational throughput of processors increases, the amount of data to be transferred in modern memory subsystems grows. Increasing physical bandwidth to keep pace with this growing demand, however, is challenging due to strict area and energy limitations. This paper presents a novel and lightweight compression algorithm, Bit-Plane Compression (BPC), to increase the effective memory bandwidth.
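
The abstract does not describe the transformation itself; as a rough, hypothetical sketch of what a bit-plane style transform does (the function name and block layout below are ours, not the paper's), one can delta-encode the words of a cache block and transpose the deltas into bit-planes, so that similar neighboring values collapse into long runs of all-zero planes. The published BPC pipeline adds further stages (e.g., XOR of adjacent planes and pattern encoding) that are omitted here.

```cpp
#include <cstdint>
#include <vector>

// Illustrative core of a bit-plane transform (simplified; not the full BPC encoder).
// Treat a 128-byte cache block as 32 x 32-bit words.
constexpr int kWords = 32;

std::vector<uint64_t> bit_plane_transform(const uint32_t (&block)[kWords]) {
    // 1) Delta-encode: keep the first word as a base, store signed differences.
    //    64-bit deltas comfortably hold the 33-bit signed difference range.
    int64_t delta[kWords - 1];
    for (int i = 1; i < kWords; ++i)
        delta[i - 1] = static_cast<int64_t>(block[i]) - static_cast<int64_t>(block[i - 1]);

    // 2) Bit-plane transpose: plane p collects bit p of every delta.
    //    Similar values yield deltas whose high-order planes are all zero,
    //    which a simple zero-plane / run-length encoder can then squeeze out.
    std::vector<uint64_t> planes(33, 0);   // 33 planes cover the signed delta range
    for (int p = 0; p < 33; ++p)
        for (int i = 0; i < kWords - 1; ++i)
            planes[p] |= ((static_cast<uint64_t>(delta[i]) >> p) & 1ull) << i;
    return planes;
}
```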

Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training

Deploying deep learning (DL) models across multiple compute devices to train large and complex models continues to grow in importance because of the demand for faster and more frequent training. Data parallelism (DP) is the most widely used parallelization strategy, but as the number of devices in data-parallel training grows, so does the communication overhead between devices. Additionally, a larger aggregate batch size per step leads to a loss of statistical efficiency, i.e., more epochs are required to converge to a desired accuracy.
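
As a toy illustration of both effects (our own sketch, not code from the paper), one synchronous data-parallel step computes gradients independently on each device's local mini-batch and then averages them across devices; the averaging is the inter-device communication, and the effective batch consumed per optimizer step is the per-device batch multiplied by the number of devices.

```cpp
#include <cstddef>
#include <vector>

// Toy model of one synchronous data-parallel step (illustrative only).
// Every "device" holds gradients for the same model; averaging them plays the
// role of the all-reduce that real DP training performs each step.
std::vector<float> data_parallel_step(const std::vector<std::vector<float>>& per_device_grads,
                                      std::size_t per_device_batch) {
    const std::size_t num_devices = per_device_grads.size();
    const std::size_t model_size  = per_device_grads[0].size();

    // The aggregate batch behind one optimizer step grows linearly with devices,
    // e.g. 64 devices x 32 samples/device = 2048 samples per step.
    const std::size_t aggregate_batch = num_devices * per_device_batch;
    (void)aggregate_batch;

    // Gradient exchange: every device contributes model_size values, so the
    // communication work per step grows as more devices participate.
    std::vector<float> averaged(model_size, 0.0f);
    for (const auto& grads : per_device_grads)
        for (std::size_t i = 0; i < model_size; ++i)
            averaged[i] += grads[i];
    for (float& v : averaged) v /= static_cast<float>(num_devices);
    return averaged;
}
```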

BYOC: A "Bring Your Own Core" Framework for Heterogeneous-ISA Research

Heterogeneous architectures and heterogeneous-ISA designs are growing areas of computer architecture and system software research. Unfortunately, this line of research is significantly hindered by the lack of experimental systems and modifiable hardware frameworks. This work proposes BYOC, a "Bring Your Own Core" framework that is specifically designed to enable heterogeneous-ISA and heterogeneous system research.

GPU Domain Specialization via Composable On-Package Architecture

As GPUs scale their low-precision matrix math throughput to boost deep learning (DL) performance, they upset the balance between math throughput and memory system capabilities. We demonstrate that a converged GPU design, which tries to address the diverging architectural requirements of FP32 (or larger)-based HPC and FP16 (or smaller)-based DL workloads, results in configurations that are sub-optimal for both application domains.

Need for Speed: Experiences Building a Trustworthy System-Level GPU Simulator

The demands of high-performance computing (HPC) and machine learning (ML) workloads have resulted in the rapid architectural evolution of GPUs over the last decade. The growing memory footprint and diversity of data types in these workloads have required GPUs to embrace micro-architectural heterogeneity and increased memory system sophistication to scale performance. Effective simulation of new architectural features early in the design cycle enables quick and effective exploration of design trade-offs across this increasingly diverse set of workloads.

Full-Stack Memory Model Verification with TriCheck

Memory consistency models (MCMs) govern inter-module interactions in a shared memory system and are defined at various layers of the hardware-software stack. TriCheck is the first tool for full-stack MCM verification. Using TriCheck, we uncovered under-specifications in the draft RISC-V instruction set architecture (ISA) and identified flaws in previously “proven-correct” C11 compiler mappings.

PipeProof: Automated Memory Consistency Proofs for Microarchitectural Specifications

Memory consistency models (MCMs) specify rules which constrain the values that can be returned by load instructions in parallel programs. To ensure that parallel programs run correctly, verification of hardware MCM implementations would ideally be complete, i.e., the implementation would be verified as correct across all possible executions of all possible programs. However, no existing automated approach is capable of such complete verification.
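
As a concrete example of such a rule (our illustration, not taken from the paper), the classic "message passing" litmus test below asks whether a single execution may ever observe r1 == 1 together with r2 == 0: sequential consistency forbids that outcome, while weaker models, and the relaxed C++ orderings used here, may permit it.

```cpp
#include <atomic>
#include <thread>

// Message-passing (MP) litmus test: an MCM must state whether the outcome
// r1 == 1 && r2 == 0 is ever allowed.  It is forbidden under sequential
// consistency; the relaxed orderings used here make it permissible in C++.
std::atomic<int> data{0}, flag{0};
int r1 = 0, r2 = 0;

void producer() {
    data.store(1, std::memory_order_relaxed);   // write the message
    flag.store(1, std::memory_order_relaxed);   // then signal that it is ready
}

void consumer() {
    r1 = flag.load(std::memory_order_relaxed);  // saw the signal?
    r2 = data.load(std::memory_order_relaxed);  // ...then is the message visible?
}

int main() {
    std::thread t0(producer), t1(consumer);
    t0.join();
    t1.join();
    // An MCM constrains which (r1, r2) pairs a run of this program may produce.
}
```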

CheckMate: Automated Synthesis of Hardware Exploits and Security Litmus Tests

Recent research has uncovered a broad class of security vulnerabilities in which confidential data is leaked through programmer-observable microarchitectural state. In this paper, we present CheckMate, a rigorous approach and automated tool for determining if a microarchitecture is susceptible to specified classes of security exploits, and for synthesizing proof-of-concept exploit code when it is.
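
For context (this illustration is ours, not an example from the paper), the best-known exploits in this class are transient-execution attacks such as Spectre; the widely published Spectre-v1 bounds-check-bypass gadget below shows how a secret byte can be encoded into programmer-observable cache state, using the conventional array names and sizes from public write-ups.

```cpp
#include <cstddef>
#include <cstdint>

// Classic Spectre-v1 (bounds-check-bypass) gadget, reproduced only to
// illustrate the exploit class: under branch misprediction the bounds check
// is bypassed transiently, array1[x] is read out of bounds, and its value is
// encoded into the cache by the dependent access to array2.  A later
// cache-timing probe of array2 recovers the secret byte.
constexpr std::size_t kArray1Size = 16;
uint8_t array1[kArray1Size];
uint8_t array2[256 * 512];
volatile uint8_t sink;

void victim(std::size_t x) {
    if (x < kArray1Size) {                  // architecturally enforced bounds check...
        sink = array2[array1[x] * 512];     // ...ignored transiently under misprediction
    }
}
```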

RTLCheck: Verifying Memory Consistency in RTL Designs

Paramount to the viability of a parallel architecture is the correct implementation of its memory consistency model (MCM). Although tools exist for verifying consistency models at several design levels, a problematic verification gap exists between checking an abstract microarchitectural specification of a consistency model and verifying that the actual processor RTL implements it correctly.

A Formal Analysis of the NVIDIA PTX Memory Consistency Model

This paper presents the first formal analysis of the official memory consistency model for the NVIDIA PTX virtual ISA. Like other GPU memory models, the PTX memory model is weakly ordered but provides scoped synchronization primitives that enable GPU program threads to communicate through memory. However, unlike some competing GPU memory models, PTX does not require data race freedom, which results in a fundamentally different (and more complicated) set of rules in its memory model.
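
To make "scoped synchronization primitives" concrete, the sketch below (our illustration, not an example from the paper) uses libcu++'s cuda::atomic with a device-scope release/acquire pair, which lowers to the scoped PTX memory operations the model formalizes; the exact spellings (e.g., cuda::std::memory_order_release) are as we understand current libcu++ and may differ across versions.

```cpp
#include <cuda/atomic>

// Scoped message passing between two GPU threads that may live in different
// thread blocks: the release store and acquire load are ordered at device
// ("gpu") scope, so a consumer that observes flag == 1 must also observe
// payload == 42.  Narrower (block) or wider (system) scopes are expressed
// with other cuda::thread_scope_* parameters.
__device__ cuda::atomic<int, cuda::thread_scope_device> flag{0};
__device__ int payload;

__device__ void producer() {
    payload = 42;
    flag.store(1, cuda::std::memory_order_release);             // release, device scope
}

__device__ void consumer(int* out) {
    while (flag.load(cuda::std::memory_order_acquire) != 1) {}  // acquire, device scope
    *out = payload;  // guaranteed to read 42 once the flag has been observed
}

__global__ void mp_litmus(int* out) {
    if (blockIdx.x == 0) producer();
    else                 consumer(out);
}
```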