| Research

RTLCheck: Verifying Memory Consistency in RTL Designs

Paramount to the viability of a parallel architecture is the correct implementation of its memory consistency model (MCM). Although tools exist for verifying consistency models at several design levels, a problematic verification gap exists between checking an abstract microarchitectural specification of a consistency model and verifying that the actual processor RTL implements it correctly.

Read more about RTLCheck: Verifying Memory Consistency in RTL Designs

A Formal Analysis of the NVIDIA PTX Memory Consistency Model.

This paper presents the first formal analysis of the official memory consistency model for the NVIDIA PTX virtual ISA. Like other GPU memory models, the PTX memory model is weakly ordered but provides scoped synchronization primitives that enable GPU program threads to communicate through memory. However, unlike some competing GPU memory models, PTX does not require data race freedom, and this results in PTX using a fundamentally different (and more complicated) set of rules in its memory model.

Read more about A Formal Analysis of the NVIDIA PTX Memory Consistency Model.

The Architectural Implications of Distributed Reinforcement Learning on CPU-GPU Systems

With deep reinforcement learning (RL) methods achieving results that exceed human capabilities in games, robotics, and simulated environments, continued scaling of RL training is crucial to its deployment in solving complex real-world problems. However, improving the performance scalability and power efﬁciency of RL training through understanding the architectural implications of CPU-GPU systems remains an open problem.

Read more about The Architectural Implications of Distributed Reinforcement Learning on CPU-GPU Systems

Snatch: Opportunistically Reassigning Power Allocation between Processor and Memory in 3D Stacks

The pin count largely determines the cost of a chip package, which is often comparable to the cost of a die. In 3D processor-memory designs, power and ground (P/G) pins can account for the majority of the pins. This is because packages include separate pins for the disjoint processor and memory power delivery networks (PDNs). Supporting separate PDNs and P/G pins for processor and memory is inefficient, as each set has to be provisioned for the worst-case power delivery requirements.

Read more about Snatch: Opportunistically Reassigning Power Allocation between Processor and Memory in 3D Stacks

Characterizing and Mitigating Soft Errors in GPU DRAM

While graphics processing units (GPUs) are used in high-reliability systems,wide GPU dynamic random-access memory (DRAM) interfaces make error protection difficult, as wide-device correction through error checking and correcting (ECC) is expensive and impractical. This challenge is compounded by worsening relative rates of multibit DRAM errors and increasing GPU memory capacities. This work uses high-energy neutron beam tests to inform the design and evaluation of GPU DRAM error-protection mechanisms.

Read more about Characterizing and Mitigating Soft Errors in GPU DRAM

Understanding the Future of Energy Efficiency in Multi-Module GPUs.

As Moore’s law slows down, GPUs must pivot towards multi-module designs to continue scaling performance at historical rates. Prior work on multi-module GPUs has focused on performance, while largely ignoring the issue of energy efﬁciency. In this work, we propose a new metric for GPU efﬁciency called EDP Scaling Efﬁciency that quantiﬁes the effects of both strong performance scaling and overall energy efﬁciency in these designs. To enable this analysis, we develop a novel top-down GPU energy estimation framework that is accurate within 10% of a recent GPU design.

Read more about Understanding the Future of Energy Efficiency in Multi-Module GPUs.

Xylem: Enhancing Vertical Thermal Conduction in 3D Processor-Memory Stacks

In upcoming architectures that stack processor and DRAM dies, temperatures are higher because of the increased transistor density and the high inter-layer thermal resistance. However, past research has underestimated the extent of the thermal bottleneck. Recent experimental work shows that the Die-to-Die (D2D) layers hinder effective heat transfer, likely leading to the capping of core frequencies. To address this problem, in this paper, we first show how to create pillars of high thermal conduction from the processor die to the heat sink.

Read more about Xylem: Enhancing Vertical Thermal Conduction in 3D Processor-Memory Stacks

SEC-BADAEC: An Efficient ECC With No Vacancy for Strong Memory Protection

Shrinking process technology and rising memory densities have made memories increasingly vulnerable to errors. Accordingly, DRAM vendors have introduced On-die Error Correction Code (O-ECC) to protect data against the growing number of errors. Current O-ECC provides weak Single Error Correction (SEC), but future memories will require stronger protection as error rates rise. This paper proposes a novel ECC, called Single Error Correction--Byte-Aligned Double Adjacent Error Correction (SEC-BADAEC), and its construction algorithm to improve memory reliability.

Read more about SEC-BADAEC: An Efficient ECC With No Vacancy for Strong Memory Protection

vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design.

The most widely used machine learning frameworks require users to carefully tune their memory usage so that the deep neural network (DNN) fits into the DRAM capacity of a GPU. This restriction hampers a researcher’s flexibility to study different machine learning algorithms, forcing them to either use a less desirable network architecture or parallelize the processing across multiple GPUs. We propose a runtime memory manager that virtualizes the memory usage of DNNs such that both GPU and CPU memory can simultaneously be utilized for training larger DNNs.

Read more about vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design.

A Patch Memory System For Image Processing and Computer Vision.

From self-driving cars to high dynamic range (HDR) imaging, the demand for image-based applications is growing quickly. In mobile systems, these applications place particular strain on performance and energy efficiency. As traditional memory systems are optimized for 1D memory access, they are unable to efficiently exploit the multi-dimensional locality characteristics of image-based applications which often operate on sub-regions of 2D and 3D image data.

Read more about A Patch Memory System For Image Processing and Computer Vision.

Subscribe to