Understanding the Future of Energy Efficiency in Multi-Module GPUs.

As Moore’s law slows down, GPUs must pivot towards multi-module designs to continue scaling performance at historical rates. Prior work on multi-module GPUs has focused on performance, while largely ignoring the issue of energy efficiency. In this work, we propose a new metric for GPU efficiency called EDP Scaling Efficiency that quantifies the effects of both strong performance scaling and overall energy efficiency in these designs. To enable this analysis, we develop a novel top-down GPU energy estimation framework that is accurate within 10% of a recent GPU design.

Xylem: Enhancing Vertical Thermal Conduction in 3D Processor-Memory Stacks

In upcoming architectures that stack processor and DRAM dies, temperatures are higher because of the increased transistor density and the high inter-layer thermal resistance. However, past research has underestimated the extent of the thermal bottleneck. Recent experimental work shows that the Die-to-Die (D2D) layers hinder effective heat transfer, likely leading to the capping of core frequencies. To address this problem, in this paper, we first show how to create pillars of high thermal conduction from the processor die to the heat sink.

SEC-BADAEC: An Efficient ECC With No Vacancy for Strong Memory Protection

Shrinking process technology and rising memory densities have made memories increasingly vulnerable to errors. Accordingly, DRAM vendors have introduced On-die Error Correction Code (O-ECC) to protect data against the growing number of errors. Current O-ECC provides weak Single Error Correction (SEC), but future memories will require stronger protection as error rates rise. This paper proposes a novel ECC, called Single Error Correction--Byte-Aligned Double Adjacent Error Correction (SEC-BADAEC), and its construction algorithm to improve memory reliability.

vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design.

The most widely used machine learning frameworks require users to carefully tune their memory usage so that the deep neural network (DNN) fits into the DRAM capacity of a GPU. This restriction hampers a researcher’s flexibility to study different machine learning algorithms, forcing them to either use a less desirable network architecture or parallelize the processing across multiple GPUs. We propose a runtime memory manager that virtualizes the memory usage of DNNs such that both GPU and CPU memory can simultaneously be utilized for training larger DNNs.

A Patch Memory System For Image Processing and Computer Vision.

From self-driving cars to high dynamic range (HDR) imaging, the demand for image-based applications is growing quickly. In mobile systems, these applications place particular strain on performance and energy efficiency. As traditional memory systems are optimized for 1D memory access, they are unable to efficiently exploit the multi-dimensional locality characteristics of image-based applications which often operate on sub-regions of 2D and 3D image data.

ML-driven Malware that Targets AV Safety

Ensuring the safety of autonomous vehicles (AVs) is critical for their mass deployment and public adoption. However, security attacks that violate safety constraints and cause accidents are a significant deterrent to achieving public trust in AVs, and that hinders a vendor’s ability to deploy AVs. Creating a security hazard that results in a severe safety compromise (for example, an accident) is compelling from an attacker’s perspective.

AV-FUZZER: Finding Safety Violations in Autonomous Driving Systems

This paper proposes AV-FUZZER, a testing framework, to find the safety violations of an autonomous vehicle (AV) in the presence of an evolving traffic environment. We perturb the driving maneuvers of traffic participants to create situations in which an AV can run into safety violations. To optimally search for the perturbations to be introduced, we leverage domain knowledge of vehicle dynamics and genetic algorithm to minimize the safety potential of an AV over its projected trajectory.

Security Verification through Automatic Hardware-Aware Exploit Synthesis: The CheckMate Approach.

Many hardware security exploits result from the combination of well-known attack classes with newly exploited hardware features. CheckMate is an approach and automated tool for evaluating microarchitectural susceptibility to specified attack classes, and for synthesizing proof-of-concept exploit code for susceptible designs.

P-OPT: Practical Optimal Cache Replacement for Graph Analytics

Graph analytics is an important workload that achieves suboptimal performance due to poor cache locality. State-of-the-art cache replacement policies fail to capture the highly dynamic and input-specific reuse patterns of graph application data. The main insight of this work is that for graph applications, the transpose of a graph succinctly represents the next references of all vertices in a graph execution; enabling an efficient emulation of Belady's MIN replacement policy.

HMG: Extending Cache Coherence Protocols Across Modern Hierarchical Multi-GPU Systems

Prior work on GPU cache coherence has shown that simple hardware- or software-based protocols can be more than sufficient. However, in recent years, features such as multi-chip modules have added deeper hierarchy and non-uniformity into GPU memory systems. GPU programming models have chosen to expose this non-uniformity directly to the end user through scoped memory consistency models. As a result, there is room to improve upon earlier coherence protocols that were designed only for flat single-GPU hierarchies and/or simpler memory consistency models.