Security Verification through Automatic Hardware-Aware Exploit Synthesis: The CheckMate Approach.

Many hardware security exploits result from the combination of well-known attack classes with newly exploited hardware features. CheckMate is an approach and automated tool for evaluating microarchitectural susceptibility to specified attack classes, and for synthesizing proof-of-concept exploit code for susceptible designs.

P-OPT: Practical Optimal Cache Replacement for Graph Analytics

Graph analytics is an important workload that achieves suboptimal performance due to poor cache locality. State-of-the-art cache replacement policies fail to capture the highly dynamic and input-specific reuse patterns of graph application data. The main insight of this work is that for graph applications, the transpose of a graph succinctly represents the next references of all vertices in a graph execution; enabling an efficient emulation of Belady's MIN replacement policy.

HMG: Extending Cache Coherence Protocols Across Modern Hierarchical Multi-GPU Systems

Prior work on GPU cache coherence has shown that simple hardware- or software-based protocols can be more than sufficient. However, in recent years, features such as multi-chip modules have added deeper hierarchy and non-uniformity into GPU memory systems. GPU programming models have chosen to expose this non-uniformity directly to the end user through scoped memory consistency models. As a result, there is room to improve upon earlier coherence protocols that were designed only for flat single-GPU hierarchies and/or simpler memory consistency models.

Improving Locality of Irregular Updates with Hardware Assisted Propagation Blocking

Many application domains perform irregular memory updates. Irregular accesses lead to inefficient use of conventional cache hierarchies. To make better use of the cache, we focus on Propagation Blocking (PB), a software-based cache locality optimization initially designed for graph processing applications. We make two contributions in this work. First, we show that PB generalizes beyond graph processing applications to any application with unordered parallelism and irregular memory updates.

Mixed-Proxy Extensions for the NVIDIA PTX Memory Consistency Model

In recent years, there has been a trend towards the use of accelerators and architectural specialization to continue scaling performance in spite of a slowing of Moore’s Law. GPUs have always relied on dedicated hardware for graphics workloads, but modern GPUs now also incorporate compute-domain accelerators such as NVIDIA’s Tensor Cores for machine learning. For these accelerators to be successfully integrated into a general-purpose programming language such as C++ or CUDA, there must be a forward- and backward-compatible API for the functionality they provide.

Exploiting Temporal Data Diversity for Detecting Safety-critical Faults in AV Compute Systems

Silent data corruption caused by random hardware faults in autonomous vehicle (AV) computational elements is a significant threat to vehicle safety. Previous research has explored design diversity, data diversity, and duplication techniques to detect such faults in other safety-critical domains. However, these are challenging to use for AVs in practice due to significant resource overhead and design complexity. We propose, DiverseAV, a low-cost data-diversity-based redundancy technique for detecting safety-critical random hardware faults in computational elements.

Planaria: Dynamic Architecture Fission for Spatial Multi-Tenant Acceleration of Deep Neural Networks

Deep Neural Networks (DNNs) have reinvigorated real-world applications that rely on learning patterns of data and are permeating into different industries and markets. Cloud infrastructure and accelerators that offer INFerence-as-a-Service (INFaaS) have become the enabler of this rather quick and invasive shift in the industry. To that end, mostly acceleratorbased INFaaS (Google’s TPU [1], NVIDIA T4 [2], Microsoft Brainwave [3], etc.) has become the backbone of many real-life applications.

The Locality Descriptor: A Holistic Cross-Layer Abstraction to Express Data Locality in GPUs

Exploiting data locality in GPUs is critical to making more efficient use of the existing caches and the NUMA-based memory hierarchy expected in future GPUs. While modern GPU programming models are designed to explicitly express parallelism, there is no clear explicit way to express data locality—i.e., reusebased locality to make efficient use of the caches, or NUMA

A Case for Richer Cross-layer Abstractions: Bridging the Semantic Gap to Enhance Memory Optimization

This paper makes a case for a new cross-layer interface, Expressive Memory (XMem), to communicate higher-level program semantics from the application to the system software and hardware architecture. XMem provides (i) a exible and extensible abstraction, called an Atom, enabling the application to express key program semantics in terms of how the program accesses data and the attributes of the data itself, and (ii) new cross-layer interfaces to make the expressed higher-level information available to the underlying OS and architecture.

Accelerating Dependent Cache Misses with an Enhanced Memory Controller

On-chip contention increases memory access latency for multicore processors. We identify that this additional latency has a substantial effect on performance for an important class of latency-critical memory operations: those that result in a cache miss and are dependent on data from a prior cache miss. We observe that the number of instructions between the first cache miss and its dependent cache miss is usually small.