Co-Designing Accelerators and SoC Interfaces Using gem5-Aladdin

Increasing demand for power-efficient, high-performance computing has spurred a growing number and diversity of hardware accelerators in mobile and server systems on chip (SoCs). This paper makes the case that co-designing an accelerator's microarchitecture with the system in which it will operate is critical to building balanced, efficient accelerators.
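
As a rough illustration of why such co-design matters, the hypothetical C++ sketch below sweeps a toy design space in which the best SoC interface choice (coherent cache versus DMA) can flip as the accelerator's datapath widens. The parameter names and the cost model are invented for illustration and are not gem5-Aladdin's actual configuration interface; a cycle-level simulation would replace the stand-in estimate.

    #include <cstdio>

    // Hypothetical design point: accelerator datapath width paired with
    // an SoC-interface choice (coherent cache access vs. DMA).
    struct DesignPoint {
        int lanes;        // parallel datapath lanes in the accelerator
        bool useCache;    // true: coherent cache interface, false: DMA
    };

    // Stand-in cost model (NOT gem5-Aladdin's): a real co-design flow
    // would invoke a cycle-level simulation here instead.
    static double estimateCycles(const DesignPoint& d, int workItems) {
        double compute = static_cast<double>(workItems) / d.lanes;
        double data = d.useCache ? 0.8 * workItems        // demand fetches
                                 : 200 + 0.2 * workItems; // DMA setup + burst
        return compute + data;
    }

    int main() {
        const int workItems = 4096;
        for (int lanes : {2, 4, 8, 16}) {
            for (bool useCache : {false, true}) {
                DesignPoint d{lanes, useCache};
                std::printf("lanes=%2d  %-5s  est. cycles=%.0f\n",
                            lanes, useCache ? "cache" : "dma",
                            estimateCycles(d, workItems));
            }
        }
    }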

EIE: Efficient Inference Engine on Compressed Deep Neural Network

State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware can accelerate the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations and dominates power consumption.
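
The compression EIE exploits combines pruning (sparsity) with weight sharing (a small codebook). Below is a minimal C++ sketch of sparse matrix-vector multiply over such a representation; the field widths and encoding are simplified stand-ins for the paper's exact format, but they show why only a few bits per nonzero weight ever need to leave memory, and why zero activations cost nothing.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Each nonzero weight is stored as a small codebook index plus a
    // relative row offset, so the dense matrix is never fetched.
    struct CompressedColumn {
        std::vector<uint8_t> codebookIdx; // index into shared weight table
        std::vector<uint8_t> rowOffset;   // rows skipped since last nonzero
    };

    void spmv(const std::vector<CompressedColumn>& cols,
              const float codebook[16],
              const std::vector<float>& x,
              std::vector<float>& y) {
        for (size_t j = 0; j < cols.size(); ++j) {
            if (x[j] == 0.0f) continue;   // skip zero activations entirely
            size_t row = 0;
            const auto& c = cols[j];
            for (size_t k = 0; k < c.codebookIdx.size(); ++k) {
                row += c.rowOffset[k];    // decode the relative index
                y[row] += codebook[c.codebookIdx[k]] * x[j];
            }
        }
    }

    int main() {
        // 3x2 toy matrix with two nonzeros: w[0][0]=0.5, w[2][1]=-1.0.
        float codebook[16] = {0.5f, -1.0f};
        std::vector<CompressedColumn> cols = {
            {{0}, {0}},   // column 0: nonzero at row 0, codebook entry 0
            {{1}, {2}},   // column 1: nonzero at row 2, codebook entry 1
        };
        std::vector<float> x = {2.0f, 3.0f}, y(3, 0.0f);
        spmv(cols, codebook, x, y);
        for (float v : y) std::printf("%.1f\n", v);  // 1.0, 0.0, -3.0
    }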

In-Memory Graph Databases for Web-Scale Data

This software stack relies primarily on graph-based methods to implement scalable Resource Description Framework (RDF) databases on top of commodity clusters, providing an inexpensive way to extract meaning from large volumes of heterogeneous data.
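
A minimal sketch of the graph view assumed here: each RDF (subject, predicate, object) triple becomes a labeled edge, and a query is a graph traversal rather than a relational join. The names and data below are illustrative only.

    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    using Edge = std::pair<std::string, std::string>;  // (predicate, object)

    int main() {
        std::unordered_map<std::string, std::vector<Edge>> graph;

        auto addTriple = [&](const std::string& s, const std::string& p,
                             const std::string& o) {
            graph[s].push_back({p, o});   // subject -> labeled adjacency
        };

        addTriple("alice", "knows",   "bob");
        addTriple("alice", "worksAt", "acme");
        addTriple("bob",   "worksAt", "acme");

        // Query: who does "alice" know? (a one-hop traversal)
        for (const auto& [pred, obj] : graph["alice"])
            if (pred == "knows")
                std::cout << "alice knows " << obj << "\n";
    }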

High Performing Cache Hierarchies for Server Workloads -- Relaxing Inclusion to Capture the Latency Benefits of Exclusive Caches

Increasing transistor density enables adding more on-die cache real estate. However, devoting more space to the shared last-level cache (LLC) shifts the latency bottleneck from memory access to shared cache access. As a result, applications whose working sets do not fit in the smaller private caches spend a large fraction of their execution time waiting on shared cache accesses. To address this problem, this paper investigates enlarging the private caches in the hierarchy rather than the shared LLC.
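
The toy C++ sketch below contrasts the fill and eviction flow of inclusive and exclusive two-level hierarchies; real caches track sets, ways, and coherence state, but the sketch shows where capacity goes: an exclusive hierarchy holds each block in exactly one level, so enlarging the private cache does not waste LLC space on duplicates.

    #include <cstdio>
    #include <unordered_set>

    struct TwoLevel {
        std::unordered_set<long> priv, llc;  // private cache and shared LLC
        bool exclusive;

        void access(long addr) {
            if (priv.count(addr)) return;           // private hit
            if (llc.count(addr)) {
                if (exclusive) llc.erase(addr);     // move, don't copy:
                priv.insert(addr);                  // block lives in one level
            } else {
                if (!exclusive) llc.insert(addr);   // inclusive: fill both
                priv.insert(addr);
            }
        }

        void evictPrivate(long addr) {
            if (priv.erase(addr) && exclusive)
                llc.insert(addr);                   // victim fills the LLC
        }
    };

    int main() {
        TwoLevel incl{{}, {}, false}, excl{{}, {}, true};
        for (long a : {1L, 2L, 3L}) { incl.access(a); excl.access(a); }
        std::printf("inclusive: priv=%zu llc=%zu (duplicated blocks)\n",
                    incl.priv.size(), incl.llc.size());
        std::printf("exclusive: priv=%zu llc=%zu (unique capacity)\n",
                    excl.priv.size(), excl.llc.size());
    }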

SASSIFI: Evaluating Resilience of GPU Applications

As GPUs become more pervasive in both scalable high-performance computing systems and safety-critical embedded systems, evaluating and analyzing their resilience will grow increasingly important. Because soft errors, such as those caused by high-energy particle strikes, account for a significant fraction of in-field hardware errors, GPU designers must develop tools and techniques to understand the effect of these errors on applications.
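
A minimal host-side sketch of the fault-injection methodology appears below, with a plain C++ function standing in for a GPU kernel: flip one bit of an intermediate value, rerun, and classify the outcome against a golden result. SASSIFI itself instruments real kernels at the SASS (GPU assembly) level; this toy only conveys the idea.

    #include <cstdint>
    #include <cstdio>
    #include <random>

    uint32_t computation(uint32_t x, int flipBit) {
        uint32_t t = x * 3 + 7;                 // some intermediate value
        if (flipBit >= 0) t ^= (1u << flipBit); // injected transient error
        return t / 2;
    }

    int main() {
        std::mt19937 rng(42);
        std::uniform_int_distribution<int> bit(0, 31);
        const uint32_t golden = computation(100, -1);   // fault-free run

        int masked = 0, sdc = 0;
        const int trials = 1000;
        for (int i = 0; i < trials; ++i) {
            uint32_t out = computation(100, bit(rng));
            if (out == golden) ++masked;   // error never became visible
            else ++sdc;                    // silent data corruption
        }
        std::printf("masked: %d / %d, SDC: %d / %d\n",
                    masked, trials, sdc, trials);
    }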

Efficient Control and Communication Paradigms for Coarse-Grained Spatial Architectures

Recently there has been interest in accelerating non-vectorizable workloads with spatially programmed architectures designed to efficiently exploit pipeline parallelism. Such an architecture faces two main problems: A) how to efficiently control each processing element (PE) in the system, and B) how to facilitate inter-PE communication without the overheads of traditional coherent shared memory. In this paper, we explore solving these problems using triggered instructions and latency-insensitive channels.
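
The C++ sketch below models both ideas under simplifying assumptions: a latency-insensitive channel as a FIFO whose occupancy, not a fixed latency, gates execution, and a triggered PE as a set of guarded rules with no program counter, each firing whenever its trigger predicate over local state holds.

    #include <cstdio>
    #include <deque>
    #include <functional>
    #include <vector>

    struct Channel {                       // latency-insensitive FIFO
        std::deque<int> q;
        bool canRead()  const { return !q.empty(); }
        int  read()           { int v = q.front(); q.pop_front(); return v; }
        void write(int v)     { q.push_back(v); }
    };

    struct Rule {                          // one "triggered instruction"
        std::function<bool()> trigger;     // guard over channels/predicates
        std::function<void()> action;      // datapath operation to fire
    };

    int main() {
        Channel in, out;
        for (int v : {3, 1, 4, 1, 5}) in.write(v);

        std::vector<Rule> pe = {
            // Fire whenever input data is available: double it and forward.
            { [&]{ return in.canRead(); },
              [&]{ out.write(2 * in.read()); } },
        };

        // Scheduler: each "cycle", fire one ready rule, if any.
        bool fired = true;
        while (fired) {
            fired = false;
            for (auto& r : pe)
                if (r.trigger()) { r.action(); fired = true; break; }
        }
        while (out.canRead()) std::printf("%d ", out.read());
        std::printf("\n");                 // prints: 6 2 8 2 10
    }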

Locality-Driven Dynamic GPU Cache Bypassing

This paper presents novel cache optimizations for massively parallel, throughput-oriented architectures such as GPUs. L1 data caches (L1 D-caches) are critical resources for providing high-bandwidth, low-latency data access. However, the high number of simultaneous requests from single-instruction, multiple-thread (SIMT) cores makes the limited capacity of L1 D-caches a performance and energy bottleneck, especially for memory-intensive applications.
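
One simple way to realize locality-driven bypassing is sketched below with invented parameters (the paper's actual mechanism differs in its details): track observed reuse per address region and route regions with no reuse around the L1, so streaming data cannot evict lines that do have locality.

    #include <cstdio>
    #include <unordered_map>

    struct BypassPredictor {
        std::unordered_map<long, int> reuse;   // region -> re-references seen
        static constexpr long kRegion = 128;   // bytes per tracked region
        static constexpr int  kThreshold = 1;  // min reuse to earn L1 space

        bool shouldBypass(long addr) {
            long region = addr / kRegion;
            int seen = reuse[region]++;        // update on every access
            return seen < kThreshold;          // first touch: assume streaming
        }
    };

    int main() {
        BypassPredictor p;
        long hot = 0x1000, stream = 0x8000;
        for (int i = 0; i < 4; ++i) {
            std::printf("hot    access %d: %s\n", i,
                        p.shouldBypass(hot) ? "bypass" : "cache");
            std::printf("stream access %d: %s\n", i,
                        p.shouldBypass(stream + i * 128) ? "bypass" : "cache");
        }
    }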

Scavenger: Automating the Construction of Application-Optimized Memory Hierarchies

High-level abstractions separate algorithm design from platform implementation, allowing programmers to focus on algorithms while building increasingly complex systems. This separation also provides system programmers and compilers an opportunity to optimize platform services for each application. In FPGAs, this platform-level malleability extends to the memory system: unlike general-purpose processors, in which memory hardware is fixed at design time, the capacity, associativity, and topology of FPGA memory systems may all be tuned to improve application performance.
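
A hypothetical sketch of the tuning knobs such a platform exposes is shown below; the field names and the sizing heuristic are illustrative, not Scavenger's actual interface. On an FPGA these parameters become RTL at synthesis time rather than being fixed in silicon.

    #include <cstdio>

    struct MemoryConfig {
        unsigned capacityKiB;   // built from leftover on-chip BRAM
        unsigned associativity; // higher for conflict-heavy access patterns
        unsigned banks;         // more banks for request-level parallelism
    };

    // Pick a per-application hierarchy from a simple profile.
    MemoryConfig tuneFor(unsigned workingSetKiB, double conflictRate,
                         unsigned portsNeeded, unsigned bramBudgetKiB) {
        MemoryConfig c;
        c.capacityKiB   = workingSetKiB < bramBudgetKiB ? workingSetKiB
                                                        : bramBudgetKiB;
        c.associativity = conflictRate > 0.10 ? 4 : 1; // direct-mapped if cheap
        c.banks         = portsNeeded;
        return c;
    }

    int main() {
        MemoryConfig c = tuneFor(/*workingSet=*/192, /*conflictRate=*/0.15,
                                 /*ports=*/2, /*bramBudget=*/256);
        std::printf("cache: %u KiB, %u-way, %u banks\n",
                    c.capacityKiB, c.associativity, c.banks);
    }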

GPU Computing Pipeline Inefficiencies and Optimization Opportunities in Heterogeneous CPU-GPU Processors

Emerging heterogeneous CPU-GPU processors have introduced unified memory spaces and cache coherence. CPU and GPU cores will be able to concurrently access the same memories, eliminating memory copy overheads and potentially changing the application-level optimization targets. To date, little is known about how developers may organize new applications to leverage the available, finer-grained communication in these processors.
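
The sketch below illustrates the finer-grained communication pattern that unified, coherent memory enables, using a std::thread as a stand-in for the GPU (a real system would launch a kernel): the consumer begins work on each item as soon as it is produced, with no bulk staging copy across a bus.

    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        const int n = 8;
        std::vector<int> data(n);
        std::atomic<int> ready{0};      // fine-grained producer/consumer flag

        std::thread gpu([&] {           // stand-in for a GPU kernel
            int next = 0;
            while (next < n) {
                // Wait until item `next` is published (coherent shared memory).
                while (ready.load(std::memory_order_acquire) <= next) {}
                std::printf("gpu consumed item %d -> %d\n",
                            next, data[next] * 2);
                ++next;                 // no staging copy was needed
            }
        });

        for (int i = 0; i < n; ++i) {   // CPU produces items one at a time
            data[i] = i * i;
            ready.store(i + 1, std::memory_order_release);
        }
        gpu.join();
    }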

A Fast and Accurate Analytical Technique to Compute the AVF of Sequential Bits in a Processor

The rate of particle-induced soft errors in a processor increases in proportion to the number of bits. This soft error rate (SER) can limit a system's performance by placing an effective limit on the number of cores, nodes, or clusters. The vulnerability of bits in a processor to soft errors can be quantified by their architectural vulnerability factor (AVF), defined as the probability that a bit corruption results in a user-visible error.
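
A worked toy example of this definition: a bit's AVF is the fraction of cycles during which it holds state required for architecturally correct execution (ACE), so a corruption at that moment would be user-visible, and a structure's effective soft error rate is its intrinsic circuit-level rate derated by AVF. All numbers below are invented for illustration.

    #include <cstdio>
    #include <vector>

    int main() {
        const double totalCycles = 1e6;
        // ACE residency (in cycles) for each bit of a small structure.
        std::vector<double> aceCycles = {9.0e5, 2.5e5, 1.0e5, 0.0};

        double sumAvf = 0.0;
        for (size_t i = 0; i < aceCycles.size(); ++i) {
            double avf = aceCycles[i] / totalCycles;  // per-bit AVF
            std::printf("bit %zu: AVF = %.2f\n", i, avf);
            sumAvf += avf;
        }
        double structAvf = sumAvf / aceCycles.size();

        // Derate the intrinsic (circuit-level) error rate by the AVF.
        const double rawFitPerBit = 1e-3;             // failures per 1e9 hours
        double effectiveFit = rawFitPerBit * sumAvf;  // summed over all bits
        std::printf("structure AVF = %.3f, effective FIT = %.2e\n",
                    structAvf, effectiveFit);
    }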