Research

SASSIFI: Evaluating Resilience of GPU Applications

As GPUs become more pervasive in both scalable high-performance computing systems and safety-critical embedded systems, evaluating and analyzing their resilience will grow increasingly important. As soft errors, such as those caused by high-energy particle strikes, form an important fraction of in-field hardware errors, GPU designers must develop tools and techniques to understand the effect of these soft errors on applications.


Efficient Control and Communication Paradigms for Coarse-Grained Spatial Architectures

Recently there has been interest in exploring the acceleration of non-vectorizable workloads with spatially-programmed architectures that are designed to efficiently exploit pipeline parallelism. Such an architecture faces two main problems: A) how to efficiently control each processing element (PE) in the system, and B) how to facilitate inter-PE communication without the overheads of traditional shared-memory coherent memory. In this paper, we explore solving these problems using triggered instructions and latency-insensitive channels.


Locality-Driven Dynamic GPU Cache Bypassing

This paper presents novel cache optimizations for massively parallel, throughput-oriented architectures like GPUs. L1 data caches (L1 D-caches) are critical resources for providing high-bandwidth and low-latency data accesses. However, the high number of simultaneous requests from single-instruction multiple-thread (SIMT) cores makes the limited capacity of L1 D-caches a performance and energy bottleneck, especially for memory-intensive applications.


Scavenger: Automating the Construction of Application-Optimized Memory Hierarchies

High-level abstractions separate algorithm design from platform implementation, allowing programmers to focus on algorithms while building increasingly complex systems. This separation also provides system programmers and compilers an opportunity to optimize platform services for each application. In FPGAs, this platform-level malleability extends to the memory system: unlike general-purpose processors, in which memory hardware is fixed at design time, the capacity, associativity, and topology of FPGA memory systems may all be tuned to improve application performance.


GPU Computing Pipeline Inefficiencies and Optimization Opportunities in Heterogeneous CPU-GPU Processors

Emerging heterogeneous CPU-GPU processors have introduced unified memory spaces and cache coherence. CPU and GPU cores will be able to concurrently access the same memories, eliminating memory copy overheads and potentially changing the application-level optimization targets. To date, little is known about how developers may organize new applications to leverage the available, finer-grained communication in these processors.


A Fast and Accurate Analytical Technique to Compute the AVF of Sequential Bits in a Processor

The rate of particle-induced soft errors in a processor increases in proportion to the number of bits. This soft error rate (SER) can limit the performance of a system by placing an effective limit on the number of cores, nodes, or clusters. The vulnerability of bits in a processor to soft errors can be represented by their architectural vulnerability factor (AVF), defined as the probability that a bit corruption results in a user-visible error.


A Scalable Architecture for Ordered Parallelism

We present Swarm, a novel architecture that exploits ordered irregular parallelism, which is abundant but hard to mine with current software and hardware techniques. In this architecture, programs consist of short tasks with programmer-specified timestamps. Swarm executes tasks speculatively and out of order, and efficiently speculates thousands of tasks ahead of the earliest active task to uncover ordered parallelism.


CCICheck: Using μhb Graphs to Verify the Coherence-Consistency Interface

In parallel systems, memory consistency models and cache coherence protocols establish the rules specifying which values will be visible to each instruction of parallel programs. Despite their central importance, verifying their correctness has remained a major challenge, due both to informal or incomplete specifications and to difficulties in scaling verification to cover their operations comprehensively.


Measuring the Radiation Reliability of SRAM Structures in GPUs Designed for HPC

Graphics Processing Units specifically designed for High Performance Computing applications require higher reliability than GPUs used for graphics rendering or gaming. Particular attention should be given to GPU memory structures because these components have been shown to be the most vulnerable for various codes. This paper describes a test framework to assess the neutron sensitivity of GPU caches and register files. It also presents results from an extensive radiation test campaign that was performed at LANSCE in Los Alamos, New Mexico.


Scaling Irregular Applications through Data Aggregation and Software Multithreading

Emerging applications in areas such as bioinformatics, data analytics, semantic databases, and knowledge discovery employ datasets from tens to hundreds of terabytes. Currently, only distributed memory clusters have enough aggregate space to enable in-memory processing of datasets of this size. However, in addition to large sizes, the data structures used by these new application classes are usually characterized by unpredictable and fine-grained accesses; that is, they exhibit irregular behavior.

