A Scalable Architecture for Ordered Parallelism

We present Swarm, a novel architecture that exploits ordered irregular parallelism, which is abundant but hard to mine with current software and hardware techniques. In this architecture, programs consist of short tasks with programmer-specified timestamps. Swarm executes tasks speculatively and out of order, and efficiently speculates thousands of tasks ahead of the earliest active task to uncover ordered parallelism.
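
To make the programming model concrete, here is a minimal sequential sketch of timestamped tasks, assuming a hypothetical enqueue/drain interface; the names and the priority-queue reference semantics are illustrative, not Swarm's actual hardware interface. Tasks carry a programmer-specified timestamp, may enqueue further tasks, and execution must appear equivalent to running them in timestamp order.

    // Sequential reference semantics for timestamped tasks (illustrative only):
    // a Swarm-like machine runs many of these speculatively in parallel and
    // commits them in timestamp order, rather than draining one at a time.
    #include <cstdint>
    #include <functional>
    #include <queue>
    #include <vector>

    struct Task {
        uint64_t timestamp;          // programmer-specified order
        std::function<void()> run;   // short unit of work
    };

    struct ByTimestamp {
        bool operator()(const Task& a, const Task& b) const {
            return a.timestamp > b.timestamp;   // min-heap on timestamp
        }
    };

    std::priority_queue<Task, std::vector<Task>, ByTimestamp> worklist;

    void enqueue(uint64_t ts, std::function<void()> fn) {
        worklist.push({ts, std::move(fn)});
    }

    // Always dispatch the earliest active task next; running a task may
    // enqueue() children with later timestamps.
    void drain() {
        while (!worklist.empty()) {
            Task t = worklist.top();
            worklist.pop();
            t.run();
        }
    }

    int main() {
        enqueue(2, [] { /* appears to run second */ });
        enqueue(1, [] { enqueue(3, [] { /* appears to run last */ }); });
        drain();
    }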

CCICheck: Using μhb Graphs to Verify the Coherence-Consistency Interface

In parallel systems, memory consistency models and cache coherence protocols establish the rules specifying which values will be visible to each instruction of parallel programs. Despite their central importance, verifying their correctness has remained a major challenge, due both to informal or incomplete specifications and to difficulties in scaling verification to cover their operations comprehensively.
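
As a rough illustration of the μhb-graph approach, consider the sketch below, in which the node names and specific orderings are invented for illustration: nodes stand for microarchitectural events (an instruction reaching a particular pipeline or coherence-protocol stage), directed edges stand for required happens-before orderings, and a candidate execution is forbidden exactly when the graph is cyclic.

    // Minimal µhb-style check (illustrative): an execution is unobservable
    // iff the union of its required orderings contains a cycle.
    #include <cstdio>
    #include <functional>
    #include <string>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    using Graph = std::unordered_map<std::string, std::vector<std::string>>;

    // DFS cycle detection over the happens-before edges.
    bool hasCycle(const Graph& g) {
        std::unordered_set<std::string> finished, onStack;
        std::function<bool(const std::string&)> dfs =
            [&](const std::string& v) -> bool {
                if (onStack.count(v)) return true;    // back edge: cycle
                if (finished.count(v)) return false;  // already explored
                onStack.insert(v);
                if (auto it = g.find(v); it != g.end())
                    for (const auto& w : it->second)
                        if (dfs(w)) return true;
                onStack.erase(v);
                finished.insert(v);
                return false;
            };
        for (const auto& entry : g)
            if (dfs(entry.first)) return true;
        return false;
    }

    int main() {
        // Hypothetical two-event example: a coherence-level ordering and a
        // consistency-level ordering that contradict each other form a
        // cycle, so no real execution can exhibit both.
        Graph g = {
            {"St1@CacheLine", {"St2@CacheLine"}},
            {"St2@CacheLine", {"St1@CacheLine"}},
        };
        std::printf("execution %s\n",
                    hasCycle(g) ? "forbidden (cyclic)" : "observable (acyclic)");
    }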

Measuring the Radiation Reliability of SRAM Structures in GPUs Designed for HPC

Graphics Processing Units (GPUs) specifically designed for High Performance Computing (HPC) applications require higher reliability than GPUs used for graphics rendering or gaming. Particular attention should be given to GPU memory structures, because these components have been shown to be the most vulnerable across a variety of codes. This paper describes a test framework to assess the neutron sensitivity of GPU caches and register files. It also presents results from an extensive radiation test campaign performed at LANSCE in Los Alamos, New Mexico.

Scaling Irregular Applications through Data Aggregation and Software Multithreading

Emerging applications in areas such as bioinformatics, data analytics, semantic databases, and knowledge discovery employ datasets ranging from tens to hundreds of terabytes. Currently, only distributed-memory clusters have enough aggregate memory to enable in-memory processing of datasets of this size. However, in addition to their large sizes, the data structures used by these new application classes are usually characterized by unpredictable and fine-grained accesses: i.e., they exhibit irregular behavior.
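
The aggregation idea can be sketched in a few lines, with the caveat that every name below is hypothetical and a real runtime is far more elaborate: fine-grained accesses destined for the same remote node are coalesced into per-node buffers and shipped in bulk, amortizing network overhead, while software multithreading switches to other lightweight threads to hide the latency of in-flight batches.

    // Toy sketch of per-node aggregation of fine-grained remote accesses.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    constexpr size_t kBatchSize = 1024;   // tune to the network's sweet spot

    struct RemoteRef { uint64_t addr; uint64_t value; };

    class Aggregator {
        std::vector<std::vector<RemoteRef>> buffers_;   // one buffer per node
    public:
        explicit Aggregator(size_t nodes) : buffers_(nodes) {}

        // Coalesce a small remote write; ship the batch once it fills.
        void put(size_t node, uint64_t addr, uint64_t value) {
            buffers_[node].push_back({addr, value});
            if (buffers_[node].size() >= kBatchSize) flush(node);
        }

        // Stand-in for one bulk network transfer of the whole batch; a real
        // runtime would hand this to the communication layer and yield to
        // another lightweight thread until the transfer completes.
        void flush(size_t node) {
            buffers_[node].clear();
        }
    };

    int main() {
        Aggregator agg(4);                              // pretend 4-node cluster
        for (uint64_t i = 0; i < 5000; ++i)
            agg.put(i % 4, /*addr=*/i * 8, /*value=*/i);
        for (size_t n = 0; n < 4; ++n) agg.flush(n);    // drain remainders
    }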

A Comparative Analysis of Microarchitecture Effects on CPU and GPU Memory System Behavior

While heterogeneous CPU/GPU systems have traditionally been implemented on separate chips, each with its own private DRAM, heterogeneous processors are now integrating these different core types on the same die with access to a common physical memory. Further, emerging heterogeneous CPU-GPU processors promise tighter coupling between core types via a unified virtual address space and cache coherence.

Arbitrary Modulus Indexing

Modern high-performance processors require memory systems that can provide access to data at a rate well matched to the processor's computation rate. Common to such systems is the organization of memory into local high-speed memory banks that can be accessed in parallel. Associative lookup of values is made efficient through indexing rather than associative memories. These techniques lose effectiveness when data locations are not mapped uniformly to the banks or cache locations, leading to bottlenecks that arise from excess demand on a subset of locations.
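
A small worked example shows the problem (the paper's arbitrary-modulus hardware itself is not reproduced here): with a power-of-two bank count and bank = address mod 8, a stride-8 access stream lands entirely in one bank, while a modulus co-prime to the stride, such as 7, spreads the same stream evenly.

    // Bank-conflict demonstration: power-of-two vs. co-prime modulus.
    #include <cstdio>

    int main() {
        const int kAccesses = 56, kStride = 8;
        int hits8[8] = {0}, hits7[7] = {0};
        for (int i = 0; i < kAccesses; ++i) {
            int addr = i * kStride;
            ++hits8[addr % 8];   // all 56 accesses land in bank 0
            ++hits7[addr % 7];   // 8i mod 7 = i mod 7: 8 accesses per bank
        }
        for (int b = 0; b < 8; ++b) printf("bank%%8[%d] = %d\n", b, hits8[b]);
        for (int b = 0; b < 7; ++b) printf("bank%%7[%d] = %d\n", b, hits7[b]);
    }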

21st Century Digital Design Tools

Most chips today are designed with 20th-century CAD tools. These tools, and the abstractions they are based on, were originally intended to handle designs of millions of gates or fewer. They are not up to the task of handling today's billion-gate designs. The result is months of delay and considerable labor between final RTL and tapeout. Surprises in timing closure, global congestion, and power consumption are common. Even taking an existing design to a new process node is a time-consuming and laborious process.

Convergence and Scalarization for Data-Parallel Architectures

Modern throughput processors such as GPUs achieve high performance and efficiency by exploiting data parallelism in application kernels expressed as threaded code. One drawback of this approach compared to conventional vector architectures is redundant execution of instructions that are common across multiple threads, resulting in energy inefficiency due to excess instruction dispatch, register file accesses, and memory operations.
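
The redundancy is easy to see in a sketch, with a lane loop standing in for a SIMT thread group and all names illustrative: a value every thread computes identically, being convergent and thread-invariant, can be hoisted to a single scalar operation, leaving only the genuinely per-thread part in the parallel code.

    // Scalarization illustration: hoisting a thread-invariant computation.
    #include <cstdio>

    constexpr int kLanes = 32;

    int main() {
        int groupId = 7;
        int table[16] = {0};
        table[groupId % 16] = 100;

        // Threaded form: every lane redundantly loads the same value.
        int perLane[kLanes];
        for (int lane = 0; lane < kLanes; ++lane)
            perLane[lane] = table[groupId % 16] + lane;  // only "+ lane" varies

        // Scalarized form: one scalar access, per-lane work kept parallel.
        int base = table[groupId % 16];
        for (int lane = 0; lane < kLanes; ++lane)
            if (perLane[lane] != base + lane) return 1;

        printf("scalarized result matches threaded result\n");
        return 0;
    }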

Visibility Algorithms for Dynamic Dependence Analysis and Distributed Coherence

Implicitly parallel programming systems must solve the joint problems of dependence analysis and coherence to ensure apparently-sequential semantics for applications run on distributed memory machines. Solving these problems in the presence of data-dependent control flow and arbitrary aliasing is a challenge that most existing systems eschew by compromising the expressivity of their programming models and/or the performance of their implementations. We demonstrate a general class of solutions to these problems via a reduction to the visibility problem from computer graphics.
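
As a toy rendition of that reduction (a 1-D framing invented here for illustration, not the paper's formulation), think of each write as painting an interval of a field and of a later read as a viewer: the read depends on exactly those writes still visible over its interval, i.e., not fully overpainted by newer writes, which a front-to-back traversal computes directly.

    // Dependence analysis as 1-D visibility (illustrative framing).
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Write { int id, lo, hi; };   // task id and the [lo, hi) range written

    // Scan writes newest-to-oldest over the read range, like a front-to-back
    // visibility traversal: a write contributes a dependence iff some part of
    // its interval is not yet covered by a newer write.
    std::vector<int> visibleDeps(const std::vector<Write>& writes, int lo, int hi) {
        std::vector<int> deps;
        std::vector<bool> covered(hi - lo, false);
        for (auto it = writes.rbegin(); it != writes.rend(); ++it) {
            bool contributes = false;
            for (int x = std::max(lo, it->lo); x < std::min(hi, it->hi); ++x)
                if (!covered[x - lo]) { covered[x - lo] = true; contributes = true; }
            if (contributes) deps.push_back(it->id);
        }
        return deps;
    }

    int main() {
        std::vector<Write> w = {{1, 0, 8}, {2, 4, 12}};   // task 2 is newer
        for (int id : visibleDeps(w, 0, 12)) printf("depends on task %d\n", id);
        // Task 2 hides task 1 over [4,8); task 1 remains visible over [0,4).
    }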