Single-pass Parallel Prefix Scan with Decoupled Look-back

We describe a work-efficient, communication-avoiding, single-pass method for the parallel computation of prefix scan. When consuming input from memory, our algorithm requires only ~2n data movement: n inputs are read, n outputs are written. Our method embodies a decoupled look-back strategy that performs redundant work to dissociate local computation from the latencies of global prefix propagation. Implemented by the CUB library of parallel primitives for GPU architectures, our parallel prefix scan achieves throughput approaching that of copy operations.
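
The look-back idea can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not CUB's implementation: the function name, the tile size, the status letters ('A' for "local aggregate available", 'P' for "inclusive prefix available"), and the `order` parameter that mimics tiles finishing out of order are all hypothetical choices made for illustration.

```python
from itertools import accumulate


def decoupled_lookback_scan(data, tile_size=4, order=None):
    """Inclusive prefix sum computed tile by tile (sequential simulation)."""
    tiles = [data[i:i + tile_size] for i in range(0, len(data), tile_size)]
    n = len(tiles)
    status = [None] * n      # 'A' = aggregate only, 'P' = inclusive prefix known
    aggregate = [0] * n      # sum of the tile's own elements
    inclusive = [0] * n      # sum of everything up to and including the tile
    out = [None] * n

    # Step 1: every tile reduces its own data and publishes the aggregate.
    # Tile 0 has no predecessor, so its aggregate already is its prefix.
    for t, tile in enumerate(tiles):
        aggregate[t] = sum(tile)
        if t == 0:
            inclusive[t] = aggregate[t]
            status[t] = 'P'
        else:
            status[t] = 'A'

    # Step 2: tiles resolve their exclusive prefix in an arbitrary order.
    for t in (order if order is not None else range(n)):
        exclusive = 0
        p = t - 1
        while p >= 0:
            if status[p] == 'P':
                # A finished predecessor ends the look-back early.
                exclusive += inclusive[p]
                break
            # Otherwise re-add its aggregate (redundant work) and keep going.
            exclusive += aggregate[p]
            p -= 1
        inclusive[t] = exclusive + aggregate[t]
        status[t] = 'P'
        out[t] = [exclusive + x for x in accumulate(tiles[t])]

    return [x for tile_out in out for x in tile_out]
```

Passing `order=[2, 1, 0]` on a three-tile input forces the last tile to walk back over an unfinished predecessor, which is the situation where the redundant additions replace waiting on a serial chain of prefixes.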

Stack-Based Algorithms for HDR Capture and Reconstruction

High-dynamic-range (HDR) images can be created with standard camera hardware by capturing and combining multiple pictures, each sampling a different segment of the irradiance distribution of a scene. This seemingly straightforward process involves several important steps, which will be the focus of this chapter. We start by examining the problem of selecting the set of exposures that properly measures the full dynamic range of a particular scene, a process known as metering for HDR.
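
As a rough illustration of the combining step, the sketch below merges the values one pixel takes across several exposures into a single relative irradiance estimate. It assumes a linear sensor response and a simple hat-shaped weight that discounts under- and over-exposed samples; the function name, the weight, and the 8-bit range are assumptions for illustration and are not taken from the chapter.

```python
def merge_exposures(pixels, times, z_max=255):
    """Estimate relative irradiance for one pixel from several exposures.

    pixels: recorded values of the same pixel, one per exposure
    times:  matching exposure times in seconds
    Returns None when every sample is fully dark or saturated.
    """
    def weight(z):
        # Hat function: highest trust for mid-range values, zero at 0 and z_max.
        return min(z, z_max - z)

    # Linear response assumed: value is proportional to irradiance * time.
    numerator = sum(weight(z) * z / t for z, t in zip(pixels, times))
    denominator = sum(weight(z) for z in pixels)
    return numerator / denominator if denominator > 0 else None
```

A saturated sample receives zero weight, so it drops out of the estimate; this is why the exposure set has to be chosen so that every part of the scene is well exposed in at least one picture.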

An Analytical Model for Hardened Latch Selection and Exploration

Hardened flip-flops and latches are designed to be resilient to soft errors, maintaining high system reliability in the presence of energetic radiation. The wealth of different hardened designs (with varying protection levels) and the probabilistic nature of reliability complicate the choice of which hardened storage element to substitute where. This paper develops an analytical model for hardened latch and flip-flop design space exploration. It is shown that the best hardened design depends strongly on the target protection level and the chip that is being protected.

All-Inclusive ECC: Thorough End-to-End Protection for Reliable Computer Memory

Increasing transfer rates and decreasing I/O voltage levels make signals more vulnerable to transmission errors. While the data in computer memory are well-protected by modern error checking and correcting (ECC) codes, the clock, control, command, and address (CCCA) signals are weakly protected or even unprotected such that transmission errors leave serious gaps in data-only protection. This paper presents All-Inclusive ECC (AIECC), a memory protection scheme that leverages and augments data ECC to also thoroughly protect CCCA signals.

S-Step and Communication-Avoiding Iterative Methods

In this paper we give an overview of the s-step Conjugate Gradient (CG) method and develop a novel formulation of the s-step BiConjugate Gradient Stabilized (BiCGStab) iterative method. We also show how to add preconditioning to both of these s-step schemes, and we explain their relationship to the standard, block, and communication-avoiding counterparts.

Parallel Spectral Graph Partitioning

In this paper we develop a novel parallel spectral partitioning method that takes advantage of an efficient implementation of a preconditioned eigenvalue solver and a k-means algorithm on the GPU. We showcase the performance of our scheme against standard spectral techniques. We also use it to compare the ratio cut and normalized cut cost functions often used to measure the quality of graph partitioning.
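
For readers unfamiliar with the two cost functions, the following minimal sketch bisects a small graph spectrally and evaluates both costs. It deliberately swaps in much simpler techniques than the ones named above: a shifted power iteration instead of a preconditioned eigensolver, and a sign split of the resulting vector instead of k-means, all on the CPU. The function names and the dense adjacency-list input are hypothetical.

```python
def fiedler_partition(adj, iters=500):
    """Bisect a graph by the sign of an approximate Fiedler vector.

    adj is a dense symmetric 0/1 adjacency matrix given as a list of lists.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    shift = 2 * max(deg)                 # upper bound on Laplacian eigenvalues
    v = [float(i) for i in range(n)]     # deterministic, non-constant start
    for _ in range(iters):
        # Remove the constant component (eigenvector for eigenvalue 0).
        mean = sum(v) / n
        v = [x - mean for x in v]
        # Apply (shift*I - L) with L = D - A; its dominant remaining
        # eigenvector corresponds to the second-smallest eigenvalue of L.
        w = [(shift - deg[i]) * v[i] + sum(adj[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in w]
    return [0 if x < 0 else 1 for x in v]


def cut_costs(adj, labels):
    """Return (ratio cut, normalized cut) for a two-way partition."""
    n = len(adj)
    cut = sum(adj[i][j] for i in range(n) for j in range(i + 1, n)
              if labels[i] != labels[j])
    size = [labels.count(s) for s in (0, 1)]
    vol = [sum(sum(adj[i]) for i in range(n) if labels[i] == s) for s in (0, 1)]
    ratio_cut = cut * (1 / size[0] + 1 / size[1])    # balances vertex counts
    normalized_cut = cut * (1 / vol[0] + 1 / vol[1])  # balances degree sums
    return ratio_cut, normalized_cut
```

The only difference between the two costs is the balancing term: ratio cut divides by the number of vertices in each part, while normalized cut divides by the sum of their degrees.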

Accelerated Generative Models for 3D Point Cloud Data

Finding meaningful, structured representations of 3D point cloud data (PCD) has become a core task for spatial perception applications. In this paper we introduce a method for constructing compact generative representations of PCD at multiple levels of detail. As opposed to deterministic structures such as voxel grids or octrees, we propose probabilistic subdivisions of the data through local mixture modeling, and show how these subdivisions can provide a maximum likelihood segmentation of the data.

Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3D Convolutional Neural Networks

Automatic detection and classification of dynamic hand gestures in real-world systems intended for human-computer interaction is challenging because: 1) there is a large diversity in how people perform gestures, making detection and classification difficult; 2) the system must work online in order to avoid noticeable lag between performing a gesture and its classification; in fact, a negative lag (classification even before the gesture is finished) is desirable, as the feedback to the user can then be truly instantaneous.

Real-time Rendering of Procedural Multiscale Materials

We present a stable shading method and a procedural shading model that enable real-time rendering of sub-pixel glints and anisotropic microdetails resulting from irregular microscopic surface structure, simulating a rich spectrum of appearances ranging from sparkling to brushed materials. We introduce a biscale Normal Distribution Function (NDF) for microdetails that provides convenient artistic control over both the global appearance and the appearance of the individual microdetail shapes, while efficiently generating procedural details.

AmgX: A Library for GPU Accelerated Algebraic Multigrid and Preconditioned Iterative Methods

The solution of large sparse linear systems arises in many applications, such as computational fluid dynamics and oil reservoir simulation. In realistic cases the matrices are often so large that they require large-scale distributed parallel computing to obtain the solution of interest in a reasonable time. In this paper we discuss the design and implementation of the AmgX library, which provides drop-in GPU acceleration of distributed algebraic multigrid and preconditioned iterative methods.