GPU Computing

The graphics processing unit (GPU) has become an integral part of today's mainstream computing systems. Over the past six years, there has been a marked increase in the performance and capabilities of GPUs. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor whose peak arithmetic throughput and memory bandwidth substantially outpace those of its CPU counterpart.

Real-time Editing and Relighting of Homogeneous Translucent Materials

Existing techniques for fast, high-quality rendering of translucent materials often fix BSSRDF parameters at precomputation time. We present a novel method for accurate rendering and relighting of translucent materials that also enables real-time editing and manipulation of homogeneous diffuse BSSRDFs. We first apply principal component analysis (PCA) to diffuse multiple scattering to derive a compact basis set consisting of only twelve 1D functions. We find that this small basis set is accurate enough to approximate a general diffuse scattering profile.

Robust Stereo with Flash and No-flash Image Pairs

We propose a new stereo technique using a pair of flash and no-flash stereo images that is both efficient and robust in handling occlusion boundaries. Our work is motivated by the observation that the brightness variations introduced by the flash can provide a robust cue for establishing stereo matches at occlusion boundaries. This photometric cue is computed per pixel, and although it is not robust enough on its own to resolve depth reliably, it provides a new discriminant to support patch-based stereo matching algorithms.

Scalable Ambient Obscurance

This paper presents a set of architecture-aware performance and integration improvements for a recent screen-space ambient obscurance algorithm. These improvements collectively produce a 7x performance increase at 2560x1600, generalize the algorithm to both forward and deferred renderers, and eliminate the radius- and scene-dependence of the previous algorithm to provide a hard real-time guarantee of fixed execution time.

Understanding the Efficiency of Ray Traversal on GPUs - Kepler and Fermi Addendum

This technical report is an addendum to the HPG2009 paper "Understanding the Efficiency of Ray Traversal on GPUs", and provides citable performance results for Kepler and Fermi architectures. We explain how to optimize the traversal and intersection kernels for these newer platforms, and what the important architectural limiters are.

Relational Algorithms for Multi-Bulk-Synchronous Processors

Relational databases remain an important application domain for organizing and analyzing the massive volume of data generated as sensor technology, retail and inventory transactions, social media, computer vision, and new fields continue to evolve. At the same time, processor architectures are beginning to shift towards hierarchical and parallel architectures employing throughput-optimized memory systems, lightweight multi-threading, and Single-Instruction Multiple-Data (SIMD) core organizations.

Detecting Regions of Interest in Dynamic Scenes with Camera Motions

We present a method to detect regions of interest in moving-camera views of dynamic scenes with multiple moving objects. We start by extracting a global motion tendency that reflects the scene context by tracking the movements of objects in the scene. We then use Gaussian process regression to represent the extracted motion tendency as a stochastic vector field. The generated stochastic field is robust to noise and can handle video from an uncalibrated moving camera. We use the stochastic field to predict important future regions of interest as the scene evolves dynamically.

Maximizing Parallelism in the Construction of BVHs, Octrees, and k-d Trees

A number of methods for constructing bounding volume hierarchies and point-based octrees on the GPU are based on the idea of ordering primitives along a space-filling curve. A major shortcoming of these methods is that they construct the levels of the tree sequentially, which limits the amount of parallelism they can achieve. We present a novel approach that improves scalability by constructing the entire tree in parallel. Our main contribution is an in-place algorithm for constructing binary radix trees, which we use as a building block for other types of trees.
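
The space-filling-curve ordering mentioned above is commonly realized with Morton (Z-order) codes. The sketch below is a minimal, sequential Python illustration of that ordering step only; it is not the in-place radix tree construction the abstract describes. The function names `morton3d` and `sort_by_morton`, the 10-bit quantization, and the assumption that points lie in the unit cube are all choices made for this example.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of integer coordinates x, y, z.

    Bit i of x lands at position 3*i + 2, bit i of y at 3*i + 1,
    and bit i of z at 3*i, so x is the most significant axis.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i + 2)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i)
    return code


def sort_by_morton(points, bits=10):
    """Sort 3D points (assumed to lie in [0, 1]^3) along the Z-order curve."""
    scale = (1 << bits) - 1

    def key(p):
        # Quantize each coordinate to an integer grid cell, clamping to range.
        q = [min(max(int(c * scale), 0), scale) for c in p]
        return morton3d(q[0], q[1], q[2], bits=bits)

    return sorted(points, key=key)


if __name__ == "__main__":
    pts = [(0.9, 0.9, 0.9), (0.1, 0.1, 0.1), (0.1, 0.9, 0.1)]
    print(sort_by_morton(pts))
```

Points that are close along the sorted order tend to be close in space, which is why a tree built over the sorted codes groups nearby primitives together.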

Incomplete-LU and Cholesky Factorization in the Preconditioned Iterative Methods on the GPU

A novel algorithm for computing the incomplete-LU and Cholesky factorization with 0 fill-in on a graphics processing unit (GPU) is proposed. It implements the incomplete factorization of the given matrix in two phases. First, the symbolic analysis phase builds a dependency graph based on the matrix sparsity pattern and groups the independent rows into levels. Second, the numerical factorization phase obtains the resulting lower and upper sparse triangular factors by iterating sequentially across the constructed levels.
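
The symbolic analysis phase described above can be sketched as a level-scheduling pass over the sparsity pattern. The Python below is a minimal sequential illustration, not the paper's GPU implementation. It assumes that row i depends on every earlier row j < i for which entry (i, j) is nonzero; the function name `build_levels` and the list-of-column-indices input format are hypothetical.

```python
def build_levels(rows):
    """Group rows into levels so that rows within a level are independent.

    rows[i] is the list of column indices of the nonzeros in row i.
    Row i is assumed to depend on each row j < i with a nonzero at (i, j).
    Returns a list of levels, each a list of row indices.
    """
    n = len(rows)
    level = [0] * n
    for i in range(n):
        deps = [j for j in rows[i] if j < i]
        # A row sits one level below its deepest dependency.
        level[i] = (1 + max(level[j] for j in deps)) if deps else 0

    levels = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lvl in enumerate(level):
        levels[lvl].append(i)
    return levels


if __name__ == "__main__":
    # Sparsity pattern of a small 5x5 matrix (lower-triangular part plus diagonal).
    pattern = [[0], [1], [0, 1, 2], [2, 3], [4]]
    print(build_levels(pattern))
```

In the numerical phase, the levels would be visited in order, while all rows inside one level could be processed in parallel because none of them depends on another.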