Processing Device Arrays with C++ Metaprogramming

In this chapter, I will explain how C++ metaprogramming techniques can be used to simplify the development of CUDA-based libraries. The chapter will walk through the design of a high-level library to process arrays in CUDA. For illustrative purposes, I will describe this library in the context of numerical solutions of partial differential equations. The basic technique can be useful in a wide variety of application domains such as image processing, agent modeling, particle systems, or other data-parallel routines.
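The abstract does not say which metaprogramming techniques the chapter uses. As one common possibility, the sketch below shows expression templates on the host: an array expression such as `a + b + c` is captured as a type and evaluated in a single loop on assignment, without temporaries. The names `Expr`, `Array`, and `Sum` are hypothetical and are not taken from the chapter; a CUDA library would presumably evaluate the expression inside a kernel rather than in a host loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CRTP base: lets operator+ accept any expression type without virtual calls.
template <typename E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Hypothetical host-side array; a device version would own GPU memory instead.
struct Array : Expr<Array> {
    std::vector<double> data;
    explicit Array(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assigning an expression evaluates it element by element in one pass.
    template <typename E>
    Array& operator=(const Expr<E>& e) {
        const E& expr = e.self();
        assert(expr.size() == size());
        for (std::size_t i = 0; i < size(); ++i) data[i] = expr[i];
        return *this;
    }
};

// Node representing lhs + rhs; it stores references and computes lazily.
template <typename L, typename R>
struct Sum : Expr<Sum<L, R>> {
    const L& lhs;
    const R& rhs;
    Sum(const L& l, const R& r) : lhs(l), rhs(r) {}
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

// Builds the expression tree at compile time; no arithmetic happens here.
template <typename L, typename R>
Sum<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return Sum<L, R>(l.self(), r.self());
}
```

With these definitions, `d = a + b + c;` compiles to a single loop over the elements. Because the nodes hold references, an expression should be consumed in the same statement that builds it.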

Simpler and Faster HLBVH with Work Queues

A recently developed algorithm called Hierarchical Linear Bounding Volume Hierarchies (HLBVH) has demonstrated the feasibility of reconstructing the spatial index needed for ray tracing in real time, even in the presence of millions of fully dynamic triangles. In this work we present a simpler and faster variant of HLBVH, where all the complex bookkeeping of prefix sums, compaction and partial breadth-first tree traversal needed for spatial partitioning has been replaced with an elegant pipeline built on top of efficient work queues and binary search.
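The abstract does not describe how binary search is applied. One plausible use, sketched here as an assumption, is locating the split position of a node: when primitives are sorted by Morton code and all codes in a node's range agree on the bits above a given bit, that bit is 0 for a prefix of the range and 1 for the rest, so the boundary can be found by binary search. The function name `find_split` and the 32-bit code width are illustrative choices, and the sketch runs on the host rather than in a CUDA kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the index of the first code in [begin, end) whose `bit` is set,
// or `end` if no code in the range has it set.
// Precondition (assumed): `codes` is sorted ascending and every code in the
// range shares the same bits above `bit`, so the range is partitioned by it.
std::size_t find_split(const std::vector<std::uint32_t>& codes,
                       std::size_t begin, std::size_t end, unsigned bit) {
    assert(begin <= end && end <= codes.size() && bit < 32);
    const std::uint32_t mask = std::uint32_t(1) << bit;
    auto first = codes.begin() + static_cast<std::ptrdiff_t>(begin);
    auto last = codes.begin() + static_cast<std::ptrdiff_t>(end);
    // Binary search for the first element where the predicate becomes false.
    auto it = std::partition_point(first, last, [mask](std::uint32_t c) {
        return (c & mask) == 0;
    });
    return static_cast<std::size_t>(it - codes.begin());
}
```

Each child range can then be split again on the next lower bit. In a queue-based design, the two child ranges would be the work items pushed for the next pass.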

VoxelPipe: A Programmable Pipeline for 3D Voxelization

We present a highly flexible and efficient software pipeline for programmable triangle voxelization. The pipeline, entirely written in CUDA, supports both fully conservative and thin voxelizations; multiple boolean, floating-point, and vector-typed render targets; user-defined vertex and fragment shaders; and a bucketing mode which can be used to generate 3D A-buffers containing the entire list of fragments belonging to each voxel.

Parallel Solution of Sparse Triangular Linear Systems in the Preconditioned Iterative Methods on the GPU

A novel algorithm for solving in parallel a sparse triangular linear system on a graphical processing unit is proposed. It implements the solution of the triangular system in two phases. First, the analysis phase builds a dependency graph based on the matrix sparsity pattern and groups the independent rows into levels. Second, the solve phase obtains the full solution by iterating sequentially across the constructed levels. The solution elements corresponding to each single level are obtained at once in parallel.
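The two phases can be sketched on the host as follows. This is a minimal sequential illustration for a lower-triangular matrix in CSR format, not the GPU implementation from the paper; the struct `CsrLower` and the functions `build_levels` and `solve_lower` are hypothetical names. The inner loop over the rows of one level is the part that a GPU version would run in parallel, since those rows do not depend on each other.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Lower-triangular sparse matrix in CSR format (illustrative layout).
struct CsrLower {
    std::size_t n;                     // number of rows
    std::vector<std::size_t> row_ptr;  // size n + 1
    std::vector<std::size_t> col_idx;  // column of each stored entry
    std::vector<double> val;           // value of each stored entry
};

// Analysis phase: a row's level is one more than the highest level among the
// rows it depends on (its off-diagonal columns). Rows are grouped by level.
std::vector<std::vector<std::size_t>> build_levels(const CsrLower& L) {
    std::vector<std::size_t> level(L.n, 0);
    std::size_t num_levels = 0;
    for (std::size_t i = 0; i < L.n; ++i) {
        std::size_t lvl = 0;
        for (std::size_t k = L.row_ptr[i]; k < L.row_ptr[i + 1]; ++k) {
            const std::size_t j = L.col_idx[k];
            if (j < i) lvl = std::max(lvl, level[j] + 1);
        }
        level[i] = lvl;
        num_levels = std::max(num_levels, lvl + 1);
    }
    std::vector<std::vector<std::size_t>> levels(num_levels);
    for (std::size_t i = 0; i < L.n; ++i) levels[level[i]].push_back(i);
    return levels;
}

// Solve phase: levels are processed in order; rows inside one level are
// independent of each other and could be computed concurrently.
std::vector<double> solve_lower(const CsrLower& L,
                                const std::vector<std::vector<std::size_t>>& levels,
                                const std::vector<double>& b) {
    assert(b.size() == L.n);
    std::vector<double> x(L.n, 0.0);
    for (const auto& rows : levels) {
        for (std::size_t i : rows) {
            double sum = b[i];
            double diag = 1.0;  // used only if the diagonal is not stored
            for (std::size_t k = L.row_ptr[i]; k < L.row_ptr[i + 1]; ++k) {
                const std::size_t j = L.col_idx[k];
                if (j == i) diag = L.val[k];
                else sum -= L.val[k] * x[j];
            }
            x[i] = sum / diag;
        }
    }
    return x;
}
```

The analysis depends only on the sparsity pattern, so in an iterative method where the same triangular factor is applied many times, the levels can be built once and reused for every solve.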

The Alchemy Screen-space Ambient Obscurance Algorithm

Ambient obscurance (AO) produces perceptually important illumination effects such as darkened corners, cracks, and wrinkles; proximity darkening; and contact shadows. We present the AO algorithm from the Alchemy engine used at Vicarious Visions in commercial games. It is based on a new derivation of screen-space obscurance for robustness, and the insight that a falloff function can cancel terms in a visibility integral to favor efficient operations.

Decoupled Sampling for Graphics Pipelines

We propose a generalized approach to decoupling shading from visibility sampling in graphics pipelines, which we call decoupled sampling. Decoupled sampling enables stochastic supersampling of motion and defocus blur at reduced shading cost, as well as controllable or adaptive shading rates which trade off shading quality for performance. It can be thought of as a generalization of multisample antialiasing (MSAA) to support complex and dynamic mappings from visibility to shading samples, as introduced by motion and defocus blur and adaptive shading.

Clipless Dual-Space Bounds for Faster Stochastic Rasterization

We present a novel method for increasing the efficiency of stochastic rasterization of motion and defocus blur.

Contrary to earlier approaches, our method is efficient even with the low sampling densities commonly encountered in real-time rendering, while allowing the use of arbitrary sampling patterns for maximal image quality.

Our clipless dual-space formulation avoids problems with triangles that cross the camera plane during the shutter interval.

The method is also simple to plug into existing rendering systems.

Temporal Light Field Reconstruction for Rendering Distribution Effects

Traditionally, effects that require evaluating multidimensional integrals for each pixel, such as motion blur, depth of field, and soft shadows, suffer from noise due to the variance of the high-dimensional integrand. In this paper, we describe a general reconstruction technique that exploits the anisotropy in the temporal light field and permits efficient reuse of samples between pixels, multiplying the effective sampling rate by a large factor.

Restart Trail for Stackless BVH Traversal

A ray cast algorithm utilizing a hierarchical acceleration structure needs to perform a tree traversal in the hierarchy. In its basic form, executing the traversal requires a stack that holds the nodes that are still to be processed. In some cases, such a stack can be prohibitively expensive to maintain or access, due to storage or memory bandwidth limitations. The stack can, however, be eliminated or replaced with a fixed-size buffer using so-called stackless or short stack algorithms.

Two Methods for Fast Ray-Cast Ambient Occlusion

Ambient occlusion has proven to be a useful tool for producing realistic images, both in offline rendering and interactive applications. In production rendering, ambient occlusion is typically computed by casting a large number of short shadow rays from each visible point, yielding unparalleled quality but long rendering times. Interactive applications typically use screen-space approximations which are fast but suffer from systematic errors due to missing information behind the nearest depth layer.