Thrust: A Productivity-Oriented Library for CUDA

This chapter demonstrates how to leverage the Thrust parallel template library to implement high-performance applications with minimal programming effort. Based on the C++ Standard Template Library (STL), Thrust brings a familiar high-level interface to the realm of GPU Computing while remaining fully interoperable with the rest of the CUDA software ecosystem. Applications written with Thrust are concise, readable, and efficient.

High-Performance Software Rasterization on GPUs

In this paper, we implement an efficient, completely software-based graphics pipeline on a GPU. Unlike previous approaches, we obey ordering constraints imposed by current graphics APIs, guarantee hole-free rasterization, and support multisample antialiasing. Our goal is to examine the performance implications of not exploiting the fixed-function graphics pipeline, and to discern which additional hardware support would benefit software-based graphics the most. We present significant improvements over previous work in terms of scalability, performance, and capabilities.

Stratified Sampling for Stochastic Transparency

The traditional method of rendering semi-transparent surfaces using alpha blending requires sorting the surfaces in depth order. There are several techniques for order-independent transparency, but most either require unbounded storage or can be fragile due to forced compaction of information during rendering. Stochastic transparency works in a fixed amount of storage and produces results with the correct expected value. However, carelessly chosen sampling strategies easily result in high variance of the final pixel colors, which shows up as noise in the image.
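To make the variance remark concrete, here is a minimal sketch for a single surface of opacity alpha covering some of S samples in a pixel. It is an illustration under stated assumptions, not the method of the paper: the function names are hypothetical, and the stratified scheme assumed here sets the covered count to floor(alpha * S + xi) with xi uniform in [0, 1). Both estimators have expected coverage alpha; only their variances differ.

```python
import math


def variance_independent(alpha, samples):
    """Variance of the coverage estimate when each of the `samples` samples
    is covered independently with probability `alpha` (binomial count / S)."""
    return alpha * (1.0 - alpha) / samples


def variance_stratified(alpha, samples):
    """Variance of the coverage estimate when the covered count is
    floor(alpha * S + xi), xi uniform in [0, 1) (assumed scheme).
    The count is floor(alpha * S) plus a Bernoulli(frac) term, where
    frac is the fractional part of alpha * S."""
    scaled = alpha * samples
    frac = scaled - math.floor(scaled)
    return frac * (1.0 - frac) / samples ** 2


if __name__ == "__main__":
    for alpha in (0.3, 0.5, 0.9):
        print(alpha,
              variance_independent(alpha, 8),
              variance_stratified(alpha, 8))
```

Under these assumptions the stratified variance falls off as 1/S^2 rather than 1/S, and it vanishes whenever alpha * S is an integer, which is one way to see why the choice of sampling strategy matters for image noise.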

The Workflow Scale: Why 5x Faster Might Not Be Enough

This essay discusses qualitative versus quantitative accelerations of user tasks, in the context of computer animation production. A workflow regime is defined as a range of system response times in which the artist's relationship to the task is qualitatively similar. Radical new technology is much more likely to succeed when it brings an artist's workflow into a new regime, providing a discontinuous improvement in efficiency and final image quality.

Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods

Algebraic multigrid methods for large, sparse linear systems are a necessity in many computational simulations, yet parallel algorithms for such solvers are generally decomposed into coarse-grained tasks suitable for distributed computers with traditional processing cores. However, accelerating multigrid on massively parallel throughput-oriented processors, such as the GPU, demands algorithms with abundant fine-grained parallelism.

A Hybrid Method for Solving Tridiagonal Systems on the GPU

Tridiagonal linear systems are important to many problems in numerical analysis and computational fluid dynamics, as well as to computer graphics applications in video games and computer-animated films. Typical applications require solving hundreds or thousands of tridiagonal systems, which accounts for the majority of the total computation time. Fast parallel solutions are critical to larger scientific simulations, interactive computations of special effects in films, and real-time applications in video games.

Processing Device Arrays with C++ Metaprogramming

In this chapter, I will explain how C++ metaprogramming techniques can be used to simplify the development of CUDA-based libraries. The chapter will walk through the design of a high-level library to process arrays in CUDA. For illustrative purposes, I will describe this library in the context of numerical solutions of partial differential equations. The basic technique can be useful in a wide variety of application domains such as image processing, agent modeling, particle systems, or other data-parallel routines.

Simpler and Faster HLBVH with Work Queues

A recently developed algorithm called Hierarchical Linear Bounding Volume Hierarchies (HLBVH) has demonstrated the feasibility of reconstructing the spatial index needed for ray tracing in real time, even in the presence of millions of fully dynamic triangles. In this work we present a simpler and faster variant of HLBVH, where all the complex bookkeeping of prefix sums, compaction and partial breadth-first tree traversal needed for spatial partitioning has been replaced with an elegant pipeline built on top of efficient work queues and binary search.

VoxelPipe: A Programmable Pipeline for 3D Voxelization

We present a highly flexible and efficient software pipeline for programmable triangle voxelization. The pipeline, entirely written in CUDA, supports both fully conservative and thin voxelizations, multiple boolean, floating point, vector-typed render targets, user-defined vertex and fragment shaders, and a bucketing mode which can be used to generate 3D A-buffers containing the entire list of fragments belonging to each voxel.

Parallel Solution of Sparse Triangular Linear Systems in the Preconditioned Iterative Methods on the GPU

A novel algorithm for the parallel solution of a sparse triangular linear system on a graphics processing unit is proposed. It solves the triangular system in two phases. First, the analysis phase builds a dependency graph based on the matrix sparsity pattern and groups the independent rows into levels. Second, the solve phase obtains the full solution by iterating sequentially across the constructed levels. The solution elements corresponding to each level are computed at once, in parallel.
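The two phases can be sketched on the CPU as follows. This is a minimal illustration, assuming a lower triangular matrix stored as one dictionary of {column: value} per row with a nonzero diagonal; the function names and the storage format are hypothetical and are not taken from the paper. The inner loop over a level is written sequentially here, but its rows do not depend on one another, which is what a GPU implementation would exploit.

```python
def build_levels(rows):
    """Analysis phase: group rows of a lower triangular matrix into levels.

    rows[i] maps column index -> value for row i (diagonal included).
    Row i depends on every row j < i for which rows[i] has an entry in
    column j, so its level is one more than the deepest such dependency.
    """
    n = len(rows)
    level_of = [0] * n
    for i in range(n):
        deps = [level_of[j] for j in rows[i] if j != i]
        level_of[i] = 1 + max(deps) if deps else 0
    levels = [[] for _ in range(max(level_of, default=-1) + 1)]
    for i, lv in enumerate(level_of):
        levels[lv].append(i)
    return levels


def solve_lower(rows, b, levels):
    """Solve phase: visit levels in order; rows within a level are independent."""
    x = [0.0] * len(rows)
    for level in levels:
        for i in level:  # independent rows: candidates for parallel execution
            s = sum(v * x[j] for j, v in rows[i].items() if j != i)
            x[i] = (b[i] - s) / rows[i][i]
    return x


if __name__ == "__main__":
    rows = [
        {0: 2.0},
        {1: 4.0},
        {0: 1.0, 2: 1.0},
        {1: 1.0, 2: 1.0, 3: 2.0},
    ]
    levels = build_levels(rows)
    print(levels)
    print(solve_lower(rows, [2.0, 8.0, 4.0, 9.0], levels))
```

In this small example rows 0 and 1 have no dependencies and share the first level, while rows 2 and 3 each form a later level. The number of levels, rather than the number of rows, bounds the count of sequential steps.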