A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors

Modern graphics processing units (GPUs) employ a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complex thread scheduler as well as a large register file, which is expensive to access both in terms of energy and latency. We present two complementary techniques for reducing energy on massively-threaded processors such as GPUs.

Interactive Indirect Illumination Using Voxel Cone Tracing

Indirect illumination is an important element for realistic image synthesis, but its computation is expensive and highly dependent on the complexity of the scene and of the BRDF of the involved surfaces. While off-line computation and pre-baking can be acceptable for some cases, many applications (games, simulators, etc.) require real-time or interactive approaches to evaluate indirect illumination.

A Path Space Extension for Robust Light Transport Simulation

We propose a new sampling space for light transport simulation which allows to unify two popular algorithms with orthogonal strengths: unbiased Monte Carlo path integration and photon density estimation. Traditionally, unbiased Monte Carlo path integration has been considered the only approach for accurate light transport simulation. However, recent work in photon density estimation has demonstrated that there are several practical scene configurations where photon density estimation is more efficient and accurate.

Policy-based Tuning for Performance Portability and Library Co-optimization

Although modular programming is a fundamental software development practice, software reuse within contemporary GPU kernels is uncommon. For GPU software assets to be reusable across problem instances, they must be inherently flexible and tunable. To illustrate, we survey the performance-portability landscape for a suite of common GPU primitives, evaluating thousands of reasonable program variants across a large diversity of problem instances (microarchitecture, problem size, and data type).

Efficient Parallel Merge Sort for Fixed and Variable Length Keys

We design a high-performance parallel merge sort for highly parallel systems. Our merge sort is designed to use more register communication (not shared memory), and does not suffer from over-segmentation as opposed to previous comparison based sorts. Using these techniques we are able to achieve a sorting rate of 250 MKeys/sec, which is about 2.5 times faster than Thrust merge sort performance, and 70% faster than non-stable state-of-the-art GPU merge sorts.

Simulation of Bubbles in Foam with the Volume Control Method

Liquid and gas interactions often produce bubbles that stay for a long time without bursting on the surface, making a dry foam struc- ture. Such long lasting bubbles simulated by the level set method can suffer from a small but steady volume error that accumulates to a visible amount of volume change. We propose to address this problem by using the volume control method. We track the vol- ume change of each connected region, and apply a carefully com- puted divergence that compensates undesired volume changes.

Twister: A Space-warp Operator for the Two-handed Editing of 3D Shapes

A free-form deformation that warps a surface or solid may be spec- ied in terms of one or several point-displacement constraints that must be interpolated by the deformation. The Twister approach in- troduced here, adds the capability to impose an orientation change, adding three rotational constraints, at each displaced point. Further- more, it solves for a space warp that simultaneously interpolates two sets of such displacement and orientation constraints.

A Compile-Time Managed Multi-Level Register File Hierarchy

As processors increasingly become power limited, performance improvements will be achieved by rearchitecting systems with energy efficiency as the primary design constraint. While some of these optimizations will be hardware based, combined hardware and software techniques likely will be the most productive. This work redesigns the register file system of a modern throughput processor with a combined hardware and software solution that reduces register file energy without harming system performance.

Sparse Matrix-Vector Multiplication on Multicore and Accelerators

This chapter summarizes recent work on the development of highperformance multicore and accelerator-based implementations of sparse matrix-vector multiplication (SpMV). As an object of study, SpMV is an interesting computation for at least two reasons. First, it appears widely in applications in scienti c and engineering computing, nancial and economic modeling, and information retrieval, among others, and is therefore of great practical interest.

Allocation-oriented Algorithm Design with Application to GPU Computing, Ph.D. Dissertation

The wide data-parallelism of GPU processor design facilitates the execution of many concurrent, fine-grained tasks with unprecedented performance and efficiency. However, contemporary opinion regards GPU architecture as being poorly suited for many parallelizations that impose dynamic dependences between processing elements. This dissertation specifically attends to problems whose efficient solutions require cooperative allocation, i.e., their dependences stem from dynamic data placement within shared structures.