Policy-based Tuning for Performance Portability and Library Co-optimization

Although modular programming is a fundamental software development practice, software reuse within contemporary GPU kernels is uncommon. For GPU software assets to be reusable across problem instances, they must be inherently flexible and tunable. To illustrate, we survey the performance-portability landscape for a suite of common GPU primitives, evaluating thousands of reasonable program variants across a large diversity of problem instances (microarchitecture, problem size, and data type).

Efficient Parallel Merge Sort for Fixed and Variable Length Keys

We design a high-performance parallel merge sort for highly parallel systems. Our merge sort is designed to use more register communication (not shared memory), and does not suffer from over-segmentation as opposed to previous comparison based sorts. Using these techniques we are able to achieve a sorting rate of 250 MKeys/sec, which is about 2.5 times faster than Thrust merge sort performance, and 70% faster than non-stable state-of-the-art GPU merge sorts.

Simulation of Bubbles in Foam with the Volume Control Method

Liquid and gas interactions often produce bubbles that stay for a long time without bursting on the surface, making a dry foam struc- ture. Such long lasting bubbles simulated by the level set method can suffer from a small but steady volume error that accumulates to a visible amount of volume change. We propose to address this problem by using the volume control method. We track the vol- ume change of each connected region, and apply a carefully com- puted divergence that compensates undesired volume changes.

Twister: A Space-warp Operator for the Two-handed Editing of 3D Shapes

A free-form deformation that warps a surface or solid may be spec- ied in terms of one or several point-displacement constraints that must be interpolated by the deformation. The Twister approach in- troduced here, adds the capability to impose an orientation change, adding three rotational constraints, at each displaced point. Further- more, it solves for a space warp that simultaneously interpolates two sets of such displacement and orientation constraints.

A Compile-Time Managed Multi-Level Register File Hierarchy

As processors increasingly become power limited, performance improvements will be achieved by rearchitecting systems with energy efficiency as the primary design constraint. While some of these optimizations will be hardware based, combined hardware and software techniques likely will be the most productive. This work redesigns the register file system of a modern throughput processor with a combined hardware and software solution that reduces register file energy without harming system performance.

Sparse Matrix-Vector Multiplication on Multicore and Accelerators

This chapter summarizes recent work on the development of highperformance multicore and accelerator-based implementations of sparse matrix-vector multiplication (SpMV). As an object of study, SpMV is an interesting computation for at least two reasons. First, it appears widely in applications in scienti c and engineering computing, nancial and economic modeling, and information retrieval, among others, and is therefore of great practical interest.

Allocation-oriented Algorithm Design with Application to GPU Computing, Ph.D. Dissertation

The wide data-parallelism of GPU processor design facilitates the execution of many concurrent, fine-grained tasks with unprecedented performance and efficiency. However, contemporary opinion regards GPU architecture as being poorly suited for many parallelizations that impose dynamic dependences between processing elements. This dissertation specifically attends to problems whose efficient solutions require cooperative allocation, i.e., their dependences stem from dynamic data placement within shared structures.

Scalable GPU Graph Traversal

Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter.

Gaussian Process Regression Flow for Analysis of Motion Trajectories

Recognition of motions and activities of objects in videos requires effective representations for analysis and matching of motion trajectories. In this paper, we introduce a new representation specifically aimed at matching motion trajectories. We model a trajectory as a continuous dense flow field from a sparse set of vector sequences using Gaussian Process Regression. Furthermore, we introduce a random sampling strategy for learning stable classes of motions from limited data.