Allocation-oriented Algorithm Design with Application to GPU Computing, Ph.D. Dissertation
The wide data-parallelism of GPU processor design facilitates the execution of many concurrent, fine-grained tasks with unprecedented performance and efficiency. However, contemporary opinion regards GPU architecture as being poorly suited for many parallelizations that impose dynamic dependences between processing elements. This dissertation specifically attends to problems whose efficient solutions require cooperative allocation, i.e., their dependences stem from dynamic data placement within shared structures. The contribution of this thesis is a set of design patterns and reusable, tunable software primitives that foster the construction of cooperative, allocation-oriented parallelizations that are significantly faster, more efficient, more flexible, and more performance-portable than the state of the art.
Whereas concurrent CPU programs typically leverage fine-grained atomic operations for coordinating data placement, we demonstrate the advantages of parallel prefix sum as a high-performance, bulk-synchronous alternative for cooperative allocation. We construct very efficient algorithms for global and local prefix sum by employing flexible granularity-coarsening techniques that properly balance serial and parallel workloads. The resulting efficiency gains permit kernel-fusion techniques whereby application-specific logic can be incorporated within our prefix sum constructs with little or no overhead.
To demonstrate the viability of our methods, we construct cooperative GPU implementations for a variety of parallel list-processing primitives, including reduction, prefix scan, duplicate removal, histogram, and reduce-by-key. We evaluate their performance across a wide spectrum of problem sizes, types, and target architectures. To achieve performance portability, we present a policy-based autotuning design idiom in which implementation decisions having opaque performance consequences are left unbound within the program text. Finally, we showcase high-performance implementations for two archetypal problems in computer science: sorting and breadth-first search (BFS). We achieve excellent performance for both problems, demonstrating multiple factors of speedup over prior parallelizations for GPU and conventional multi-core platforms.