Convergence and Scalarization for Data-Parallel Architectures
Modern throughput processors such as GPUs achieve high performance and efficiency by exploiting data parallelism in application kernels expressed as threaded code. One drawback of this approach compared to conventional vector architectures is redundant execution of instructions that are common across multiple threads, resulting in energy inefficiency due to excess instruction dispatch, register file accesses, and memory operations. This paper proposes to alleviate these overheads while retaining the threaded programming model by automatically detecting the scalar operations and factoring them out of the parallel code. We have developed a scalarizing compiler that employs convergence and variance analyses to statically identify values and instructions that are invariant across multiple threads. Our compiler algorithms are effective at identifying convergent execution even in programs with arbitrary control flow, identifying two-thirds of the opportunity captured by a dynamic oracle. The compile-time analysis leads to a reduction in instructions dispatched by 29%, register file reads and writes by 31%, memory address counts by 47%, and data access counts by 38%.
This material is posted here with permission of the IEEE. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to email@example.com.