Stan Birchfield

Stan Birchfield is a Principal Research Scientist and Senior Research Manager at NVIDIA, exploring the intersection of computer vision and robotics. Prior to joining NVIDIA, he was a tenured professor at Clemson University, where he led research in computer vision, visual tracking, mobile robotics, robotic manipulation, and the perception of highly deformable objects. He remains an adjunct faculty member at Clemson.

Page Placement Strategies for GPUs within Heterogeneous Memory Systems

Systems from smartphones to supercomputers are increasingly heterogeneous, combining CPUs and GPUs. To maximize cost and energy efficiency, these systems will increasingly use globally addressable heterogeneous memory systems, making choices about memory page placement critical to performance. In this work we show that current page placement policies are not sufficient to maximize GPU performance in these heterogeneous memory systems. We propose two new page placement policies that improve GPU performance: one application-agnostic and one using application profile information.
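One way to picture an application-agnostic placement policy is to interleave pages across memories in proportion to each memory's bandwidth, so that a streaming workload saturates both at once. The sketch below is illustrative only; the zone names and bandwidth figures are hypothetical, not the paper's implementation or measurements.

```python
# Sketch of bandwidth-proportional page placement: distribute pages
# across memory zones in the ratio of their bandwidths, so neither
# memory becomes the lone bottleneck under streaming access.
# Bandwidths here are hypothetical examples.

def bandwidth_aware_placement(num_pages, bandwidths):
    """Return how many pages to place in each memory zone."""
    total_bw = sum(bandwidths.values())
    placement = {zone: int(num_pages * bw / total_bw)
                 for zone, bw in bandwidths.items()}
    # Give any remainder from integer truncation to the fastest zone.
    fastest = max(bandwidths, key=bandwidths.get)
    placement[fastest] += num_pages - sum(placement.values())
    return placement

# Example: 1000 pages, GPU-attached memory at 200 GB/s, CPU DDR at 50 GB/s.
print(bandwidth_aware_placement(1000, {"gddr": 200, "ddr": 50}))
# -> {'gddr': 800, 'ddr': 200}
```

The profile-guided variant would replace the fixed ratio with per-allocation access counts, placing the hottest pages in the higher-bandwidth memory first.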

Unlocking Bandwidth for GPUs in CC-NUMA systems

Historically, GPU-based HPC applications have had a substantial memory bandwidth advantage over CPU-based workloads due to using GDDR rather than DDR memory. However, past GPUs required a restricted programming model in which the programmer allocated application data up front and explicitly copied it into GPU memory before launching a GPU kernel. Recently, GPUs have eased this requirement: they can now employ on-demand software page migration between CPU and GPU memory, obviating explicit copying.

Flexible Software Profiling of GPU Architectures

To aid application characterization and architecture design space exploration, researchers and engineers have developed a wide range of tools for CPUs, including simulators, profilers, and binary instrumentation tools. With the advent of GPU computing, GPU manufacturers have developed similar tools leveraging hardware profiling and debugging hooks. To date, these tools are largely limited by the fixed menu of options provided by the tool developer and do not offer the user the flexibility to observe or act on events not in the menu.

A Variable Warp Size Architecture

This paper studies the effect of warp sizing and scheduling on performance and efficiency in GPUs. We propose Variable Warp Sizing (VWS), which improves the performance of divergent applications by using a small base warp size in the presence of control flow and memory divergence. When appropriate, our proposed technique groups sets of these smaller warps together by ganging their execution in the warp scheduler, improving performance and energy efficiency for regular applications.

Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems

Main memory bandwidth is a critical bottleneck for modern GPU systems due to limited off-chip pin bandwidth. 3D-stacked memory architectures provide a promising opportunity to significantly alleviate this bottleneck by directly connecting a logic layer to the DRAM layers with high bandwidth connections. Recent work has shown promising potential performance benefits from an architecture that connects multiple such 3D-stacked memories and offloads bandwidth-intensive computations to a GPU in each of the logic layers.
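The offloading decision can be viewed as a traffic cost model: keeping a code block on the main GPU sends all of its loads and stores across the off-chip link, while offloading it to near-memory compute sends back only the result. The sketch below is a hypothetical illustration of that reasoning, not TOM's actual mechanism, and the function and quantities are assumptions.

```python
def should_offload(bytes_loaded, bytes_stored, bytes_shipped_back):
    """Mark a block as an offload candidate when near-memory execution
    would reduce off-chip link traffic.

    Without offloading, every load and store crosses the off-chip link.
    With offloading, only the (often small) result crosses back.
    """
    traffic_host = bytes_loaded + bytes_stored
    traffic_offloaded = bytes_shipped_back
    return traffic_offloaded < traffic_host

# A reduction that reads 1 MB and ships back 8 bytes is a clear candidate.
print(should_offload(1 << 20, 0, 8))   # -> True
# A block that ships back more than it touches locally is not.
print(should_offload(4096, 0, 8192))   # -> False
```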

CLARA: Circular Linked-List Auto- and Self-Refresh Architecture

With increasing DRAM densities, the performance and energy overheads of refresh operations are increasingly significant. When the system is active, refresh commands render DRAM banks unavailable for increasing periods of time. These refresh operations can interfere with regular memory operations and hurt performance. In addition, when the system is idle, DRAM self-refresh is the dominant source of energy consumption, and it directly impacts battery life and standby time.
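The active-mode cost can be made concrete with standard DDR timing parameters: a refresh command arrives every tREFI and makes the target rank unavailable for tRFC, so the unavailable fraction is tRFC / tREFI. The figures below are representative DDR4 values used for illustration; a specific part's datasheet governs.

```python
# Fraction of time DRAM is unavailable due to refresh: tRFC / tREFI.
# Representative DDR4 timings (illustrative, not from any one datasheet).
T_REFI_NS = 7800.0   # average refresh interval, 7.8 us

for density, t_rfc_ns in [("2 Gb", 160.0), ("4 Gb", 260.0), ("8 Gb", 350.0)]:
    overhead = t_rfc_ns / T_REFI_NS
    print(f"{density}: {overhead:.1%} of time unavailable")
# The overhead grows from about 2% at 2 Gb to about 4.5% at 8 Gb,
# since tRFC lengthens with density while tREFI stays fixed.
```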

Designing Efficient Heterogeneous Memory Architectures

The authors' model of energy, bandwidth, and latency for DRAM technologies enables exploration of memory hierarchies that combine heterogeneous memory technologies with different attributes. Analysis shows that the gap between on- and off-package DRAM technologies is narrower than that found between cache layers in traditional memory hierarchies. Thus, heterogeneous memory caches must achieve high hit rates or risk degrading both system energy and bandwidth efficiency.
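The hit-rate requirement follows from a simple average-energy model: a cache layer pays its own access energy on every reference plus the backing memory's energy on misses, so it saves energy only when the hit rate exceeds the ratio of the two energies. The per-bit energies below are illustrative assumptions, not the authors' measurements.

```python
def breakeven_hit_rate(e_cache, e_backing):
    """Minimum hit rate at which a cache layer saves energy.

    Average energy with the cache: e_cache + (1 - h) * e_backing.
    Energy without it: e_backing. Setting them equal gives
    h = e_cache / e_backing.
    """
    return e_cache / e_backing

# Illustrative energies (pJ/bit). A conventional SRAM cache in front of
# DRAM pays off at a tiny hit rate ...
print(f"SRAM over DRAM: {breakeven_hit_rate(1.0, 20.0):.0%}")   # -> 5%
# ... but on-package DRAM caching off-package DRAM, with a much narrower
# energy gap, must hit most of the time to break even.
print(f"HBM over DDR:   {breakeven_hit_rate(5.0, 8.0):.0%}")    # -> 62%
```

This is the narrow-gap effect the abstract describes: as the on- versus off-package energy ratio approaches one, the break-even hit rate approaches 100%.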