Beyond the Socket: NUMA-Aware GPUs

GPUs achieve high throughput and power efficiency by employing many small single-instruction, multiple-thread (SIMT) cores. To minimize scheduling logic and performance variance, they rely on a uniform memory system and on the strong data parallelism exposed by the programming model. With Moore's law slowing, GPUs are likely to continue scaling performance, which largely depends on SIMT core count, by embracing multi-socket designs in which transistors are more readily available.
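
To make the locality concern concrete, here is a rough analytical sketch of how effective bandwidth drops as more accesses in a hypothetical multi-socket GPU go to a remote socket's memory over a lower-bandwidth inter-socket link; the bandwidth figures and socket assumptions are illustrative, not numbers from the paper.

```python
# Illustrative model (assumed numbers): effective memory bandwidth seen by one
# socket of a hypothetical multi-socket GPU when a fraction of its accesses
# must cross a lower-bandwidth inter-socket link.
def effective_bandwidth(local_gbs=900.0, link_gbs=128.0, remote_fraction=0.3):
    local_time = (1.0 - remote_fraction) / local_gbs   # time per byte served locally
    remote_time = remote_fraction / link_gbs           # time per byte over the link
    return 1.0 / (local_time + remote_time)            # harmonic-mean bandwidth, GB/s

for f in (0.0, 0.1, 0.3, 0.5):
    print(f"remote fraction {f:.1f}: {effective_bandwidth(remote_fraction=f):.0f} GB/s")
```

Even a modest remote-access fraction pulls throughput toward the link bandwidth, which is why NUMA-aware scheduling and data placement are central to such designs.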

Ben Boudaoud

Ben joined NVIDIA in January 2018 as research staff in the New Experiences Research group. Prior to joining NVIDIA, he worked on ultra-low-power circuit and system design for medical products, including wearable and implantable cardiac monitors. He received his M.S. from the University of Virginia in 2014, where his work focused on the development and deployment of wearable 6- and 9-DoF motion-sensing platforms for clinical applications.

Ben's research interests include techniques for low-power, high-efficiency circuit and system design, as well as applications of low-power sensors and systems in the VR/AR space.


Sifei Liu

Sifei Liu's research interests are in computer vision and machine learning. Previously, she was a Ph.D. student in VLLAB in the EECS department at the University of California, Merced, under Prof. Ming-Hsuan Yang. She completed her M.C.S. at the University of Science and Technology of China (USTC) under Stan Z. Li and Bin Li, and received her B.S. in control science and technology from North China Electric Power University.


Ankur Handa


Toward Standardized Near-Data Processing with Unrestricted Data Placement for GPUs

3D-stacked memory devices with processing logic can help alleviate the memory bandwidth bottleneck in GPUs. However, for such Near-Data Processing (NDP) memory stacks to be usable across different GPU architectures, the NDP architecture should be standardized. Our proposal enables this standardization by allowing data to be spread across multiple memory stacks, as is the norm in high-performance systems, without requiring an MMU on the NDP stack.
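
As a minimal illustration of the placement question (the stack count and interleaving granularity below are assumptions chosen for this sketch, not values from the paper): fine-grained address interleaving spreads even a small array across every stack, so NDP logic on one stack cannot assume the data it operates on is local.

```python
# Illustrative sketch (assumed stack count and interleaving granularity):
# under fine-grained address interleaving, consecutive chunks of one array
# land on different memory stacks.
NUM_STACKS = 4
INTERLEAVE_BYTES = 256

def home_stack(addr):
    # Consecutive 256-byte chunks rotate round-robin across the stacks.
    return (addr // INTERLEAVE_BYTES) % NUM_STACKS

# A 4 KB array starting at an arbitrary base address is spread over every
# stack, so NDP logic on any single stack sees only a quarter of it locally.
base = 0x10000
print([home_stack(base + off) for off in range(0, 4096, INTERLEAVE_BYTES)])
```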

Fine-Grained DRAM: Energy-Efficient DRAM for Extreme Bandwidth Systems

Future GPUs and other high-performance throughput processors will require multiple TB/s of bandwidth to DRAM, and satisfying this demand within an acceptable energy budget is a challenge in these extreme-bandwidth memory systems. We propose a new high-bandwidth DRAM architecture, Fine-Grained DRAM (FGDRAM), which improves bandwidth by 4× and energy efficiency by 2× relative to the highest-bandwidth, most energy-efficient contemporary DRAM, High Bandwidth Memory (HBM2).
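
A back-of-envelope estimate shows why energy per bit becomes the limiter at these bandwidths; the pJ/bit figures below are assumed for illustration and are not the paper's measurements.

```python
# Back-of-envelope DRAM power at extreme bandwidth (illustrative numbers only,
# not measurements from the paper).
def dram_power_watts(bandwidth_tb_s, energy_pj_per_bit):
    bits_per_second = bandwidth_tb_s * 1e12 * 8          # TB/s -> bits/s
    return bits_per_second * energy_pj_per_bit * 1e-12   # pJ/s -> W

# Assumed ~4 pJ/bit for an HBM2-class interface vs. ~2 pJ/bit after a 2x
# efficiency gain, at a hypothetical 4 TB/s memory system.
print(dram_power_watts(4, 4.0))  # ~128 W spent in DRAM alone
print(dram_power_watts(4, 2.0))  # ~64 W with 2x better energy per bit
```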

Learning Adaptive Parameter Tuning for Image Processing

The non-stationary nature of image characteristics calls for adaptive processing based on local image content. We propose a simple and flexible method to learn local parameter tuning in adaptive image processing: we extract simple local features from an image and learn the relation between these features and the optimal filtering parameters. Learning is performed by optimizing a user-defined cost function (any image-quality metric) on a training set.
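
A minimal sketch of this pipeline, using a hypothetical smoothing filter, patch variance as the single local feature, MSE as a stand-in quality metric, and a linear fit as the learned relation; all of these are illustrative choices, not the method's actual components.

```python
# Sketch (assumed setup): learn a mapping from simple local features to
# per-patch filter parameters by optimizing an image-quality cost on training data.
import numpy as np

def local_feature(patch):
    # One simple local feature: patch variance (a proxy for texture/noise).
    return patch.var()

def smooth(patch, strength):
    # Hypothetical adaptive filter: blend the patch toward its mean.
    return (1 - strength) * patch + strength * patch.mean()

def cost(filtered, clean):
    # Any image-quality metric can be plugged in; MSE is used here for simplicity.
    return ((filtered - clean) ** 2).mean()

def fit(noisy_patches, clean_patches, strengths=np.linspace(0, 1, 21)):
    # For each training patch, find the best filter strength, then fit a linear
    # relation feature -> parameter (a tiny stand-in for the learned mapping).
    feats, best = [], []
    for noisy, clean in zip(noisy_patches, clean_patches):
        feats.append(local_feature(noisy))
        best.append(min(strengths, key=lambda s: cost(smooth(noisy, s), clean)))
    slope, intercept = np.polyfit(feats, best, 1)
    return lambda patch: float(np.clip(slope * local_feature(patch) + intercept, 0, 1))
```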

Parallel Complexity of Forward and Backward Propagation

We show that forward and backward propagation can be formulated as the solution of lower- and upper-triangular systems of equations. For standard feedforward neural networks (FNNs) and recurrent neural networks (RNNs), these triangular systems are always block bi-diagonal, while for a general computation graph (a directed acyclic graph) they can have a more complex triangular sparsity pattern. We discuss direct and iterative parallel algorithms for their solution, which can be interpreted as different ways of performing model parallelism. We also show that for FNNs with k layers and RNNs with k layers and t time steps, backward propagation can be performed in parallel in O(log k) and O(log k log t) steps, respectively. Finally, we outline a generalization of this technique using Jacobians that potentially allows arbitrary layers to be handled.
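
To make the feedforward case explicit (the notation below is ours, chosen for illustration): writing δ_l for the loss gradient with respect to the activations of layer l and J_l for the Jacobian of layer l with respect to its input, the backward pass solves a block bi-diagonal upper-triangular system whose back-substitution is exactly sequential backpropagation.

```latex
\begin{pmatrix}
I & -J_2^{\top} &        &             \\
  & I           & \ddots &             \\
  &             & \ddots & -J_k^{\top} \\
  &             &        & I
\end{pmatrix}
\begin{pmatrix}
\delta_1 \\ \delta_2 \\ \vdots \\ \delta_k
\end{pmatrix}
=
\begin{pmatrix}
0 \\ 0 \\ \vdots \\ \nabla_{z_k} L
\end{pmatrix}
```

Replacing back-substitution with a parallel triangular solver over the J_l blocks is what yields the logarithmic-depth bounds quoted above.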
