Integral Equations and Machine Learning

As both light transport simulation and reinforcement learning are governed by the same Fredholm integral equation of the second kind, machine learning techniques can be used for efficient photorealistic image synthesis: light transport paths are guided by an approximate solution to the integral equation that is learned during rendering.
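For reference, both equations share the second-kind Fredholm form u = g + Tu. The following is a standard textbook form of the correspondence, not notation taken verbatim from the paper:

    L(x, \omega) = L_e(x, \omega) + \int_\Omega f_r(x, \omega_i, \omega)\, L(h(x, \omega_i), -\omega_i) \cos\theta_i \, d\omega_i   (light transport)

    Q(s, a) = r(s, a) + \gamma \int_S p(s' \mid s, a) \max_{a'} Q(s', a') \, ds'   (Q-learning)

In both cases the unknown (the radiance L, or the action-value function Q) appears on both sides of the equation, once under the integral. This shared structure is what allows an estimate learned by value-iteration-style updates during sampling to guide light transport paths.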

Beyond the Socket: NUMA-Aware GPUs

GPUs achieve high throughput and power efficiency by employing many small single-instruction, multiple-thread (SIMT) cores. To minimize scheduling logic and performance variance, they use a uniform memory system and leverage the strong data parallelism exposed by the programming model. With Moore's law slowing, GPUs are likely to embrace multi-socket designs, where transistors are more readily available, in order to continue scaling performance, which largely depends on SIMT core count.

Zhiding Yu

Zhiding Yu is a senior research scientist at NVIDIA Research. Before joining NVIDIA, he received his Ph.D. in ECE from Carnegie Mellon University in 2017. His research interests focus on deep representation learning, weakly/self-supervised learning, transfer learning, and deep structured prediction, with applications to visual recognition and general computer vision problems. He served as an area chair for NeurIPS 2021/2022 and WACV 2023.

Ben Boudaoud

Ben joined NVIDIA in January 2018 as research staff in the New Experiences Research group. Prior to joining NVIDIA, he worked on ultra-low-power circuit and system design for medical products, including wearable and implantable cardiac monitors. He received his MS from the University of Virginia in 2014, where his work focused on the development and deployment of wearable 6- and 9-DoF motion-sensing platforms for clinical applications.

Sifei Liu

Sifei Liu is a senior research scientist at NVIDIA Research in Santa Clara, US. She received her PhD from the Department of EECS at the University of California, Merced, where she was advised by Prof. Ming-Hsuan Yang. Before that, she obtained her master's in ECE from the University of Science and Technology of China (USTC), under the supervision of Prof. Stan Z. Li and Prof. Bin Li, and her bachelor's in control science from North China Electric Power University (NCEPU).

Toward Standardized Near-Data Processing with Unrestricted Data Placement for GPUs

3D-stacked memory devices with processing logic can help alleviate the memory bandwidth bottleneck in GPUs. However, for such Near-Data Processing (NDP) memory stacks to be usable across different GPU architectures, it is desirable to standardize the NDP architecture. Our proposal enables this standardization by allowing data to be spread across multiple memory stacks, as is the norm in high-performance systems, without requiring an MMU on the NDP stack.
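As a rough illustration of why unrestricted placement is the hard case for NDP, consider fine-grained interleaving of physical addresses across stacks. The stack count, granularity, and mapping function below are hypothetical, chosen only to show the striping effect; they are not taken from the paper:

    # Hypothetical system: 4 NDP memory stacks, 256 B interleaving granularity.
    NUM_STACKS = 4
    GRANULARITY = 256  # bytes

    def home_stack(phys_addr: int) -> int:
        """Stack that owns a physical address under fine-grained
        interleaving: consecutive 256 B blocks rotate across stacks."""
        return (phys_addr // GRANULARITY) % NUM_STACKS

    # A contiguous 4 KiB buffer is striped over every stack, so an NDP
    # kernel running on stack 0 finds only a quarter of it locally.
    buffer_base = 0x1000_0000
    local = sum(home_stack(buffer_base + off) == 0
                for off in range(0, 4096, GRANULARITY))
    print(f"{local} of {4096 // GRANULARITY} blocks are local to stack 0")

Handling this striped placement without a per-stack MMU is what would let the same NDP stack design be reused across GPU architectures.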

Fine-Grained DRAM: Energy-Efficient DRAM for Extreme Bandwidth Systems

Future GPUs and other high-performance throughput processors will require multiple TB/s of bandwidth to DRAM. Satisfying this demand within an acceptable energy budget is a challenge for such extreme-bandwidth memory systems. We propose a new high-bandwidth DRAM architecture, Fine-Grained DRAM (FGDRAM), which improves bandwidth by 4× and energy efficiency by 2× relative to the highest-bandwidth, most energy-efficient contemporary DRAM, High Bandwidth Memory (HBM2).

Learning Adaptive Parameter Tuning for Image Processing

The non-stationary nature of image characteristics calls for adaptive processing based on the local image content. We propose a simple and flexible method to learn local tuning of parameters in adaptive image processing: we extract simple local features from an image and learn the relation between these features and the optimal filtering parameters. Learning is performed by optimizing a user-defined cost function (any image quality metric) on a training set.
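A minimal sketch of this recipe, with illustrative choices that are assumptions rather than the paper's exact design: local mean/variance window features and a linear least-squares fit from features to the per-pixel filter parameter found optimal on the training set.

    import numpy as np

    def local_features(img, k=5):
        # Simple local features: mean and variance over a k x k window.
        pad = k // 2
        p = np.pad(img, pad, mode="reflect")
        win = np.lib.stride_tricks.sliding_window_view(p, (k, k))
        flat = win.reshape(*img.shape, -1)
        return np.stack([flat.mean(-1), flat.var(-1)], axis=-1)

    def fit_parameter_map(feats, best_params):
        # Least-squares fit from local features (plus a bias term) to the
        # per-pixel parameter that optimized the quality metric in training.
        X = feats.reshape(-1, feats.shape[-1])
        X = np.concatenate([X, np.ones((X.shape[0], 1))], axis=1)
        w, *_ = np.linalg.lstsq(X, best_params.ravel(), rcond=None)
        return w

    def predict_parameters(feats, w):
        # Apply the learned feature-to-parameter mapping to a new image.
        X = feats.reshape(-1, feats.shape[-1])
        X = np.concatenate([X, np.ones((X.shape[0], 1))], axis=1)
        return (X @ w).reshape(feats.shape[:-1])

Given training images where the locally optimal parameter has been found by search against the chosen quality metric, fit_parameter_map learns the mapping and predict_parameters produces a per-pixel parameter map for unseen images.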

Parallel Complexity of Forward and Backward Propagation

We show that forward and backward propagation can be formulated as the solution of lower and upper triangular systems of equations, respectively. For standard feedforward (FNNs) and recurrent neural networks (RNNs) the triangular systems are always block bi-diagonal, while for a general computation graph (a directed acyclic graph) they can have a more complex triangular sparsity pattern. We discuss direct and iterative parallel algorithms that can be used to solve these systems and that can be interpreted as different ways of performing model parallelism. We also show that for FNNs with k layers and RNNs with k layers and t time steps, backward propagation can be performed in parallel in O(log k) and O(log k log t) steps, respectively. Finally, we outline how this technique generalizes via Jacobians, potentially allowing arbitrary layers to be handled.
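As a sketch in standard backpropagation notation (our notation, not necessarily the paper's): with pre-activations z_l, activation derivative D_l = \mathrm{diag}(f'(z_l)), and the backward recurrence \delta_l = D_l W_{l+1}^\top \delta_{l+1} for a k-layer FNN, the gradients solve a block bi-diagonal upper triangular linear system:

    \begin{pmatrix}
    I & -D_1 W_2^\top & & \\
      & I & \ddots & \\
      & & \ddots & -D_{k-1} W_k^\top \\
      & & & I
    \end{pmatrix}
    \begin{pmatrix} \delta_1 \\ \delta_2 \\ \vdots \\ \delta_k \end{pmatrix}
    =
    \begin{pmatrix} 0 \\ \vdots \\ 0 \\ D_k \nabla_{a_k} C \end{pmatrix}

Back-substitution reproduces sequential backpropagation, while solving the bi-diagonal system with a parallel prefix (scan) over the blocks yields the O(log k) parallel depth stated above.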