Zhiding Yu

I am a principal research scientist and research lead at the Learning and Perception Research Group, NVIDIA Research. Before joining NVIDIA, I obtained Ph.D. in ECE from Carnegie Mellon University in 2017, and M.Phil. in ECE from The Hong Kong University of Science and Technology in 2012.

Ben Boudaoud

Ben joined NVIDIA in January 2018 as research staff in the New Experiences Research group. Prior to joining NVIDIA he worked on ultra-low power circuit and system design for medical products including wearable and implantable monitors for the cardiac space. He received his MS from the University of Virginia in 2014 where his work focused on development and deployment of wearable 6 and 9 DoF motion sensing platforms for clinical applications.

Sifei Liu

Sifei Liu is a senior research scientist at Nvidia Research in Santa Clara, US. She received her PhD from the University of California Merced, department of EECS, where she was advised by Prof. Ming-Hsuan Yang. Before that, she obtained her master’s in ECE from University of Science and Technology of China (USTC), under the supervision of Prof. Stan.Z Li and Prof. Bin Li, and bachelor’s in control science from North China Electric Power University (NCEPU).

Toward Standardized Near-Data Processing with Unrestricted Data Placement for GPUs

3D-stacked memory devices with processing logic can help alleviate the memory bandwidth bottleneck in GPUs. However, in order for such Near-Data Processing (NDP) memory stacks to be used for different GPU architectures, it is desirable to standardize the NDP architecture. Our proposal enables this standardization by allowing data to be spread across multiple memory stacks as is the norm in high-performance systems without an MMU on the NDP stack.

Fine-Grained DRAM: Energy-Efficient DRAM for Extreme Bandwidth Systems

Future GPUs and other high-performance throughput processors will require multiple TB/s of bandwidth to DRAM. Satisfying this bandwidth demand within an acceptable energy budget is a challenge in these extreme bandwidth memory systems. We propose a new high-bandwidth DRAM architecture, Fine-Grained DRAM (FGDRAM), which improves bandwidth by 4× and improves the energy efficiency of DRAM by 2× relative to the highest-bandwidth, most energy-efficient contemporary DRAM, High Bandwidth Memory (HBM2).

Learning Adaptive Parameter Tuning for Image Processing

The non-stationary nature of image characteristics calls for adaptive processing, based on the local image content. We propose a simple and flexible method to learn local tuning of parameters in adaptive image processing: we extract simple local features from an image and learn the relation between these features and the optimal filtering parameters. Learning is performed by optimizing a user defined cost function (any image quality metric) on a training set.

Parallel Complexity of Forward and Backward Propagation

We show that the forward and backward propagation can be formulated as a solution of lower and upper triangular systems of equations. For standard feedforward (FNNs) and recurrent neural networks (RNNs) the triangular systems are always block bi-diagonal, while for a general computation graph (directed acyclic graph) they can have a more complex triangular sparsity pattern. We discuss direct and iterative parallel algorithms that can be used for their solution and interpreted as different ways of performing model parallelism. Also, we show that for FNNs and RNNs with k layers and t time steps the backward propagation can be performed in parallel in O(log k) and O(log k log t) steps, respectively. Finally, we outline the generalization of this technique using Jacobians that potentially allows us to handle arbitrary layers.

AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks

Training deep neural networks with Stochastic Gradient Descent, or its variants, requires careful choice of both learning rate and batch size. While smaller batch sizes generally converge in fewer training epochs, larger batch sizes offer more parallelism and hence better computational efficiency. We have developed a new training approach that, rather than statically choosing a single batch size for all epochs, adaptively increases the batch size during the training process. Our method delivers the convergence rate of small batch sizes while achieving performance similar to large batch sizes.

Sim-to-Real Transfer of Accurate Grasping with Eye-In-Hand Observations and Continuous Control

In the context of deep learning for robotics, we show effective method of training a real robot to grasp a tiny sphere (1:37cm of diameter), with an original combination of system design choices. We decompose the end-to-end system into a vision module and a closed-loop controller module. The two modules use target object segmentation as their common interface. The vision module extracts information from the robot end-effector camera, in the form of a binary segmentation mask of the target.