Neural Temporal Adaptive Sampling and Denoising

Despite recent advances in Monte Carlo path tracing at interactive rates, denoised image sequences generated with few samples per pixel often suffer from temporal instability and loss of high-frequency detail. We present a novel adaptive rendering method that improves the temporal stability and image fidelity of low-sample-count path tracing by distributing samples via joint spatio-temporal optimization of sampling and denoising.
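
As a rough illustration of the idea, the PyTorch sketch below jointly trains a sampling-map network and a denoiser with a spatial reconstruction loss plus a temporal-flicker penalty. The render() stub, network shapes, and loss weight are placeholders, not the paper's actual renderer or architecture.

import torch
import torch.nn as nn

class SampleMapNet(nn.Module):
    """Predicts a per-pixel sampling density from cheap auxiliary features."""
    def __init__(self, in_ch=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Softplus())  # positive densities
    def forward(self, feats):
        d = self.net(feats)
        return d / d.mean().clamp_min(1e-8)  # normalize to a unit-mean budget

class Denoiser(nn.Module):
    """Denoises the rendered frame, conditioned on the sampling density."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1))
    def forward(self, noisy, density):
        return self.net(torch.cat([noisy, density], dim=1))

def render(reference, density, spp=4):
    # Placeholder renderer: per-pixel noise shrinks as its sample count grows.
    return reference + torch.randn_like(reference) / (density * spp).sqrt()

sampler, denoiser = SampleMapNet(), Denoiser()
opt = torch.optim.Adam(list(sampler.parameters()) + list(denoiser.parameters()))
feats = torch.rand(1, 8, 64, 64)        # auxiliary features (albedo, normals, ...)
reference = torch.rand(1, 3, 64, 64)    # high-spp reference frame
prev_frame = torch.rand(1, 3, 64, 64)   # previous denoised frame (motion-warped)

density = sampler(feats)
output = denoiser(render(reference, density), density)
spatial = (output - reference).abs().mean()
temporal = (output - prev_frame).abs().mean()  # penalize frame-to-frame flicker
(spatial + 0.1 * temporal).backward()
opt.step()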

Self-Supervised Viewpoint Learning From Image Collections

Training deep neural networks to estimate the viewpoint of objects requires large labeled training datasets. However, manually labeling viewpoints is notoriously hard, error-prone, and time-consuming. On the other hand, it is relatively easy to mine many unlabeled images of an object category from the internet, e.g., of cars or faces. We investigate whether such unlabeled collections of in-the-wild images can be used to successfully train viewpoint estimation networks for general object categories purely via self-supervision.
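
One self-supervision signal that makes this possible is cycle consistency between a viewpoint predictor and a viewpoint-conditioned image generator: the predictor should recover the viewpoint from which the generator synthesized an image. Below is a minimal PyTorch sketch of that loop; the tiny networks and the two-angle viewpoint parameterization are illustrative stand-ins, not the paper's models.

import torch
import torch.nn as nn

class ViewpointNet(nn.Module):
    """Predicts (azimuth, elevation) in radians from an image."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, 2))
    def forward(self, img):
        return self.backbone(img)

class ViewConditionedGenerator(nn.Module):
    """Synthesizes an image of the object seen from a given viewpoint."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(2, 16 * 8 * 8)
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=4), nn.Conv2d(16, 3, 3, padding=1))
    def forward(self, view):
        return self.up(self.fc(view).view(-1, 16, 8, 8))

vp_net, gen = ViewpointNet(), ViewConditionedGenerator()
images = torch.rand(4, 3, 32, 32)             # unlabeled in-the-wild images

view = vp_net(images)                         # predict a viewpoint
synth = gen(view)                             # re-synthesize at that viewpoint
recon_loss = (synth - images).abs().mean()    # image reconstruction

rand_view = torch.rand(4, 2) * 3.14           # sample a random viewpoint
cycle_loss = (vp_net(gen(rand_view)) - rand_view).pow(2).mean()
loss = recon_loss + cycle_loss                # no viewpoint labels needed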

Tucker Hermans

Tucker Hermans is a senior research scientist at NVIDIA working on robotic manipulation, focusing on the interaction of multi-sensory perception, learning, and control. Tucker is also an associate professor in the School of Computing at the University of Utah, where he is a member of the Utah Robotics Center. Tucker earned his Ph.D. in Robotics from the Georgia Institute of Technology.

Convolutional Tensor-Train LSTM for Spatio-Temporal Learning

Learning from spatio-temporal data has numerous applications such as human-behavior analysis, object tracking, video compression, and physics simulation. However, existing methods still perform poorly on challenging video tasks such as long-term forecasting. This gap exists partly because such tasks require learning long-term spatio-temporal correlations in the video sequence. We propose a higher-order convolutional LSTM model that can efficiently learn these correlations with a succinct representation of the history.
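
To convey the flavor of the model, the PyTorch sketch below extends a ConvLSTM cell so its gates depend on a sliding window of past hidden states, compressed by a chain of small low-rank convolutions as a loose stand-in for the paper's tensor-train factorization. The window size, rank, and shapes are illustrative.

import torch
import torch.nn as nn

class HigherOrderConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, order=3, rank=8, k=3):
        super().__init__()
        self.x_conv = nn.Conv2d(in_ch, 4 * hid_ch, k, padding=k // 2)
        # Chain of low-rank "cores": project each past state to a small
        # rank, mix, then expand back -- echoing a tensor-train contraction.
        self.down = nn.ModuleList(
            nn.Conv2d(hid_ch, rank, k, padding=k // 2) for _ in range(order))
        self.mix = nn.Conv2d(rank, rank, k, padding=k // 2)
        self.up = nn.Conv2d(rank, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, history, c):
        # history: list of the last `order` hidden states (oldest first)
        z = 0
        for core, h in zip(self.down, history):
            z = self.mix(z + core(h))      # sequentially contract the window
        gates = self.x_conv(x) + self.up(z)
        i, f, g, o = gates.chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

cell = HigherOrderConvLSTMCell(in_ch=3, hid_ch=16)
x = torch.rand(1, 3, 32, 32)
hist = [torch.zeros(1, 16, 32, 32) for _ in range(3)]
h, c = cell(x, hist, torch.zeros(1, 16, 32, 32))
hist = hist[1:] + [h]                      # slide the history window forward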

Toward Sim-to-Real Directional Semantic Grasping

We address the problem of directional semantic grasping, that is, grasping a specific object from a specific direction. We approach the problem using deep reinforcement learning via a double deep Q-network (DDQN) that learns to map downsampled RGB input images from a wrist-mounted camera to Q-values, which are then translated into Cartesian robot control commands via the cross-entropy method (CEM). The network is trained entirely on simulated data generated by a custom robot simulator that models both physical reality (contacts) and perceptual quality (high-quality rendering).
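
As a sketch of the control loop, the PyTorch snippet below scores (image, action) pairs with a Q-network and searches the continuous Cartesian action space with CEM, then forms a double-DQN target. The network, the 6-DoF action parameterization, and the CEM settings are illustrative guesses, not the paper's exact configuration.

import torch
import torch.nn as nn

class QNet(nn.Module):
    """Scores a candidate Cartesian command against the current image."""
    def __init__(self, act_dim=6):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Sequential(nn.Linear(16 + act_dim, 64), nn.ReLU(),
                                  nn.Linear(64, 1))
    def forward(self, img, act):
        feat = self.enc(img).expand(act.shape[0], -1)  # share image features
        return self.head(torch.cat([feat, act], dim=1)).squeeze(-1)

def cem_select(qnet, img, act_dim=6, pop=64, elites=8, iters=3):
    # Fit a Gaussian over actions to the top-scoring candidates, iteratively.
    mu, sigma = torch.zeros(act_dim), torch.ones(act_dim)
    for _ in range(iters):
        acts = mu + sigma * torch.randn(pop, act_dim)
        q = qnet(img, acts)
        top = acts[q.topk(elites).indices]
        mu, sigma = top.mean(0), top.std(0) + 1e-3
    return mu                               # best Cartesian command found

qnet, target_qnet = QNet(), QNet()
img = torch.rand(1, 3, 64, 64)              # downsampled wrist-camera RGB
action = cem_select(qnet, img)

# Double-DQN target: the online net picks the next action, the target net
# scores it, which reduces Q-value overestimation.
next_img, reward, gamma = torch.rand(1, 3, 64, 64), torch.tensor(1.0), 0.99
next_act = cem_select(qnet, next_img).unsqueeze(0)
td_target = reward + gamma * target_qnet(next_img, next_act).detach()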

Camera-to-Robot Pose Estimation from a Single Image

We present an approach for estimating the pose of an external camera with respect to a robot using a single RGB image of the robot. The image is processed by a deep neural network to detect 2D projections of keypoints (such as joints) associated with the robot. The network is trained entirely on simulated data using domain randomization to bridge the reality gap. Perspective-n-point (PnP) is then used to recover the camera extrinsics, assuming that the camera intrinsics and joint configuration of the robot manipulator are known.
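
The PnP step is standard enough to show concretely. In the NumPy/OpenCV sketch below, the 3D keypoints would come from the robot's forward kinematics at the known joint configuration and the 2D points from the network's detections; here the 2D points are fabricated by projecting a ground-truth pose so the example is self-consistent.

import numpy as np
import cv2

# 3D keypoints in the robot base frame (meters); in practice these come
# from forward kinematics at the known joint configuration.
points_3d = np.array([[0.00, 0.00, 0.10],
                      [0.30, 0.00, 0.40],
                      [0.50, 0.20, 0.60],
                      [0.55, -0.10, 0.80],
                      [0.60, 0.30, 0.90],
                      [0.20, 0.25, 0.50]])

# Known camera intrinsics; no lens distortion assumed.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])

# Ground-truth pose used only to fabricate consistent 2D "detections";
# in practice points_2d are the network's keypoint predictions (pixels).
rvec_gt = np.array([0.1, 0.2, 0.3])
tvec_gt = np.array([0.05, -0.10, 1.50])
points_2d, _ = cv2.projectPoints(points_3d, rvec_gt, tvec_gt, K, None)
points_2d = points_2d.reshape(-1, 2)

# Recover the camera extrinsics (robot base frame -> camera frame).
ok, rvec, tvec = cv2.solvePnP(points_3d, points_2d, K, distCoeffs=None)
R, _ = cv2.Rodrigues(rvec)
print(ok, rvec.ravel(), tvec.ravel())   # should match rvec_gt, tvec_gt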

Jean Kossaifi

Jean Kossaifi is a Senior Research Scientist at NVIDIA. Prior to this, he was a Research Scientist at the Samsung AI Center in Cambridge. He has worked extensively on face analysis and facial affect estimation in naturalistic conditions, a field which bridges the gap between computer vision and machine learning. His current focus is tensor methods for machine learning.