Self-Supervised Viewpoint Learning From Image Collections

Training deep neural networks to estimate the viewpoint of objects requires large labeled training datasets. However, manually labeling viewpoints is notoriously hard, error-prone, and time-consuming. On the other hand, it is relatively easy to mine many unlabeled images of an object category from the internet, e.g., of cars or faces. We ask whether such unlabeled collections of in-the-wild images can be used to train viewpoint estimation networks for general object categories purely via self-supervision.

Tucker Hermans

Tucker Hermans is a senior research scientist at NVIDIA working on robotic manipulation, focusing on the interaction of multi-sensory perception, learning, and control. Tucker is also an associate professor in the School of Computing at the University of Utah, where he is a member of the Utah Robotics Center. Tucker earned his Ph.D. in Robotics from the Georgia Institute of Technology.

Convolutional Tensor-Train LSTM for Spatio-Temporal Learning

Learning from spatio-temporal data has numerous applications such as human-behavior analysis, object tracking, video compression, and physics simulation. However, existing methods still perform poorly on challenging video tasks such as long-term forecasting. This gap exists in part because such tasks require learning long-term spatio-temporal correlations in the video sequence. We propose a higher-order convolutional LSTM model that can efficiently learn these correlations with a succinct representation of the history.
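The proposed tensor-train factorization over a window of past hidden states is not reproduced here, but a minimal sketch of the base convolutional LSTM cell that such higher-order models extend (assuming PyTorch; the class name `ConvLSTMCell` is illustrative, not from the paper) might look like:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell: gates are computed with 2D
    convolutions instead of the dense layers of a standard LSTM."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # One convolution produces all four gates (i, f, o, g) at once.
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels,
                               kernel_size, padding=padding)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)),
                                 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)   # update cell state
        h = o * torch.tanh(c)           # update hidden state
        return h, c

# Usage: one step over a batch of 64x64 feature maps.
cell = ConvLSTMCell(in_channels=3, hidden_channels=16)
x = torch.randn(8, 3, 64, 64)
h = torch.zeros(8, 16, 64, 64)
c = torch.zeros(8, 16, 64, 64)
h, c = cell(x, (h, c))
```

A higher-order variant replaces the single previous hidden state with a compressed summary of several past hidden states, which is where the tensor-train representation comes in.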

Toward Sim-to-Real Directional Semantic Grasping

We address the problem of directional semantic grasping, that is, grasping a specific object from a specific direction. We approach the problem using deep reinforcement learning via a double deep Q-network (DDQN) that learns to map downsampled RGB input images from a wrist-mounted camera to Q-values, which are then translated into Cartesian robot control commands via the cross-entropy method (CEM). The network is trained entirely on simulated data generated by a custom robot simulator that models both physical reality (contacts) and perceptual quality (high-quality rendering).
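The abstract does not give implementation details, but the CEM step can be illustrated with a small sketch of how CEM picks a continuous action that maximizes a learned Q-function (the function `q_fn` and the action dimensionality are placeholders, not from the paper):

```python
import numpy as np

def cem_select_action(q_fn, action_dim, iters=3, pop=64, elite_frac=0.1):
    """Cross-entropy method: iteratively sample candidate actions from a
    Gaussian, keep the highest-Q elites, and refit the Gaussian to them."""
    mean = np.zeros(action_dim)
    std = np.ones(action_dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        candidates = np.random.randn(pop, action_dim) * std + mean
        scores = q_fn(candidates)                 # Q-value per candidate
        elites = candidates[np.argsort(scores)[-n_elite:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean  # action with (approximately) the highest Q-value

# Toy usage with a quadratic stand-in for the learned Q-network.
best = cem_select_action(lambda a: -np.sum((a - 0.5) ** 2, axis=1),
                         action_dim=6)
```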

Camera-to-Robot Pose Estimation from a Single Image

We present an approach for estimating the pose of an external camera with respect to a robot using a single RGB image of the robot. The image is processed by a deep neural network to detect 2D projections of keypoints (such as joints) associated with the robot. The network is trained entirely on simulated data using domain randomization to bridge the reality gap. Perspective-n-point (PnP) is then used to recover the camera extrinsics, assuming that the camera intrinsics and joint configuration of the robot manipulator are known.
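With known camera intrinsics, 3D keypoint positions obtained from the robot's forward kinematics, and the detected 2D projections, the PnP step can be sketched with OpenCV (the array contents below are placeholders, not values from the paper):

```python
import numpy as np
import cv2

# 3D keypoint positions in the robot base frame, e.g. joint origins
# computed from forward kinematics (placeholder values).
points_3d = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.3], [0.2, 0.0, 0.5],
                      [0.4, 0.1, 0.5], [0.5, 0.1, 0.4], [0.5, 0.2, 0.2]],
                     dtype=np.float64)

# Corresponding 2D detections from the keypoint network (pixels).
points_2d = np.array([[320.0, 400.0], [318.0, 300.0], [390.0, 240.0],
                      [455.0, 235.0], [480.0, 260.0], [495.0, 320.0]],
                     dtype=np.float64)

# Known camera intrinsics and (assumed) zero lens distortion.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)

# Solve for the pose of the robot base frame in the camera frame.
ok, rvec, tvec = cv2.solvePnP(points_3d, points_2d, K, dist)
R, _ = cv2.Rodrigues(rvec)   # rotation matrix from the axis-angle vector
print(ok, R, tvec)
```

Inverting the resulting transform gives the camera pose with respect to the robot base.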

Jean Kossaifi

Jean Kossaifi leads research at NVIDIA in the field of AI for Scientific Simulation, where he advances new algorithmic paradigms to solve complex physics-based problems. His core research focuses on fundamental algorithms, including combining tensor methods with deep learning, to develop efficient and powerful neural architectures.

Wen-mei Hwu

Wen-mei Hwu joined NVIDIA in February 2020 as Senior Distinguished Research Scientist, after spending 32 years at the University of Illinois at Urbana-Champaign, where he was a Professor, Sanders-AMD Endowed Chair, Acting Department Head and Chief Scientist of the Parallel Computing Institute. Hwu and his Illinois team developed the superblock compiler scheduling and optimization framework that has been adopted by virtually all modern vendor and open-source compilers today.  In 2008, Hwu became the director of NVIDIA's first CUDA Center of Excellence.

Hang Su

Hang Su is a research scientist in the Learning and Perception Research (LPR) team of NVIDIA Research. He completed his Ph.D. study in the Computer Vision Lab at UMass Amherst, advised by Prof. Erik Learned-Miller. He obtained his master's degree in Computer Science from Brown University and his bachelor's degree in Intelligent Science and Technology from Peking University.

Toward Standardized Classification of Foveated Displays

In head-mounted display design, there is a growing desire to leverage the limitations of the human visual system to reduce the computation, communication, and display workload in power- and form-factor-constrained systems. Fundamental to this reduced workload is the ability to match display resolution to the acuity of the human visual system, along with a resulting need to follow the gaze of the eye as it moves, a process referred to as foveation.
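As a rough illustration of why matching resolution to acuity saves work, the sketch below uses a simple linear acuity-falloff model common in the foveated-rendering literature; the parameter values are illustrative assumptions, not numbers from the paper:

```python
def min_angle_of_resolution(ecc_deg, mar0=1.0 / 60.0, slope=0.022):
    """Approximate minimum angle of resolution (degrees) as a linear
    function of eccentricity: acuity is sharpest at the fovea and
    degrades toward the periphery. Parameter values are illustrative."""
    return mar0 + slope * ecc_deg

# Required angular resolution drops quickly away from the gaze point.
for ecc in [0, 5, 10, 20, 40]:
    mar = min_angle_of_resolution(ecc)
    print(f"eccentricity {ecc:2d} deg: ~{1.0 / mar:5.1f} pixels/degree needed")
```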

Two-shot Spatially-varying BRDF and Shape Estimation

Capturing the shape and spatially-varying appearance (SVBRDF) of an object from images is a challenging task that has applications in both computer vision and graphics. Traditional optimization-based approaches often need a large number of images taken from multiple views in a controlled environment. Newer deep learning-based approaches require only a few input images, but the reconstruction quality is not on par with optimization techniques. We propose a novel deep learning architecture with a stage-wise estimation of shape and SVBRDF.
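A minimal sketch of what a stage-wise pipeline of this kind can look like, assuming two input images and a shape stage whose output conditions the SVBRDF stage (module names, channel counts, and map choices are hypothetical, not the paper's architecture):

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU())

class StageWiseEstimator(nn.Module):
    """Stage 1 predicts shape (a normal map); stage 2 conditions on the
    input images plus the predicted normals to estimate SVBRDF maps."""

    def __init__(self):
        super().__init__()
        # Two RGB input images stacked along the channel dimension.
        self.shape_net = nn.Sequential(conv_block(6, 32), conv_block(32, 32),
                                       nn.Conv2d(32, 3, 3, padding=1))
        # Diffuse albedo (3) + specular albedo (3) + roughness (1).
        self.brdf_net = nn.Sequential(conv_block(9, 32), conv_block(32, 32),
                                      nn.Conv2d(32, 7, 3, padding=1))

    def forward(self, images):
        normals = torch.tanh(self.shape_net(images))
        brdf = self.brdf_net(torch.cat([images, normals], dim=1))
        diffuse, specular, roughness = brdf[:, :3], brdf[:, 3:6], brdf[:, 6:]
        return normals, diffuse, specular, roughness

# Two 256x256 RGB images stacked along the channel dimension.
model = StageWiseEstimator()
normals, diffuse, specular, roughness = model(torch.randn(1, 6, 256, 256))
```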