Camera-to-Robot Pose Estimation from a Single Image

We present an approach for estimating the pose of an external camera with respect to a robot using a single RGB image of the robot. The image is processed by a deep neural network to detect 2D projections of keypoints (such as joints) associated with the robot. The network is trained entirely on simulated data using domain randomization to bridge the reality gap. Perspective-n-point (PnP) is then used to recover the camera extrinsics, assuming that the camera intrinsics and joint configuration of the robot manipulator are known.
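A minimal sketch of the PnP step is shown below, assuming OpenCV and NumPy; the keypoint detector and the forward kinematics that yield the 3D keypoint positions in the robot base frame are taken as given, and the function name and solver flag are illustrative rather than the paper's code.

```python
import cv2
import numpy as np

def estimate_camera_pose(keypoints_2d, keypoints_3d, K, dist_coeffs=None):
    """Recover camera extrinsics from detected robot keypoints.

    keypoints_2d: (N, 2) pixel coordinates predicted by the network.
    keypoints_3d: (N, 3) keypoint positions in the robot base frame,
                  obtained from the known joint configuration via
                  forward kinematics.
    K:            (3, 3) camera intrinsic matrix (assumed known).
    """
    if dist_coeffs is None:
        dist_coeffs = np.zeros(5)  # assume an undistorted image
    ok, rvec, tvec = cv2.solvePnP(
        keypoints_3d.astype(np.float64),
        keypoints_2d.astype(np.float64),
        K.astype(np.float64),
        dist_coeffs,
        flags=cv2.SOLVEPNP_EPNP,
    )
    if not ok:
        raise RuntimeError("PnP did not find a solution")
    R, _ = cv2.Rodrigues(rvec)  # rotation from robot base frame to camera frame
    return R, tvec              # extrinsics [R | t]
```

In practice one would typically wrap this in a robust estimator such as cv2.solvePnPRansac to tolerate occasional keypoint misdetections.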

Jean Kossaifi

Jean Kossaifi leads research at NVIDIA in the field of AI for Scientific Simulation, where he advances new algorithmic paradigms to solve complex physics-based problems. His core research focuses on fundamental algorithms, such as the combination of tensor methods with deep learning, to develop efficient and powerful neural architectures.

Wen-mei Hwu

Wen-mei Hwu joined NVIDIA in February 2020 as Senior Distinguished Research Scientist, after spending 32 years at the University of Illinois at Urbana-Champaign, where he was a Professor, Sanders-AMD Endowed Chair, Acting Department Head, and Chief Scientist of the Parallel Computing Institute. Hwu and his Illinois team developed the superblock compiler scheduling and optimization framework that has been adopted by virtually all modern vendor and open-source compilers. In 2008, Hwu became the director of NVIDIA's first CUDA Center of Excellence.

Hang Su

Hang Su is a research scientist on the Learning and Perception Research (LPR) team at NVIDIA Research. He completed his Ph.D. in the Computer Vision Lab at UMass Amherst, advised by Prof. Erik Learned-Miller. He obtained his master's degree in Computer Science from Brown University and his bachelor's degree in Intelligent Science and Technology from Peking University.

Toward Standardized Classification of Foveated Displays

Head-mounted display design increasingly seeks to exploit the limitations of the human visual system to reduce the computation, communication, and display workload in power- and form-factor-constrained systems. Fundamental to this reduced workload is the ability to match display resolution to the acuity of the human visual system, along with the resulting need to follow the gaze of the eye as it moves, a process referred to as foveation.
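As a rough illustration of matching resolution to acuity, the sketch below uses a simple linear model in which the minimum angle of resolution (MAR) grows with eccentricity from the gaze point; the foveal MAR and slope values are illustrative placeholders, not parameters from this work.

```python
import numpy as np

def min_angle_of_resolution(eccentricity_deg, mar0=1.0, slope=0.3):
    """Illustrative linear acuity model: minimum angle of resolution
    (arcminutes) as a function of eccentricity from the gaze point.
    mar0 (foveal MAR) and slope are placeholder values."""
    return mar0 + slope * np.asarray(eccentricity_deg)

def required_ppd(eccentricity_deg):
    """Pixels per degree needed to match acuity. The finest resolvable
    grating has a half-period of one MAR, so Nyquist sampling gives
    ppd = 60 / MAR (with MAR in arcminutes)."""
    return 60.0 / min_angle_of_resolution(eccentricity_deg)

# Roughly 60 ppd is needed at the fovea, far less at 20 degrees out.
print(required_ppd(0.0), required_ppd(20.0))
```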

Two-shot Spatially-varying BRDF and Shape Estimation

Capturing the shape and spatially-varying appearance (SVBRDF) of an object from images is a challenging task that has applications in both computer vision and graphics. Traditional optimization-based approaches often need a large number of images taken from multiple views in a controlled environment. Newer deep learning-based approaches require only a few input images, but the reconstruction quality is not on par with optimization techniques. We propose a novel deep learning architecture with a stage-wise estimation of shape and SVBRDF.
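A minimal sketch of what such a stage-wise architecture could look like in PyTorch is shown below: a first stage predicts shape (here, surface normals) from the two input photographs, and a second stage predicts the SVBRDF maps conditioned on the images and the predicted shape. The two-shot flash/no-flash input, the tiny convolutional stages, and all channel counts are assumptions for illustration, not the paper's actual network.

```python
import torch
import torch.nn as nn

class Stage(nn.Module):
    """Small convolutional block standing in for one estimation stage."""
    def __init__(self, in_ch, out_ch, width=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, out_ch, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

class TwoShotSVBRDFNet(nn.Module):
    """Stage-wise estimation: shape first, then reflectance conditioned
    on the predicted shape."""
    def __init__(self):
        super().__init__()
        # Two RGB images (e.g., flash / no-flash) -> 6 input channels.
        self.shape_stage = Stage(in_ch=6, out_ch=3)   # surface normals
        # Images + predicted normals -> diffuse (3) + specular (3) + roughness (1).
        self.brdf_stage = Stage(in_ch=6 + 3, out_ch=7)

    def forward(self, img_a, img_b):
        x = torch.cat([img_a, img_b], dim=1)
        normals = torch.tanh(self.shape_stage(x))
        brdf = self.brdf_stage(torch.cat([x, normals], dim=1))
        diffuse, specular, roughness = brdf.split([3, 3, 1], dim=1)
        return normals, diffuse, specular, roughness

# Smoke test on random inputs.
net = TwoShotSVBRDFNet()
outputs = net(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256))
```

Feeding the first stage's output into the second is the essence of the stage-wise design: the reflectance estimate can then explain shading that the shape estimate has already accounted for.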

Novel View Synthesis of Dynamic Scenes with Globally Coherent Depths

This paper presents a new method to synthesize an image at an arbitrary viewpoint and time given a collection of images of a dynamic scene. A key challenge for the synthesis arises from dynamic scene reconstruction: epipolar geometry does not apply to the local motion of dynamic content. Our insight is that although its scale and quality are inconsistent with other views, depth estimated from a single view can be used to reason about the geometry of this local motion.
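One way to act on that insight is to align each frame's single-view depth to metrically consistent multi-view depth on static regions before using it for the dynamic content. The least-squares scale-and-shift fit below is a minimal sketch of that idea; the function name, inputs, and the affine alignment model are assumptions, not the paper's exact procedure.

```python
import numpy as np

def align_mono_depth(mono_depth, mvs_depth, static_mask):
    """Fit a per-image scale and shift so single-view depth agrees with
    multi-view depth on static pixels (ordinary least squares).

    mono_depth:  (H, W) network-predicted depth, scale-ambiguous.
    mvs_depth:   (H, W) metrically consistent depth, valid where
                 static_mask is True (e.g., triangulated background).
    """
    d = mono_depth[static_mask].ravel()
    t = mvs_depth[static_mask].ravel()
    A = np.stack([d, np.ones_like(d)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, t, rcond=None)
    # Apply the fit everywhere, including the dynamic pixels that
    # multi-view geometry cannot constrain.
    return scale * mono_depth + shift
```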

Prescription AR: a fully-customized prescription-embedded augmented reality display

In this paper, we present a fully-customized AR display design that considers the user's prescription, interpupillary distance, and fashion preferences. A free-form image combiner embedded inside the prescription lens overlays augmented images onto the vision-corrected real world. The optics were optimized for each prescription level in a way that can reduce mass-production cost while satisfying the user's preferences. A foveated optimization method was applied, which distributes the pixels in accordance with human visual acuity.
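To give a feel for how a pixel budget might be distributed according to acuity, the sketch below splits a fixed number of pixels across concentric eccentricity bands, weighting each band by its area times the squared acuity a simple linear MAR model predicts (pixel density scales with pixels-per-degree squared). All constants and band edges are illustrative, not values from the paper.

```python
import numpy as np

def allocate_pixels(total_pixels, band_edges_deg, mar0=1.0, slope=0.3):
    """Split a pixel budget across concentric eccentricity bands.
    Each band is weighted by its angular area times the squared
    acuity (1 / MAR)^2 at its center, since required pixel density
    scales with ppd^2. All constants are illustrative placeholders."""
    edges = np.asarray(band_edges_deg, dtype=float)
    centers = 0.5 * (edges[:-1] + edges[1:])
    areas = np.pi * (edges[1:]**2 - edges[:-1]**2)  # annulus area, deg^2
    acuity = 1.0 / (mar0 + slope * centers)         # linear MAR model
    w = areas * acuity**2
    return total_pixels * w / w.sum()

# e.g., a 30-degree half-field split into three bands
print(allocate_pixels(2_000_000, [0, 5, 15, 30]))
```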

Patch scanning displays: spatiotemporal enhancement for displays

Emerging fields of mixed reality and electronic sports demand greater spatial and temporal resolution in displays. We introduce a novel scanning display method that enhances the spatiotemporal qualities of displays. Specifically, we demonstrate that scanning multiple image patches that represent basis functions of each block in a target image can help synthesize spatiotemporally enhanced visuals. To discover the right image patches, we introduce an optimization framework tailored to our hardware.
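As a simplified stand-in for that optimization, the sketch below derives a small set of shared basis patches for an image's blocks via truncated SVD, so each block is reproduced as a weighted sum of patches shown across subframes. The shared (rather than per-block) basis and the use of SVD are assumptions for illustration; the paper's framework is tailored to its hardware.

```python
import numpy as np

def patch_basis(image, block=8, rank=4):
    """Compute rank-r basis patches for the blocks of a grayscale image
    via truncated SVD (an illustrative stand-in for a hardware-tailored
    optimization). Returns (basis, coeffs) such that block i is
    approximated by coeffs[i] @ basis over `rank` subframes."""
    H, W = image.shape
    bh, bw = H // block, W // block
    # Flatten every block-sized patch into one row of a data matrix.
    blocks = (image[:bh * block, :bw * block]
              .reshape(bh, block, bw, block)
              .transpose(0, 2, 1, 3)
              .reshape(bh * bw, block * block))
    U, S, Vt = np.linalg.svd(blocks, full_matrices=False)
    basis = Vt[:rank].reshape(rank, block, block)  # shared basis patches
    coeffs = U[:, :rank] * S[:rank]                # per-block weights
    return basis, coeffs

# Reconstruction check on a random image: coeffs @ flattened basis
img = np.random.rand(64, 64)
basis, coeffs = patch_basis(img)
approx_blocks = coeffs @ basis.reshape(len(basis), -1)
```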