Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models

We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation. Text-to-image diffusion models have the remarkable ability to generate high-quality images with diverse open-vocabulary language descriptions. This demonstrates that their internal representation space is highly correlated with open concepts in the real world. Text-image discriminative models like CLIP, on the other hand, are good at classifying images into open-vocabulary labels.
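
To make the pairing concrete, here is a minimal, hypothetical sketch of the recipe the abstract describes: per-mask features are pooled from a frozen diffusion backbone and scored against CLIP text embeddings of an arbitrary label list. The `diffusion_features` and `clip_text_embed` functions below are stand-ins for the two frozen pre-trained models, not ODISE's actual interfaces.

```python
# Minimal sketch of the ODISE idea, not the authors' implementation:
# pool features from a frozen diffusion backbone inside each mask
# proposal, then score them against CLIP text embeddings of an
# open vocabulary. Both feature extractors are hypothetical stand-ins.
import torch
import torch.nn.functional as F

def diffusion_features(image: torch.Tensor) -> torch.Tensor:
    """Stand-in for internal UNet features of a frozen text-to-image
    diffusion model, shape (B, C, H, W)."""
    return torch.randn(image.shape[0], 512, 64, 64)

def clip_text_embed(class_names: list[str]) -> torch.Tensor:
    """Stand-in for CLIP text embeddings of the label vocabulary (K, C)."""
    return F.normalize(torch.randn(len(class_names), 512), dim=-1)

def open_vocab_logits(image, masks, class_names):
    feats = diffusion_features(image)               # (B, C, H, W)
    masks = F.interpolate(masks, feats.shape[-2:])  # (B, N, H, W)
    # Mask-pool: average backbone features inside each mask proposal.
    pooled = torch.einsum("bchw,bnhw->bnc", feats, masks)
    pooled = pooled / masks.sum(dim=(-2, -1)).clamp(min=1e-6)[..., None]
    pooled = F.normalize(pooled, dim=-1)
    text = clip_text_embed(class_names)             # (K, C)
    return pooled @ text.T                          # (B, N, K) class scores

image = torch.randn(1, 3, 512, 512)
masks = torch.rand(1, 5, 64, 64)  # N candidate mask proposals
logits = open_vocab_logits(image, masks, ["cat", "dog", "traffic light"])
```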

GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation

We present the Group Propagation Vision Transformer (GPViT): a novel non-hierarchical (i.e., non-pyramidal) transformer model designed for general visual recognition with high-resolution features. High-resolution features (or tokens) are a natural fit for tasks that involve perceiving fine-grained details, such as detection and segmentation, but exchanging global information between these features is expensive in memory and computation because self-attention scales quadratically with the number of tokens. We provide a highly efficient alternative, the Group Propagation Block (GP Block), to exchange global information.
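
As an illustration of the idea (a hedged sketch, not the paper's exact module), a GP-style block can be written as cross-attention in both directions between a small set of learned group tokens and the full token map, so global exchange costs O(N·M) for M groups rather than O(N²):

```python
# Hedged sketch of a Group Propagation style block: a small set of
# learned group tokens gathers information from all image tokens,
# mixes it, and scatters it back, so no full NxN self-attention over
# high-resolution tokens is ever formed.
import torch
import torch.nn as nn

class GPBlockSketch(nn.Module):
    def __init__(self, dim: int = 256, num_groups: int = 64):
        super().__init__()
        self.group_tokens = nn.Parameter(torch.randn(num_groups, dim))
        self.gather = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.mix = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim),
                                 nn.GELU(), nn.Linear(dim, dim))
        self.scatter = nn.MultiheadAttention(dim, 8, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # (B, N, C)
        groups = self.group_tokens.expand(tokens.size(0), -1, -1)
        # Grouping: M group queries attend over N tokens -> O(N*M).
        groups, _ = self.gather(groups, tokens, tokens)
        groups = groups + self.mix(groups)  # exchange info among groups
        # Ungrouping: each token queries the updated groups -> O(N*M).
        update, _ = self.scatter(tokens, groups, groups)
        return tokens + update

tokens = torch.randn(2, 56 * 56, 256)  # high-resolution token map
out = GPBlockSketch()(tokens)
```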

Less is More: Rendering for Esports

Computer graphics has advanced from early wireframes to ray tracing and physically based rendering. Yet nearly 20 years ago, George Lucas stated that “the real leap has been made,” and today, esports players turn off many of the rendering techniques that took SIGGRAPH so long to develop, because they don't help them win. Is it time for SIGGRAPH to reconsider its research goals? This workshop will discuss this question, including alternatives to photorealism, trading off temporal and visual accuracy, and trading off realism with gameplay and fairness.

Mouse Sensitivity in First-person Targeting Tasks

Mouse sensitivity in first-person targeting tasks is a highly debated issue. Recommendations within a single game can vary by a factor of ten or more, and sensitivity remains an active topic of experimentation in both competitive and recreational esports communities. Inspired by work in pointer-based gain optimization, and extending our previous results from the first user study focused on mouse sensitivity in first-person targeting tasks [1], we describe a range of optimal mouse sensitivities within which players perform statistically significantly better in both task completion time and throughput.
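
For reference, throughput in pointing studies is conventionally the index of difficulty divided by movement time, in bits per second (the ISO 9241-9 / Fitts'-law formulation). Whether this paper computes throughput exactly this way is an assumption; a minimal version of the standard computation looks like:

```python
# Illustrative only: the standard Fitts'-law style throughput used in
# pointing studies (ISO 9241-9). That this matches the paper's metric
# is an assumption, and the numbers below are made up.
import math

def index_of_difficulty(distance: float, width: float) -> float:
    """Shannon formulation of the index of difficulty, in bits."""
    return math.log2(distance / width + 1.0)

def throughput(distance: float, width: float, movement_time_s: float) -> float:
    """Bits per second for one target acquisition."""
    return index_of_difficulty(distance, width) / movement_time_s

# A target 30 units away with 2-unit width, acquired in 0.45 s:
print(f"{throughput(30.0, 2.0, 0.45):.2f} bits/s")  # ID = log2(16) = 4 bits
```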

VaPr: Variable-Precision Tensors to Accelerate Robot Motion Planning

High-dimensional motion generation requires numerical precision for smooth, collision-free solutions. Typically, double-precision or single-precision floating-point (FP) formats are used for accurate results. Applying these formats to large tensors strains the memory bandwidth of the device and inflates the memory footprint, limiting their applicability to the low-power edge devices needed for mobile robots. Uniformly applying reduced precision can help, but it severely degrades solution quality.
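
One way to picture variable precision (a simplified sketch of the general idea, not VaPr's actual algorithm) is a per-tensor choice that drops to FP16 only when the round-trip error stays within a tolerance, trading memory footprint against numerical accuracy:

```python
# A minimal sketch of variable-precision storage (an assumption about
# the approach, not VaPr's algorithm): keep a tensor in FP16 only if
# the FP16 round-trip error is within tolerance, otherwise stay FP32.
import torch

def to_variable_precision(t: torch.Tensor, tol: float = 1e-3) -> torch.Tensor:
    half = t.to(torch.float16)
    err = (half.to(torch.float32) - t).abs().max()
    return half if err <= tol else t

coarse_map = torch.rand(1024, 1024)       # small-magnitude values
fine_map = torch.rand(1024, 1024) * 1e4   # large-magnitude values
print(to_variable_precision(coarse_map).dtype)  # float16: error tolerable
print(to_variable_precision(fine_map).dtype)    # float32: error too large
```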

A distributed, decoupled system for losslessly streaming dynamic light probes to thin clients

We present a networked, high-performance graphics system that combines dynamic, high-quality, ray traced global illumination computed on a server with direct illumination and primary visibility computed on a client. This approach provides many of the image quality benefits of real-time ray tracing on low-power and legacy hardware, while maintaining a low latency response and mobile form factor.
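
A toy sketch of the client-side split the abstract describes (the real system is a full renderer; the names and probe layout here are hypothetical): the client shades direct lighting locally every frame and adds indirect light fetched from the nearest server-streamed probe.

```python
# Toy illustration of the server/client split: the server streams
# per-probe irradiance values; the client computes direct (Lambertian)
# lighting itself and adds indirect light from the nearest probe.
import numpy as np

def shade(albedo, normal, light_dir, light_color, probe_grid, position):
    # Direct illumination: computed on the client every frame.
    n_dot_l = max(float(np.dot(normal, light_dir)), 0.0)
    direct = albedo * light_color * n_dot_l
    # Indirect illumination: nearest streamed probe, updated by the server.
    idx = tuple(np.clip(np.round(position).astype(int), 0,
                        np.array(probe_grid.shape[:3]) - 1))
    indirect = albedo * probe_grid[idx]
    return direct + indirect

probes = np.full((4, 4, 4, 3), 0.1)  # 4x4x4 grid of RGB irradiance values
rgb = shade(albedo=np.array([0.8, 0.7, 0.6]),
            normal=np.array([0.0, 1.0, 0.0]),
            light_dir=np.array([0.0, 1.0, 0.0]),
            light_color=np.array([1.0, 1.0, 1.0]),
            probe_grid=probes, position=np.array([1.2, 0.5, 2.8]))
```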

Ryo Hachiuma

Ryo Hachiuma is a Research Scientist at NVIDIA Research Taiwan, working on multi-modal AI. He received his Ph.D. from Keio University, advised by Prof. Hideo Saito. Before joining NVIDIA Research, he was a computer vision engineer at Konica Minolta, Inc. in Japan, working on human action recognition. His research interests center on human activity analysis from multi-sensory data (e.g., audio-visual and audio-visual-language).


Ed Schmerling

Ed Schmerling is a Research Scientist in the Autonomous Vehicle Research Group at NVIDIA. His main research interests are in the modeling and development of intelligent data-driven agents through advances in generative modeling, uncertainty quantification, and optimal control, with applications in simulation, behavior planning, and safety assurance. Prior to joining NVIDIA, he served as Associate Director of the Autonomous Systems Laboratory at Stanford University and previously worked as an AV researcher at Waymo. He received his Ph.D. from Stanford University.

AI-Mediated 3D Video Conferencing

We present an AI-mediated 3D video conferencing system that can reconstruct and autostereoscopically display a life-sized talking head using consumer-grade compute resources and minimal capture equipment. Our 3D capture uses a novel 3D lifting method that encodes a given 2D input into an efficient triplanar neural representation of the user, which can be rendered from novel viewpoints in real time. Our AI-based techniques drastically reduce the cost of 3D capture, while providing a high-fidelity 3D representation on the receiver's end for the cost of traditional 2D video streaming.
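
The triplanar representation follows the general pattern of triplane feature fields: project a 3D sample point onto three axis-aligned feature planes, sample each bilinearly, and decode the aggregated features. A hedged sketch of that lookup (the general technique, not this system's code):

```python
# Sketch of a generic triplane feature lookup: a 3D point in [-1, 1]^3
# is projected onto the xy, xz, and yz feature planes, each plane is
# sampled bilinearly, and the per-plane features are summed before
# being decoded into color/density by a small MLP (not shown).
import torch
import torch.nn.functional as F

def sample_triplane(planes: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
    """planes: (3, C, H, W) feature maps; points: (N, 3) in [-1, 1]^3."""
    coords = torch.stack([points[:, [0, 1]],   # xy plane
                          points[:, [0, 2]],   # xz plane
                          points[:, [1, 2]]])  # yz plane -> (3, N, 2)
    grid = coords.unsqueeze(2)                 # (3, N, 1, 2) for grid_sample
    feats = F.grid_sample(planes, grid, mode="bilinear", align_corners=True)
    return feats.squeeze(-1).sum(dim=0).T      # (N, C), summed over planes

planes = torch.randn(3, 32, 128, 128)          # encoded user representation
points = torch.rand(1000, 3) * 2 - 1           # ray samples in the volume
features = sample_triplane(planes, points)     # then decoded by a small MLP
```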