Learning to Track Instances without Video Annotations

Tracking the segmentation masks of multiple instances has been intensively studied, but still faces two fundamental challenges: 1) the requirement of large-scale, frame-wise annotation, and 2) the complexity of two-stage approaches. To resolve these challenges, we introduce a novel semi-supervised framework that learns instance tracking networks from only a labeled image dataset and unlabeled video sequences. With an instance contrastive objective, we learn an embedding that discriminates each instance from the others.
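
As a rough illustration of what an instance contrastive objective can look like, the sketch below uses an InfoNCE-style loss over instance embeddings from two frames. This is a generic formulation, not the paper's implementation: the function names, the temperature value, and the toy two-dimensional embeddings are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def instance_contrastive_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss (hypothetical helper, not from the paper).

    anchors[i] and positives[i] are assumed to be embeddings of the same
    instance in two frames; every positives[j] with j != i is a negative.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the matching instance.
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy embeddings for two instances seen in frame A and again in frame B.
frame_a = [[1.0, 0.0], [0.0, 1.0]]
frame_b_matched = [[0.9, 0.1], [0.1, 0.9]]   # same instance order as frame A
frame_b_swapped = [[0.1, 0.9], [0.9, 0.1]]   # instance identities swapped

loss_matched = instance_contrastive_loss(frame_a, frame_b_matched)
loss_swapped = instance_contrastive_loss(frame_a, frame_b_swapped)
print(loss_matched < loss_swapped)
```

The loss is low when each embedding is closest to its own instance in the other frame and high when identities are confused, which is the sense in which such an embedding discriminates each instance from the others.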

Weakly-Supervised Physically Unconstrained Gaze Estimation

A major challenge for physically unconstrained gaze estimation is acquiring training data with 3D gaze annotations for in-the-wild and outdoor scenarios. In contrast, videos of human interactions in unconstrained environments are abundantly available and can be much more easily annotated with frame-level activity labels. In this work, we tackle the previously unexplored problem of weakly-supervised gaze estimation from videos of human interactions.

Contrastive Syn-to-Real Generalization

Training on synthetic data can be beneficial in label-scarce or data-scarce scenarios. However, synthetically trained models often generalize poorly to real domains because of domain gaps. In this work, we make the key observation that the diversity of the learned feature embeddings plays an important role in generalization performance.

Peter Kocsis

Peter joined NVIDIA as a Research Scientist in April 2026. His research focuses on inverse and forward rendering, using generative image priors to solve fundamental challenges in material estimation and light transport. He completed his PhD at the Technical University of Munich under the supervision of Prof. Dr. Matthias Nießner. Peter began his career as a mechatronics engineer. His early work spans a broad range, from neural control for robotics to active learning for computer vision.

Bing Xu

Bing is a research scientist in the Real-Time Graphics Research Group at NVIDIA, where her work focuses on physically-based rendering and neural rendering. In the past, she explored forward and inverse light transport, importance sampling, denoising, and deep learning approaches for appearance modeling.

Bing earned her PhD from the University of California, San Diego, advised by Prof. Ravi Ramamoorthi. She holds a Bachelor’s degree from the University of Hong Kong and gained industry experience in offline rendering and 3D designer tools before moving into research.