Learning to Track Instances without Video Annotations

Tracking segmentation masks of multiple instances has been intensively studied, but still faces two fundamental challenges: 1) the requirement of large-scale, frame-wise annotation, and 2) the complexity of two-stage approaches. To resolve these challenges, we introduce a novel semi-supervised framework that learns instance tracking networks from only a labeled image dataset and unlabeled video sequences. With an instance contrastive objective, we learn an embedding that discriminates each instance from the others.
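As a rough illustration of the kind of instance contrastive objective mentioned above, the sketch below implements a generic InfoNCE-style loss over per-instance embeddings from two augmented views of a labeled image. The function name, tensor shapes, and temperature are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(emb_a, emb_b, temperature=0.1):
    """InfoNCE-style objective over instance embeddings (illustrative sketch).

    emb_a, emb_b: (N, D) embeddings of the same N instances under two augmented
    views; instance i in view A should match instance i in view B and be pushed
    away from every other instance.
    """
    emb_a = F.normalize(emb_a, dim=1)
    emb_b = F.normalize(emb_b, dim=1)
    logits = emb_a @ emb_b.t() / temperature               # (N, N) cosine similarities
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    # Symmetric cross-entropy: positives sit on the diagonal of the similarity matrix.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings standing in for pooled per-instance features.
if __name__ == "__main__":
    a, b = torch.randn(8, 128), torch.randn(8, 128)
    print(instance_contrastive_loss(a, b).item())
```

Minimizing a loss of this form pulls the two embeddings of the same instance together while pushing apart embeddings of different instances, which is the discriminative property the abstract refers to.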

Weakly-Supervised Physically Unconstrained Gaze Estimation

A major challenge for physically unconstrained gaze estimation is acquiring training data with 3D gaze annotations for in-the-wild and outdoor scenarios. In contrast, videos of human interactions in unconstrained environments are abundantly available and can be much more easily annotated with frame-level activity labels. In this work, we tackle the previously unexplored problem of weakly-supervised gaze estimation from videos of human interactions.

Contrastive Syn-to-Real Generalization

Training on synthetic data can be beneficial for label- or data-scarce scenarios. However, synthetically trained models often suffer from poor generalization in real domains due to the domain gap. In this work, we make a key observation that the diversity of the learned feature embeddings plays an important role in generalization performance.
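To make the observation about feature diversity more concrete, here is a minimal sketch of one way a contrastive term can keep a synthetically trained network's features from collapsing: each feature is pulled toward the frozen, ImageNet-pretrained feature of the same image and pushed away from those of other images in the batch. The function name, feature dimensions, and temperature are assumptions for illustration and do not reproduce the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def diversity_preserving_contrastive_loss(student_feat, teacher_feat, temperature=0.07):
    """InfoNCE-style regularizer (hypothetical name, illustrative only).

    student_feat: (B, D) features from the network being trained on synthetic data.
    teacher_feat: (B, D) features of the same images from a frozen ImageNet-pretrained
    network (computed under torch.no_grad() in practice). Matching pairs are pulled
    together and features of different images are pushed apart, discouraging collapse.
    """
    s = F.normalize(student_feat, dim=1)
    t = F.normalize(teacher_feat, dim=1)
    logits = s @ t.t() / temperature                        # (B, B) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)      # positives on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage with random features standing in for pooled backbone activations.
if __name__ == "__main__":
    student = torch.randn(16, 2048, requires_grad=True)
    teacher = torch.randn(16, 2048)
    print(diversity_preserving_contrastive_loss(student, teacher).item())
```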

Haotian Zhang

Haotian Zhang is a Senior Research Scientist at NVIDIA Cosmos. His research aims to enable embodied agents to understand the outside world. To that end, he works on designing modules that learn effective representations of information from Vision & Language. Haotian's work on GLIP was selected as a CVPR 2022 Best Paper Finalist. Prior to joining NVIDIA, he obtained his Ph.D. at the University of Washington. Haotian believes that an interesting life comes from doing interesting things with interesting people, and that’s what he hopes to do.

Chi-Pin Huang

Chi-Pin Huang is a Research Scientist at NVIDIA Research Taiwan. His research focuses on Vision-Language Generative Models and Vision-Language-Action Models (VLAs), with particular interest in bridging perception, generation, and decision-making. He received his Ph.D. from National Taiwan University in 2026 under the supervision of Prof. Yu-Chiang Frank Wang, and earned his B.S.

Sameer Dharur

Sameer Dharur is a research scientist on the Cosmos team at NVIDIA, helping to build vision-language models (VLMs) that reason better about the world. Prior to that, he spent ~4.5 years as a researcher and engineer at Apple, specializing in computer vision and natural language processing to solve problems in image and video understanding, question answering, and robotics.