Learning to Track Instances without Video Annotations

Tracking segmentation masks of multiple instances has been intensively studied, but still faces two fundamental challenges: 1) the requirement of large-scale, frame-wise annotation, and 2) the complexity of two-stage approaches. To resolve these challenges, we introduce a novel semi-supervised framework by learning instance tracking networks with only a labeled image dataset and unlabeled video sequences. With an instance contrastive objective, we learn an embedding to discriminate each instance from the others.
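The abstract's instance contrastive objective can be illustrated with a minimal InfoNCE-style sketch: each instance embedding from one frame is pulled toward the embedding of the same instance in another frame and pushed away from all other instances. The function name, temperature value, and numpy implementation below are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def instance_contrastive_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style instance contrastive loss (illustrative sketch).

    anchors:   (N, D) embeddings of N instances in one frame
    positives: (N, D) embeddings of the SAME N instances in another frame
    Each anchor's positive is its own row; every other row is a negative.
    """
    # L2-normalize so dot products are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal (matching instance) as the target.
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 32))
# Positives that truly match their anchors give a much lower loss
# than positives drawn at random.
loss_matched = instance_contrastive_loss(emb, emb + 0.01 * rng.normal(size=(8, 32)))
loss_random = instance_contrastive_loss(emb, rng.normal(size=(8, 32)))
print(loss_matched, loss_random)
```

In this setup, minimizing the loss drives embeddings of the same instance together across frames, which is what lets the learned embedding discriminate instances without frame-wise video labels.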

Weakly-Supervised Physically Unconstrained Gaze Estimation

A major challenge for physically unconstrained gaze estimation is acquiring training data with 3D gaze annotations for in-the-wild and outdoor scenarios. In contrast, videos of human interactions in unconstrained environments are abundantly available and can be much more easily annotated with frame-level activity labels. In this work, we tackle the previously unexplored problem of weakly-supervised gaze estimation from videos of human interactions.

Contrastive Syn-to-Real Generalization

Training on synthetic data can be beneficial for label or data-scarce scenarios. However, synthetically trained models often suffer from poor generalization in real domains due to domain gaps. In this work, we make a key observation that the diversity of the learned feature embeddings plays an important role in the generalization performance.
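One simple way to make the "diversity of learned feature embeddings" concrete is to measure the mean pairwise cosine distance between embedding vectors: a model whose features collapse toward one direction scores near zero, while well-spread features score higher. This metric and the code below are an illustrative proxy, not the paper's exact measure.

```python
import numpy as np

def embedding_diversity(features):
    """Mean pairwise cosine DISTANCE over off-diagonal pairs.

    features: (N, D) array of embedding vectors.
    Returns a scalar in roughly [0, 2]; higher means more diverse.
    (Illustrative proxy for feature diversity, assumed for this sketch.)
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T                       # (N, N) cosine similarities
    n = len(f)
    off_diag_sum = sim.sum() - np.trace(sim)   # exclude self-similarity
    mean_sim = off_diag_sum / (n * (n - 1))
    return 1.0 - mean_sim

rng = np.random.default_rng(1)
spread = rng.normal(size=(16, 64))                        # well-spread features
collapsed = rng.normal(size=(1, 64)) + 0.01 * rng.normal(size=(16, 64))
print(embedding_diversity(spread), embedding_diversity(collapsed))
```

Under the paper's observation, a synthetically trained model whose features look like `collapsed` would be expected to generalize worse to real domains than one whose features look like `spread`.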

Elie Aljalbout

I'm a research scientist at the Seattle Robotics Lab. Prior to joining NVIDIA, I was working on world modeling for robotics at Meta FAIR. Before that, I was a postdoctoral researcher in Zurich, Switzerland, working on agile robotics. I completed my PhD at TU Munich while working as a research scientist at the Volkswagen Machine Learning Research Lab.

Dvir Samuel

Dvir Samuel joined NVIDIA Research as a Research Scientist in 2026. His main fields of interest are machine learning and generative modeling. In particular, he studies diffusion- and flow-matching methods for image and video generation and editing, with an emphasis on personalization, controllability, and learning under long-tailed data regimes.

Dvir completed his Ph.D. at Bar-Ilan University under the supervision of Prof. Gal Chechik. His research spans long-tail and few-shot learning, as well as modern generative approaches for visual content creation and manipulation.

Aim My Robot: Precision Local Navigation to Any Object

Existing navigation systems mostly consider “success” when the robot reaches within a 1 m radius of a goal. This precision is insufficient for emerging applications where a robot needs to be positioned precisely relative to an object for downstream tasks, such as docking, inspection, and manipulation. To this end, we design and implement Aim-My-Robot (AMR), a local navigation system that enables a robot to reach any object in its vicinity at the desired relative pose, with centimeter-level accuracy.