Learning to Track Instances without Video Annotations

Tracking segmentation masks of multiple instances has been intensively studied, but still faces two fundamental challenges: 1) the need for large-scale, frame-wise annotation, and 2) the complexity of two-stage approaches. To resolve these challenges, we introduce a novel semi-supervised framework that learns instance tracking networks from only a labeled image dataset and unlabeled video sequences. With an instance contrastive objective, we learn an embedding that discriminates each instance from the others.
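As a rough illustration (not the paper's implementation), an instance contrastive objective of this kind is often written as an InfoNCE-style loss: it pulls two embeddings of the same instance together and pushes embeddings of other instances away. The function below is a minimal pure-Python sketch with hypothetical inputs:

```python
import math

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor embedding: the positive is another
    view/frame of the same instance; negatives come from other instances."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    def norm(u):
        return math.sqrt(dot(u, u)) or 1.0
    def cos(u, v):
        return dot(u, v) / (norm(u) * norm(v))

    # Temperature-scaled cosine similarities; the positive comes first.
    logits = [cos(anchor, positive) / temperature] + [
        cos(anchor, n) / temperature for n in negatives
    ]
    # Numerically stable -log softmax of the positive logit.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]
```

The loss is small when the positive pair is aligned and the negatives are not, and grows when a negative is more similar to the anchor than the positive is.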

Weakly-Supervised Physically Unconstrained Gaze Estimation

A major challenge for physically unconstrained gaze estimation is acquiring training data with 3D gaze annotations for in-the-wild and outdoor scenarios. In contrast, videos of human interactions in unconstrained environments are abundantly available and can be much more easily annotated with frame-level activity labels. In this work, we tackle the previously unexplored problem of weakly-supervised gaze estimation from videos of human interactions.

Contrastive Syn-to-Real Generalization

Training on synthetic data can be beneficial for label- or data-scarce scenarios. However, synthetically trained models often suffer from poor generalization in real domains due to domain gaps. In this work, we make a key observation that the diversity of the learned feature embeddings plays an important role in generalization performance.
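The role of embedding diversity can be made concrete with a simple proxy metric (an illustrative choice, not the paper's measure): the mean pairwise cosine distance of the learned features, which drops to zero when all embeddings collapse onto one direction:

```python
import math

def mean_pairwise_cosine_distance(embeddings):
    """Proxy for feature diversity: average (1 - cosine similarity) over
    all embedding pairs. Collapsed (collinear) embeddings score near 0."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    n = len(embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(1 - cos(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)
```

A set of embeddings spread across directions scores higher than a collapsed set, matching the intuition that diverse features transfer better across the syn-to-real gap.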

Short-time, Wavelet-inspired Mouse Submovement Detection

Submovements are ballistic components of human motion constituting a large part of motor interaction and arising from the cyclical and overlapping cognitive processes of perception, motor planning, and motor execution. Extracting submovements is challenging because successive motions tend to overlap: one often starts before the previous one ends. We propose and evaluate the use of a wavelet-inspired technique to accurately locate and parameterize submovements from one-dimensional speed time series.
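As a simplified sketch of the general idea (with assumed details, not the authors' exact method), one can convolve the speed series with a Ricker ("Mexican hat") wavelet, whose bell shape responds strongly to bell-shaped submovement speed pulses, and take local maxima of the response as candidate submovement centers:

```python
import math

def ricker(width, length):
    """Discrete Ricker ('Mexican hat') wavelet of the given width."""
    return [
        (1 - (t / width) ** 2) * math.exp(-0.5 * (t / width) ** 2)
        for t in range(-length // 2, length // 2 + 1)
    ]

def detect_submovements(speed, width=10, threshold=0.5):
    """Convolve a 1-D speed series with a Ricker wavelet and return the
    indices of local maxima in the response that exceed `threshold`."""
    kernel = ricker(width, 6 * width)
    half = len(kernel) // 2
    n = len(speed)
    resp = []
    for i in range(n):
        s = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < n:
                s += w * speed[j]
        resp.append(s)
    return [
        i for i in range(1, n - 1)
        if resp[i] > threshold and resp[i] >= resp[i - 1] and resp[i] > resp[i + 1]
    ]
```

Because the wavelet's negative side lobes suppress slowly varying baseline speed, two overlapping speed pulses still produce two separated response peaks, which is what makes this family of techniques attractive for overlapping submovements.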

Tianyi Xie

Tianyi is a Research Scientist at NVIDIA Research. He did his Ph.D. at the University of California, Los Angeles, where he was advised by Chenfanfu Jiang and Demetri Terzopoulos. He received his B.Eng. in Software Engineering from Shanghai Jiao Tong University. His research focuses on bridging physics-based simulation and generative AI, embedding physical priors into generative pipelines to produce visually compelling and physically consistent content, with applications spanning computer graphics, embodied AI, and robotics.

CRoCoDiL: Continuous and Robust Conditioned Diffusion for Language

Masked Diffusion Models (MDMs) provide an efficient non-causal alternative to autoregressive generation, but often struggle to model token dependencies and maintain semantic coherence due to their reliance on discrete marginal distributions. We address these limitations by shifting the diffusion process into a continuous sentence-level semantic space. We propose CRoCoDiL -- Continuous and Robust Conditioned Diffusion for Language -- a unified fine-tuning approach that jointly trains an encoder–demasker architecture, grounding the MDM demasking in continuous latent representations.

Timing Matters: The Impact of Event-Specific Frametime Spikes in First-Person Shooter Games

Frametime spikes can disrupt gameplay in first-person shooter (FPS) games, affecting both performance and player experience. This paper examines how spikes during specific game events impact players. We developed a custom FPS game that maintains a steady 500 frames/s while inducing frametime spikes during weapon reloading, fast mouse movement, or targeting. Thirty-eight participants played the game in a user study, providing both performance data and self-reported visual smoothness ratings.
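For intuition, a frametime spike at 500 frames/s is naturally defined relative to the 2 ms frame budget. The sketch below (an illustrative definition, not the paper's instrumentation) flags frames whose frametime exceeds a multiple of that budget, given per-frame presentation timestamps:

```python
def find_frametime_spikes(timestamps_ms, target_fps=500, factor=4.0):
    """Flag frames whose frametime exceeds `factor` times the target
    frame budget (2 ms at 500 frames/s).

    Returns a list of (frame_index, frametime_ms) tuples.
    """
    budget = 1000.0 / target_fps  # per-frame budget in milliseconds
    spikes = []
    for i in range(1, len(timestamps_ms)):
        frametime = timestamps_ms[i] - timestamps_ms[i - 1]
        if frametime > factor * budget:
            spikes.append((i, frametime))
    return spikes
```

Correlating the flagged frame indices with event logs (reloads, flicks, target acquisition) is what lets event-specific spikes be separated from background stutter.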