Learning to Track Instances without Video Annotations

Tracking segmentation masks of multiple instances has been intensively studied, but still faces two fundamental challenges: 1) the requirement of large-scale, frame-wise annotation, and 2) the complexity of two-stage approaches. To resolve these challenges, we introduce a novel semi-supervised framework that learns instance tracking networks from only a labeled image dataset and unlabeled video sequences. With an instance contrastive objective, we learn an embedding to discriminate each instance from the others.
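As a rough illustration of what an instance contrastive objective can look like, the sketch below implements a generic InfoNCE-style loss in plain Python. It is not taken from the paper: the function names, the cosine-similarity scoring, and the temperature of 0.1 are all assumptions made for the example.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def instance_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (illustrative, not the paper's exact objective).

    `anchor` and `positive` are embeddings of the same instance;
    `negatives` are embeddings of other instances. The temperature
    value is an assumed hyperparameter.
    """
    a = l2_normalize(anchor)
    candidates = [l2_normalize(positive)] + [l2_normalize(n) for n in negatives]
    # Cosine similarity of the anchor to each candidate, scaled by temperature.
    logits = [sum(x * y for x, y in zip(a, c)) / temperature for c in candidates]
    # Numerically stable log-sum-exp; the positive sits at index 0.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_denominator - logits[0]


# A well-separated embedding gives a small loss; a confused one gives a large loss.
separated = instance_contrastive_loss([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
confused = instance_contrastive_loss([1.0, 0.0], [0.0, 1.0], [[0.9, 0.1]])
```

The loss falls as the anchor moves closer to its positive than to any negative, which is the sense in which such an embedding discriminates each instance from the others.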

Weakly-Supervised Physically Unconstrained Gaze Estimation

A major challenge for physically unconstrained gaze estimation is acquiring training data with 3D gaze annotations for in-the-wild and outdoor scenarios. In contrast, videos of human interactions in unconstrained environments are abundantly available and can be much more easily annotated with frame-level activity labels. In this work, we tackle the previously unexplored problem of weakly-supervised gaze estimation from videos of human interactions.

Contrastive Syn-to-Real Generalization

Training on synthetic data can be beneficial for label- or data-scarce scenarios. However, synthetically trained models often suffer from poor generalization in real domains due to domain gaps. In this work, we make a key observation that the diversity of the learned feature embeddings plays an important role in the generalization performance.

CRoCoDiL: Continuous and Robust Conditioned Diffusion for Language

Masked Diffusion Models (MDMs) provide an efficient non-causal alternative to autoregressive generation but often struggle with token dependencies and semantic incoherence due to their reliance on discrete marginal distributions. We address these limitations by shifting the diffusion process into a continuous sentence-level semantic space. We propose CRoCoDiL (Continuous and Robust Conditioned Diffusion for Language), a unified fine-tuning approach that jointly trains an encoder–demasker architecture, grounding the MDM demasking in continuous latent representations.

Timing Matters: The Impact of Event-Specific Frametime Spikes in First-Person Shooter Games

Frametime spikes can disrupt gameplay in first-person shooter (FPS) games, affecting both performance and player experience. This paper examines how spikes during specific game events impact players. We developed a custom FPS game that maintains a steady 500 frames/s while inducing frametime spikes during weapon reloading, fast mouse movement, or targeting. Thirty-eight participants played the game in a user study, providing both performance data and user-reported ratings of visual smoothness.
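To make the timing concrete: at a steady 500 frames/s each frame lasts 1000 / 500 = 2 ms, so an event-specific spike is a single frame that takes much longer than that. The sketch below is a hypothetical simulation, not the study's implementation. The event labels, the 100 ms spike size, and the choice to add the spike on top of the base frametime are assumptions for illustration.

```python
TARGET_FPS = 500
BASE_FRAMETIME_MS = 1000.0 / TARGET_FPS  # 2.0 ms per frame at 500 frames/s


def simulate_frametimes(events, spike_event, spike_ms):
    """Return per-frame durations in ms, spiking on frames tagged `spike_event`.

    `events` holds one entry per frame: an event name or None.
    The spike is added to the base frametime (an assumed model).
    """
    return [
        BASE_FRAMETIME_MS + spike_ms if event == spike_event else BASE_FRAMETIME_MS
        for event in events
    ]


# Hypothetical six-frame trace in which only the reload frame is spiked.
frames = [None, None, "reload", None, "targeting", None]
frametimes = simulate_frametimes(frames, spike_event="reload", spike_ms=100.0)
print(frametimes)
```

Tying the spike to an event label rather than to a fixed interval is what lets a study compare the same spike across reloading, fast mouse movement, and targeting.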

Lead Rush: A First-Person Shooter for User Studies and Understanding Effects of Frame Time Spikes

User studies are a cornerstone of human-computer interaction research, including measures of user performance and quality of experience (QoE) – particularly important for games where frame rates and frame timings can impact performance. Unfortunately, commercial games have limited options for customization and do not log player performance data with sufficient detail for use in such studies. This paper introduces Lead Rush, a first-person shooter game designed for conducting user studies on the effects of frame timing and frame rate.