Learning to Track Instances without Video Annotations

Tracking segmentation masks of multiple instances has been intensively studied, but still faces two fundamental challenges: 1) the requirement for large-scale, frame-wise annotations, and 2) the complexity of two-stage approaches. To resolve these challenges, we introduce a novel semi-supervised framework that learns instance tracking networks from only a labeled image dataset and unlabeled video sequences. With an instance contrastive objective, we learn an embedding that discriminates each instance from the others.
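The abstract does not spell out the contrastive objective, but a common instantiation of an instance-level contrastive loss is InfoNCE: embeddings of the same instance in different frames form positive pairs, and embeddings of all other instances are negatives. The sketch below (function name, NumPy arrays, and temperature value are all illustrative assumptions, not the paper's implementation) shows the idea:

```python
import numpy as np

def instance_contrastive_loss(emb_a, emb_b, temperature=0.1):
    """InfoNCE-style instance contrastive loss (illustrative sketch).

    Row i of emb_a and row i of emb_b are embeddings of the same instance
    in two frames (a positive pair); all other row pairs are negatives.
    """
    # L2-normalize so dot products are cosine similarities.
    emb_a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    emb_b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = emb_a @ emb_b.T / temperature           # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal; minimizing the loss pulls matched
    # instances together and pushes all other instances apart.
    return -np.mean(np.diag(log_prob))
```

Minimizing this loss drives each instance's embedding toward its counterpart in the other frame and away from every other instance, which is exactly the discriminative property tracking needs.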

Weakly-Supervised Physically Unconstrained Gaze Estimation

A major challenge for physically unconstrained gaze estimation is acquiring training data with 3D gaze annotations for in-the-wild and outdoor scenarios. In contrast, videos of human interactions in unconstrained environments are abundantly available and can be much more easily annotated with frame-level activity labels. In this work, we tackle the previously unexplored problem of weakly-supervised gaze estimation from videos of human interactions.

Contrastive Syn-to-Real Generalization

Training on synthetic data can be beneficial for label- or data-scarce scenarios. However, synthetically trained models often suffer from poor generalization in real domains due to domain gaps. In this work, we make the key observation that the diversity of the learned feature embeddings plays an important role in generalization performance.

Hunting CUDA Bugs at Scale with cuFuzz

GPUs play an increasingly important role in modern software. However, the heterogeneous host-device execution model and expanding software stack make GPU programs prone to memory-safety and concurrency bugs that evade static analyses. While fuzz testing, combined with dynamic error-checking tools, offers a promising solution, it remains underutilized for GPUs.

ALPhA-Vision: A Real-Time Always-on Vision Processor with 787µs Face Detection Latency in <5mW

ALPhA-Vision is an always-on low-power subsystem for DNN-inference-based vision tasks in edge SoCs. Flexible and programmable, the subsystem supports CNN and ViT inference and employs hardware/software co-design to enable fully end-to-end execution with no external memory accesses. Fine-grained power-management features that mitigate leakage enable the subsystem to perform face detection at 60 fps with 787 µs latency, 99.3% detection accuracy, and 4.6 mW average power.

GalaxyDiT: Efficient Video Generation with Guidance Alignment and Adaptive Proxy in Diffusion Transformers

Diffusion models have revolutionized video generation, becoming essential tools in creative content generation and physical simulation. Transformer-based architectures (DiTs) and classifier-free guidance (CFG) are two cornerstones of this success, enabling strong prompt adherence and realistic video quality. Despite their versatility and superior performance, these models require intensive computation. Each video generation requires dozens of iterative steps, and CFG doubles the required compute. This inefficiency hinders broader adoption in downstream applications.
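The claim that CFG doubles the required compute follows from the standard guidance rule: every denoising step evaluates the model twice, once with the prompt and once unconditionally, and combines the two predictions. A minimal sketch of that combination (function name, NumPy arrays, and the scale value are illustrative; this is the generic CFG rule, not GalaxyDiT's method):

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=7.5):
    """Standard classifier-free guidance combination (illustrative sketch).

    eps_uncond and eps_cond are the model's noise predictions without and
    with the text prompt. Producing both requires two forward passes per
    denoising step, which is why CFG doubles the compute.
    """
    # Extrapolate from the unconditional prediction toward the
    # conditional one; guidance_scale > 1 strengthens prompt adherence.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

With guidance_scale = 1 the rule reduces to the plain conditional prediction; larger scales trade sample diversity for stronger prompt adherence, and methods that skip or approximate one of the two passes are what recover the lost efficiency.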

Gian Marti

Gian Marti is a Research Scientist at NVIDIA. He earned his B.Sc. and M.Sc. degrees in Electrical Engineering from ETH Zurich in 2017 and 2019, respectively, and completed his Ph.D. there in 2025. From 2019 to early 2026, he worked at ETH Zurich's Signal and Information Processing Laboratory and later at the Integrated Information Processing Group. He has also interned at ABB, Kistler, and NVIDIA.