GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning

Gaussian splatting has emerged as a powerful 3D representation that harnesses the advantages of both explicit (mesh) and implicit (NeRF) 3D representations. In this paper, we seek to leverage Gaussian splatting to generate realistic animatable avatars from textual descriptions, addressing the limitations (e.g., flexibility and efficiency) imposed by mesh or NeRF-based representations.

3D Reconstruction with Generalizable Neural Fields using Scene Priors

High-fidelity 3D scene reconstruction has been substantially advanced by recent progress in neural fields. However, most existing methods train a separate network from scratch for each individual scene. This is not scalable, inefficient, and unable to yield good results given limited views. While learning-based multi-view stereo methods alleviate this issue to some extent, their multi-view setting makes it less flexible to scale up and to broad applications. Instead, we introduce training generalizable Neural Fields incorporating scene Priors (NFPs).

Policy Optimized Text-to-Image Pipeline Design

Text-to-image generation has evolved beyond single monolithic models to complex multi-component pipelines. These combine fine-tuned generators, adapters, upscaling blocks and even editing steps, leading to significant improvements in image quality. However, their effective design requires substantial expertise.

Data-Driven Loss Functions for Inference-Time Optimization in Text-to-Image

Text-to-image diffusion models can generate stunning visuals, yet they often fail at tasks children find trivial - like placing a dog to the right of a teddy bear rather than to the left. When combinations get more unusual - a giraffe above an airplane—these failures become even more pronounced. Existing methods attempt to fix these spatial reasoning failures through model fine-tuning or test-time optimization with handcrafted losses that are suboptimal.

Alpamayo 1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

We introduce Alpamayo 1, a vision–language–action model (VLA) that integrates Chain of Causation reasoning with trajectory planning to enhance decision-making in complex driving scenarios.

Comprehensive evaluations with open-loop metrics, closed-loop simulation, and real-world vehicle tests demonstrate that Alpamayo 1 is state-of-the-art in multiple aspects (including reasoning, trajectory generation, alignment, safety, latency, and more).

Latent Action Pretraining from Videos

We introduce Latent Action Pretraining, the first unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. Existing Vision-Language-Action models require action labels typically collected by human teleoperators during pretraining, which significantly limits possible data sources and scale. In this work, we propose a method to learn from internet-scale videos that do not have robot action labels.