3D Reconstruction with Generalizable Neural Fields using Scene Priors

High-fidelity 3D scene reconstruction has been substantially advanced by recent progress in neural fields. However, most existing methods train a separate network from scratch for each individual scene. This approach does not scale, is inefficient, and yields poor results when only limited views are available. While learning-based multi-view stereo methods alleviate this issue to some extent, their multi-view setting makes them less flexible to scale up and to apply broadly. Instead, we propose training generalizable Neural Fields incorporating scene Priors (NFPs).

Policy Optimized Text-to-Image Pipeline Design

Text-to-image generation has evolved beyond single monolithic models to complex multi-component pipelines. These combine fine-tuned generators, adapters, upscaling blocks and even editing steps, leading to significant improvements in image quality. However, their effective design requires substantial expertise.

Data-Driven Loss Functions for Inference-Time Optimization in Text-to-Image

Text-to-image diffusion models can generate stunning visuals, yet they often fail at tasks children find trivial, like placing a dog to the right of a teddy bear rather than to the left. When combinations get more unusual, such as a giraffe above an airplane, these failures become even more pronounced. Existing methods attempt to fix these spatial reasoning failures through model fine-tuning or test-time optimization with handcrafted losses that are suboptimal.

Alpamayo 1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

We introduce Alpamayo 1, a vision–language–action model (VLA) that integrates Chain of Causation reasoning with trajectory planning to enhance decision-making in complex driving scenarios.

Comprehensive evaluations with open-loop metrics, closed-loop simulation, and real-world vehicle tests demonstrate that Alpamayo 1 is state-of-the-art in multiple aspects (including reasoning, trajectory generation, alignment, safety, latency, and more).

Latent Action Pretraining from Videos

We introduce Latent Action Pretraining, the first unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. Existing Vision-Language-Action models require action labels during pretraining, typically collected by human teleoperators, which significantly limits possible data sources and scale. In this work, we propose a method to learn from internet-scale videos that do not have robot action labels.

TWIN: Two-handed Intelligent Benchmark for Bimanual Manipulation

Bimanual manipulation is challenging due to the precise spatial and temporal coordination required between two arms. While several real-world bimanual systems exist, there is a lack of simulated benchmarks with enough task diversity to systematically study bimanual capabilities across a wide range of tabletop tasks. This paper addresses the gap by presenting a benchmark for bimanual manipulation. A key functionality is the ability to autonomously generate training data without requiring human demonstrations.