Data-Driven Loss Functions for Inference-Time Optimization in Text-to-Image

Text-to-image diffusion models can generate stunning visuals, yet they often fail at tasks children find trivial, such as placing a dog to the right of a teddy bear rather than to the left. With more unusual combinations, such as a giraffe above an airplane, these failures become even more pronounced. Existing methods attempt to fix these spatial reasoning failures through model fine-tuning or through test-time optimization with handcrafted losses, which are often suboptimal.
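For context, "test-time optimization with handcrafted losses" typically means nudging the denoising latent down the gradient of a loss computed on the model's cross-attention maps. Below is a minimal, self-contained PyTorch sketch of that loop; `fake_attention` and `handcrafted_spatial_loss` are illustrative stand-ins (an example of the kind of manual loss this work proposes to replace with a learned, data-driven one), not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def fake_attention(latent):
    # Stand-in for the UNet's cross-attention maps for two prompt tokens;
    # derived from latent channels so gradients flow back to the latent.
    a = F.softmax(latent[0, 0].flatten(), dim=0).view(64, 64)
    b = F.softmax(latent[0, 1].flatten(), dim=0).view(64, 64)
    return a, b

def handcrafted_spatial_loss(attn_a, attn_b):
    # Penalize token A's attention center of mass lying right of token B's
    # (columns index the image x-axis; each map sums to 1 after softmax).
    cols = torch.arange(attn_a.shape[-1], dtype=attn_a.dtype)
    cx_a = (attn_a.sum(dim=0) * cols).sum()
    cx_b = (attn_b.sum(dim=0) * cols).sum()
    return F.relu(cx_a - cx_b)

latent = torch.randn(1, 4, 64, 64, requires_grad=True)
for _ in range(10):  # a few gradient steps at one denoising timestep
    attn_a, attn_b = fake_attention(latent)
    loss = handcrafted_spatial_loss(attn_a, attn_b)
    grad, = torch.autograd.grad(loss, latent)
    latent = (latent - 0.1 * grad).detach().requires_grad_(True)
```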

Alpamayo 1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

We introduce Alpamayo 1, a vision–language–action (VLA) model that integrates Chain of Causation reasoning with trajectory planning to enhance decision-making in complex driving scenarios.
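As a rough illustration of the general reason-then-act pattern, the toy sketch below conditions trajectory decoding on a summary of a reasoning pass. Every name and shape here is a hypothetical assumption; Alpamayo 1's actual architecture generates Chain of Causation traces with a VLM, which this toy does not attempt to reproduce.

```python
import torch
import torch.nn as nn

class ReasonThenAct(nn.Module):
    def __init__(self, obs_dim=128, hidden=128, horizon=8):
        super().__init__()
        self.horizon = horizon
        self.reasoner = nn.GRU(obs_dim, hidden, batch_first=True)  # stand-in for a VLM backbone
        self.planner = nn.Linear(hidden, horizon * 2)              # decodes (x, y) waypoints

    def forward(self, obs_tokens):
        # obs_tokens: (B, T, obs_dim) fused camera/language features.
        _, h = self.reasoner(obs_tokens)
        reasoning_state = h[-1]                # summary of the reasoning pass
        traj = self.planner(reasoning_state)   # trajectory conditioned on it
        return traj.view(-1, self.horizon, 2)

waypoints = ReasonThenAct()(torch.randn(2, 10, 128))  # (2, 8, 2)
```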

Comprehensive evaluations with open-loop metrics, closed-loop simulation, and real-world vehicle tests demonstrate that Alpamayo 1 achieves state-of-the-art results across multiple aspects, including reasoning, trajectory generation, alignment, safety, and latency.

Latent Action Pretraining from Videos

We introduce Latent Action Pretraining, the first unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. Existing VLA models require action labels during pretraining, typically collected by human teleoperators, which significantly limits possible data sources and scale. In this work, we propose a method to learn from internet-scale videos that lack robot action labels.
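One way to read "learning from videos without action labels" is a VQ-VAE-style objective over consecutive frames: quantize the change between a frame pair into a discrete latent action, then train a decoder to predict the next frame from it. The sketch below illustrates that idea; the module, dimensions, and loss weights are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LatentActionModel(nn.Module):
    def __init__(self, dim=64, codebook_size=8):
        super().__init__()
        self.encoder = nn.Linear(2 * dim, dim)            # encodes (frame_t, frame_t+1)
        self.codebook = nn.Embedding(codebook_size, dim)  # discrete latent "actions"
        self.decoder = nn.Linear(2 * dim, dim)            # predicts frame_t+1

    def forward(self, frame_t, frame_next):
        z = self.encoder(torch.cat([frame_t, frame_next], dim=-1))
        idx = torch.cdist(z, self.codebook.weight).argmin(dim=-1)
        z_q = self.codebook(idx)
        # VQ losses: pull codebook vectors toward encoder outputs, and vice versa.
        vq = (z.detach() - z_q).pow(2).mean() + 0.25 * (z - z_q.detach()).pow(2).mean()
        z_st = z + (z_q - z).detach()  # straight-through estimator
        pred = self.decoder(torch.cat([frame_t, z_st], dim=-1))
        recon = (pred - frame_next).pow(2).mean()
        return recon + vq, idx

model = LatentActionModel()
f_t, f_next = torch.randn(4, 64), torch.randn(4, 64)
loss, actions = model(f_t, f_next)
loss.backward()
# `actions` are discrete pseudo-labels a VLA can later be pretrained to predict.
```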

TWIN: Two-handed Intelligent Benchmark for Bimanual Manipulation

Bimanual manipulation is challenging due to the precise spatial and temporal coordination required between two arms. While several real-world bimanual systems exist, there is a lack of simulated benchmarks with large task diversity for systematically studying bimanual capabilities across a wide range of tabletop tasks. This paper addresses this gap by presenting a benchmark for bimanual manipulation. A key functionality is the ability to autonomously generate training data without requiring human demonstrations.
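A typical way to realize such autonomous data generation is to roll out scripted motion primitives under domain randomization and keep only the successful episodes. The self-contained toy sketch below illustrates that pattern; `ToyBimanualEnv` and the primitives are hypothetical stand-ins, not TWIN's actual API.

```python
import random

class ToyBimanualEnv:
    """Illustrative stand-in for a simulated bimanual tabletop scene."""
    def reset(self, seed):
        random.seed(seed)
        self.obj_x = random.uniform(-0.2, 0.2)  # randomized object placement
        return {"obj_x": self.obj_x}

    def step(self, action):
        arm, target_x = action
        # A primitive "succeeds" if the commanded arm ends near the object.
        success = abs(target_x - self.obj_x) < 0.05
        return {"obj_x": self.obj_x}, success

def scripted_policy(obs):
    # Scripted primitives replace human demonstrations: the left arm grasps
    # the object, the right arm stabilizes it; noise mimics execution error.
    yield ("left", obs["obj_x"] + random.gauss(0, 0.02))
    yield ("right", obs["obj_x"] + random.gauss(0, 0.02))

def generate_episode(env, seed):
    obs = env.reset(seed)
    traj, ok = [], True
    for action in scripted_policy(obs):
        obs, success = env.step(action)
        traj.append((obs, action))
        ok = ok and success
    return traj if ok else None  # keep only successful rollouts

env = ToyBimanualEnv()
dataset = [ep for ep in (generate_episode(env, s) for s in range(100)) if ep]
print(f"kept {len(dataset)} successful demonstrations")
```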

WebFPSci

Web FirstPersonScience (WebFPSci) is a port of our popular G3D-based FirstPersonScience (FPSci) shooter platform.

💻 Try out the Fullscreen Version

🔎 View Source on GitHub

Task-Oriented Human Grasp Synthesis via Context- and Task-Aware Diffusers

In this paper, we study task-oriented human grasp synthesis, a new grasp synthesis task that demands both task and context awareness. At the core of our method are task-aware contact maps. Unlike traditional contact maps, which reason only about the manipulated object and its relation to the hand, our enhanced maps take scene and task information into account. This comprehensive map is critical for modeling hand-object interaction, enabling accurate grasp poses that align with the task.
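To make the representation concrete: a contact map can be read as per-point contact probabilities over the object surface, and making it "task-aware" means conditioning the prediction on scene and task embeddings as well. Below is a minimal PyTorch sketch under that assumption; the module and feature dimensions are illustrative, not the paper's model.

```python
import torch
import torch.nn as nn

class TaskAwareContactMap(nn.Module):
    def __init__(self, point_dim=3, scene_dim=16, task_dim=16, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(point_dim + scene_dim + task_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, points, scene_emb, task_emb):
        # points: (N, 3) object surface samples; the scene and task embeddings
        # are broadcast to every point so the map depends on context and task.
        n = points.shape[0]
        ctx = torch.cat([scene_emb, task_emb], dim=-1).expand(n, -1)
        logits = self.mlp(torch.cat([points, ctx], dim=-1))
        return torch.sigmoid(logits).squeeze(-1)  # per-point contact probability

points = torch.rand(1024, 3)    # sampled object surface
scene_emb = torch.randn(1, 16)  # e.g., from a scene encoder
task_emb = torch.randn(1, 16)   # e.g., from a task/text encoder
contact = TaskAwareContactMap()(points, scene_emb, task_emb)  # (1024,)
```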