
DreamGen: Unlocking Generalization in Robot Learning through Neural Trajectories

May 20, 2025

Robot foundation models (i.e., Vision-Language-Action models) have demonstrated promising results as general-purpose robotic systems through a well-known protocol: imitation learning on human-teleoperated demonstrations sourced from diverse behaviors and environments to enable stronger generalization. However, collecting robot data through teleoperation requires extensive time and human effort, limiting scalability. For example, to enable stronger environment generalization, one would have to collect human demonstrations across many environments, carrying the physical robot around to gather data. Furthermore, behavior generalization, that is, learning new "verbs" outside of the teleoperation data, has yet to be shown in the literature with imitation learning methods, further limiting scalability. In this blog post, we introduce DreamGen, a 4-stage pipeline for generating neural trajectories: synthetic robot data produced by video world models. This work is the first in the literature to enable zero-shot behavior generalization and zero-shot environment generalization, shifting the paradigm of robot learning from scaling human teleoperation data to scaling GPU compute through world models.

DreamGen consists of four steps (a code sketch follows the list):

  1. We first finetune video world models (image-to-video diffusion models) on a target robot to learn the dynamics of the given robot embodiment.
  2. We prompt the models with initial frames and language instructions, generating robot videos that include not only in-domain behaviors but also novel behaviors in novel environments.
  3. We extract pseudo robot actions via a latent action model or an inverse dynamics model (IDM).
  4. We use these videos labeled with pseudo actions, which we call neural trajectories, for downstream visuomotor policy learning.
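To make the flow concrete, here is a minimal sketch of the pipeline in Python. Everything in it is a hypothetical placeholder (the callables `finetune_world_model`, `label_actions`, and `train_policy` stand in for the components described above); it illustrates the data flow, not the actual DreamGen implementation.

```python
# A minimal, illustrative sketch of the four-step DreamGen pipeline.
# Every component is a hypothetical placeholder passed in as a callable;
# this is not the actual DreamGen implementation.

def dreamgen_pipeline(
    finetune_world_model,  # step 1: adapt a video diffusion model to the robot
    label_actions,         # step 3: latent action model (e.g. LAPA) or IDM
    train_policy,          # step 4: downstream visuomotor policy learning
    teleop_dataset,        # in-domain teleoperated demonstrations
    prompts,               # iterable of (initial_frame, language_instruction)
):
    # 1. Finetune the image-to-video world model on the target robot's
    #    teleoperation data so it learns the embodiment's dynamics.
    world_model = finetune_world_model(teleop_dataset)

    # 2. Prompt the finetuned model with initial frames and language
    #    instructions to "dream" robot videos, including novel behaviors
    #    and novel environments.
    dreamed_videos = [world_model(frame, instruction)
                      for frame, instruction in prompts]

    # 3. Extract pseudo robot actions for each generated video.
    neural_trajectories = [(video, label_actions(video))
                           for video in dreamed_videos]

    # 4. Train the visuomotor policy on the resulting neural trajectories.
    return train_policy(neural_trajectories)
```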

With DreamGen, we enable a humanoid robot to perform 22 new verbs in 10 new environments. In the following subsections, we show videos of the generated neural trajectories along with the visuomotor policy executions obtained by training solely on neural trajectories (50 per task) for learning (1) new verbs in the lab, (2) seen verbs in new scenes, and (3) new verbs in new scenes. Note that both the video world model and the IDM were trained only on pick-and-place teleoperation data in a single environment.

#1. Behavior Generalization

[Interactive demo: select a behavior to view its neural trajectory alongside the corresponding policy rollout.]

#2. Environment Generalization

[Interactive demo: select an environment task to view its neural trajectory alongside the corresponding policy rollout.]

#3. Behavior + Environment Generalization

[Interactive demo: select a behavior and environment task to view its neural trajectory alongside the corresponding policy rollout.]

Performance Comparison

The chart below shows the performance comparison between GR00T N1 and GR00T N1 w/ DreamGen across different generalization conditions.

DreamGen for Augmenting Data for Contact-Rich Tasks

While prior work has shown that synthetic robot data can be generated through simulation, such approaches still face the sim2real gap and struggle to generate training data for contact-rich tasks. We show that DreamGen can augment data for contact-rich tasks such as manipulating deformable objects (folding) and tool use (hammering), and goes straight from real to real, starting from just an initial real-world frame. The same pipeline can be applied to other robotic systems, including a single-arm robot (Franka) as well as a $100 robot arm (SO-100), and with multiple camera views (e.g., wrist cameras). Below are neural trajectories from DreamGen along with the downstream policy co-trained on real-world data and neural trajectories; a sketch of one way to set up such co-training follows the charts below.

[Videos: neural trajectories and the corresponding policy rollouts.]

The graphs below show the performance gains from DreamGen across different robotic platforms (GR1, Franka, SO-100) and visuomotor robot policies (DP, π0, GR00T N1), highlighting the flexibility of DreamGen.

[Charts: GR1, Franka, SO-100.]
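To make the co-training concrete, here is a minimal sketch of one way to mix real demonstrations with neural trajectories using standard PyTorch utilities. The stand-in datasets, the 50/50 sampling split, and the batch size are assumptions for illustration, not settings reported in this post.

```python
import torch
from torch.utils.data import (ConcatDataset, DataLoader, TensorDataset,
                              WeightedRandomSampler)

# Stand-in datasets for illustration only: pretend each sample is an
# (observation, action) pair. In practice these would be real teleoperated
# demonstrations and DreamGen neural trajectories with pseudo actions.
real_demos   = TensorDataset(torch.randn(100, 8), torch.randn(100, 7))
neural_trajs = TensorDataset(torch.randn(5000, 8), torch.randn(5000, 7))

mixed = ConcatDataset([real_demos, neural_trajs])

# Per-sample weights so that, in expectation, half of each batch comes from
# each source (the 50/50 split is an assumption, not a reported setting).
weights = torch.cat([
    torch.full((len(real_demos),),   0.5 / len(real_demos)),
    torch.full((len(neural_trajs),), 0.5 / len(neural_trajs)),
])
sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
loader  = DataLoader(mixed, batch_size=64, sampler=sampler)

for obs, act in loader:
    pass  # one co-training step of the visuomotor policy would go here
```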

Data Scaling

We analyze whether increasing the number of neural trajectories leads to better performance by measuring RoboCasa performance in simulation (sample neural trajectories and policy evaluations are shown above). We vary the total number of neural trajectories from 0 to 240k across different ground-truth data regimes (low, medium, high), and explore both latent actions (LAPA) and IDM for obtaining pseudo actions. We observe that both IDM and LAPA actions yield a performance boost in every data regime. We also observe a log-linear relationship between the total number of neural trajectories and downstream robot policy performance, establishing a promising new axis for scaling robot training data.
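For readers who want to check such a trend on their own runs, a log-linear relationship simply means the success rate grows roughly linearly in the logarithm of the trajectory count. The sketch below fits that line with NumPy on dummy numbers that stand in for real measurements; none of the values are results reported here.

```python
import numpy as np

# Illustrative only: dummy (trajectory count, success rate) pairs standing in
# for real measurements.
num_trajs = np.array([1e3, 1e4, 1e5, 2.4e5])
success   = np.array([0.20, 0.35, 0.50, 0.55])

# A log-linear trend means success ≈ a * log10(N) + b, so fit a line in log space.
slope, intercept = np.polyfit(np.log10(num_trajs), success, deg=1)
print(f"fitted gain per decade of neural trajectories: {slope:.3f}")
```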

DreamGen Bench

[Interactive demo: a RoboCasa input frame with the prompt "Pick the onion from the sink and place it on the plate located on the counter", comparing zero-shot and finetuned (SFT) generations from WAN, Hunyuan, CogVideo, and Cosmos.]

We introduce DreamGen Bench, a world-modeling benchmark that quantifies how well existing video generative models adapt to a specific robot embodiment. We measure two key metrics: instruction following (whether the generated video strictly adheres to the given instruction) and physics following (the physical plausibility of the generated video). We evaluate 3 video world models (Hunyuan, CogVideoX, and WAN 2.1) in two different setups (one in simulation on the Franka Emika robot and one in the real world on the Fourier GR1 humanoid). We observe a strong positive correlation between DreamGen Bench scores and downstream task performance.
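As a small illustration of how such a correlation can be computed, the sketch below aggregates hypothetical instruction-following and physics-following scores per model and measures their Pearson correlation with downstream success. The model names, the numbers, and the averaging of the two metrics are all made-up placeholders, not our reported results or the benchmark's exact scoring.

```python
import numpy as np

# Illustrative only: hypothetical per-model scores, not the reported results.
# Each entry: (instruction-following, physics-following, downstream success).
bench = {
    "model_a": (0.80, 0.70, 0.55),
    "model_b": (0.60, 0.65, 0.40),
    "model_c": (0.40, 0.50, 0.25),
}

scores      = np.array(list(bench.values()))
bench_score = scores[:, :2].mean(axis=1)  # combine both metrics (assumed aggregation)
downstream  = scores[:, 2]

# Pearson correlation between DreamGen Bench score and downstream performance.
r = np.corrcoef(bench_score, downstream)[0, 1]
print(f"Pearson r = {r:.2f}")
```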