RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

Xuning Yang1, Rishit Dagli1,2, Alex Zook1, Hugo Hadfield1, Ankit Goyal1, Stan Birchfield1, Fabio Ramos1,3, Jonathan Tremblay1
1NVIDIA    2University of Toronto    3The University of Sydney

RoboLab is a robot- and policy-agnostic simulation benchmarking platform for evaluating real-world task-generalist robot policies. It evaluates policies trained on real-world data directly in simulation, without co-training on simulation data, in order to measure task generalization. RoboLab is built for fast scene and task generation, with evolving task libraries that prevent benchmark oversaturation.

RoboLab Framework

RoboLab allows users to generate scenes by physically arranging objects in simulation, and to generate tasks by adding a language instruction, mirroring the real-world evaluation setup. Scenes and tasks can be created by hand in minutes, or at scale with agentic AI workflows. Task libraries are robot- and policy-agnostic: users combine task libraries with their own robot and policy to create ready-to-run environments, so the same task library can be reused across different robots and policies.

1 Scene Generation

Physically arrange objects in simulation to build diverse scene libraries, manually or at scale with LLMs.

2 Task Generation

Create tasks by adding language instructions to scenes, manually or at scale with LLMs.

3 Environment Generation

Specify the robot, policy, action/observation configs, scene variations to turn tasks into runnable envs.
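The environment-generation step above can be sketched as a small config object. This is an illustrative sketch only: the names (`EnvConfig`, `make_env`, the robot and checkpoint strings) are assumptions for exposition, not RoboLab's actual API.

```python
# Hypothetical sketch of step 3 (environment generation); all identifiers
# here are illustrative, not RoboLab's real interface.
from dataclasses import dataclass, field

@dataclass
class EnvConfig:
    task_file: str                      # task from the task library (scene + instruction)
    robot: str                          # e.g. "franka_panda" (assumed name)
    policy_checkpoint: str              # path to an off-the-shelf checkpoint
    action_space: str = "joint_position"
    camera_names: tuple = ("base", "wrist")
    scene_variations: dict = field(default_factory=dict)  # lighting, textures, ...

def make_env(cfg: EnvConfig) -> str:
    """Stand-in factory: a real implementation would load the USD scene,
    spawn the robot, and wire up the policy's observation/action configs."""
    return f"env({cfg.task_file}, robot={cfg.robot}, actions={cfg.action_space})"

cfg = EnvConfig(
    task_file="robolab/tasks/demo/banana_in_bowl_task.py",
    robot="franka_panda",
    policy_checkpoint="checkpoints/pi05_droid.pt",
)
print(make_env(cfg))
```

Because the task file carries the scene and instruction while the config carries the robot and policy, swapping `robot` or `policy_checkpoint` reuses the same task library unchanged.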

Example Tasks

Default Put the onion in the wood bowl

Vague Put the vegetable in the wood bowl

Specific Pick up the round onion and place it inside the wooden bowl

Default Pack canned foods into the bin

Vague Pack cans

Specific Pick up the canned food item and place it inside the bin

Default Put all the green fruit on the plate

Vague Plate green fruits

Specific Pick up all the green limes on the table and place them on the flat, beige-colored plate

Default Put the red hammer and black hammer in the left bin

Vague Sort some of the tools

Specific Pick up the red-handled hammer and the black-handled hammer and place both in the left bin

Default Put the small red yogurt in the red bowl

Vague Put the yogurt in the bowl

Specific Pick up the small red yogurt container and place it inside the red bowl

Default Put the white mug in the center of the table

Vague Move the mug to the center

Specific Pick up the white ceramic mug and place it in the center of the table surface away from the other objects

Default Move an orange to the white bowl

Vague Move a orange to the white bowl

Specific Pick up one orange citrus fruit and place it inside the white bowl

Default Stack the blocks in the order from bottom to top: red, blue, green, yellow

Vague Stack in the order of red, blue, green, yellow

Specific Build a tower by placing the red block first, then the blue block on top, then the green, and finally the yellow block on top as a single tower

Default Unstack the Rubik's cube tower

Vague Unstack the tower

Specific Remove each Rubik's cube from the stacked tower one at a time and place them separately on the table

Scene Variations

RoboLab supports controlled perturbations across lighting, camera pose, backgrounds, textures, and shadows to systematically test policy robustness.
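One way to read "controlled perturbations" is sampling one value per variation axis while everything else stays fixed. The sketch below assumes hypothetical parameter names and ranges; RoboLab's real variation schema may differ.

```python
# Illustrative sketch of controlled scene perturbations; the parameter
# names and ranges are assumptions, not RoboLab's actual config schema.
import random

PERTURBATIONS = {
    "light_intensity": (0.5, 1.5),   # scale on nominal lighting
    "camera_yaw_deg":  (-5.0, 5.0),  # small camera-pose jitter
    "texture_id":      (0, 9),       # index into a background/texture set
}

def sample_variation(rng: random.Random) -> dict:
    """Draw one value per perturbation axis so each factor's effect on
    success rate can be isolated across repeated episodes."""
    out = {}
    for name, (lo, hi) in PERTURBATIONS.items():
        if isinstance(lo, int):
            out[name] = rng.randint(lo, hi)   # discrete axis
        else:
            out[name] = rng.uniform(lo, hi)   # continuous axis
    return out

rng = random.Random(0)
variation = sample_variation(rng)
print(variation)
```

Seeding the generator makes a variation sweep reproducible, which matters when comparing policies on identical perturbation sets.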

Agentic Workflows

RoboLab supports code agents for agentic scene and task generation directly from the terminal. Use natural language to describe what you need, and let the agent handle asset placement, constraint satisfaction, and instruction authoring.

Code Agent — /robolab-scenegen
> /robolab-scenegen
agent I'll help you generate a USD scene file. I need a few things:

1. Scene name — What should the file be called? (e.g., fruits_bowl.usda)
2. Scene description — What should be on the table?
3. Number of objects — How many? (default: 3–5)
4. Output directory — Where should I save it?

Objects I'll pick from: 312 objects — containers, kitchenware, fruits, tools, blocks, and more.
> kitchen counter with 3 mugs, 2 bowls, and a banana
agent Resolving placements via predicate solver… collision-free layout found.
Scene saved to assets/scenes/generated/kitchen_counter.usda

Running settle (physics simulation)… done.
Screenshot saved to assets/scenes/generated/_images/kitchen_counter.png
> |
Code Agent — /robolab-taskgen
> /robolab-taskgen
agent I'll help you generate a RoboLab task file. I need a few things:

1. Scene file — A filename or path to the .usda file
2. Task instruction — What should the robot do?
3. Episode length — Time in seconds (e.g., 50 for simple)
4. Output directory — Where should I save it?
> Scene: kitchen_counter.usda. Put the banana in the bowl. 50s.
agent Generated task with 3 instruction variants:
  default "Pick up the banana and place it in the bowl"
  vague   "Put the fruit in the bowl"
  specific "Grasp the yellow banana and place it inside the red bowl"
Task saved to robolab/tasks/demo/banana_in_bowl_task.py
> |

Benchmark

We introduce the RoboLab Benchmark, which evaluates policies across three competency axes, each spanning three difficulty levels.

120 tasks · Average subtasks per task: 2.02 · Average objects per task: 9.0 · Average difficulty score: 2.90

Category Breakdown

Distribution by Difficulty

Comparison to DROID Dataset

Our benchmark emphasizes multi-step and two-step tasks compared to the DROID training distribution. In addition, only 68.7% of benchmark objects appear in the DROID training vocabulary.

Task Length Comparison

Word-Level Object Vocabulary Overlap

Word-level overlap between the DROID vocabulary (2,760 words) and the benchmark vocabulary (99 words): 68 words appear in both, 2,692 are DROID-only, and 31 are benchmark-only. Benchmark coverage: 68.7% · DROID coverage: 2.5%
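The coverage percentages reduce to simple set arithmetic. The word lists below are toy stand-ins, not the real DROID or benchmark vocabularies:

```python
# Toy reconstruction of the word-level vocabulary-overlap metric;
# these word sets are made up for illustration only.
droid_vocab = {"banana", "mug", "bowl", "bin", "plate", "hammer", "block"}
benchmark_vocab = {"banana", "mug", "bowl", "onion", "lime"}

both = droid_vocab & benchmark_vocab
benchmark_coverage = len(both) / len(benchmark_vocab)  # share of benchmark words seen in DROID
droid_coverage = len(both) / len(droid_vocab)          # share of DROID words used by the benchmark

print(f"both={len(both)}, "
      f"benchmark coverage={benchmark_coverage:.1%}, "
      f"DROID coverage={droid_coverage:.1%}")
```

With the paper's real counts (68 shared words, 99 benchmark words, 2,760 DROID words), the same two ratios give 68.7% and 2.5%.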

Results

Overall

Task success rates (%) of SOTA VLA models on RoboLab-120, across various levels of language specificity.

Each task is evaluated over a minimum of 10 episodes to account for environment stochasticity. All policies use off-the-shelf checkpoints fine-tuned on the DROID dataset.

Model              Overall (V / D / S)   Simple (V / D / S)    Moderate (V / D / S)   Complex (V / D / S)
π0.5               17.1 / 25.0 / 25.6    18.4 / 26.7 / 28.8    20.0 / 27.7 / 26.3     5.3 / 12.4 / 11.1
π0-FAST            9.9 / 15.4 / 15.5     11.4 / 21.1 / 20.6    11.8 / 11.3 / 11.8     0.0 / 3.5 / 4.7
π0                 3.8 / 5.3 / 6.3       5.6 / 7.7 / 9.5       2.6 / 3.7 / 3.8        0.0 / 0.0 / 0.0
GR00T N1.6         Coming soon
paligemma-binning  Coming soon

V = Vague, D = Default, S = Specific.
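The evaluation protocol described above (at least 10 episodes per task, success rates averaged over the 120-task benchmark) can be sketched as follows; `run_episode` is a stand-in for an actual simulator rollout, with an invented per-episode success probability:

```python
# Minimal sketch of the evaluation protocol: >= 10 episodes per task,
# success rates averaged over all 120 benchmark tasks.
# run_episode is a placeholder for a real policy rollout in simulation.
import random

def run_episode(task_id: int, rng: random.Random) -> bool:
    """Placeholder rollout; a real one would step the simulator with the policy."""
    return rng.random() < 0.3  # assumed 30% per-episode success, for illustration

def task_success_rate(task_id: int, rng: random.Random, n_episodes: int = 10) -> float:
    assert n_episodes >= 10, "at least 10 episodes per task"
    return sum(run_episode(task_id, rng) for _ in range(n_episodes)) / n_episodes

rng = random.Random(0)
rates = [task_success_rate(t, rng) for t in range(120)]  # 120 benchmark tasks
overall = 100 * sum(rates) / len(rates)
print(f"overall success: {overall:.1f}%")
```

Averaging per-task rates (rather than pooling all episodes) weights every task equally regardless of episode count.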

Competency-Axes Breakdown

Relational Competency

Understanding of inter-object temporal, numerical, and spatial relationships.

Model              Conjunction   Counting   Spatial
π0.5               47.1          52.8       17.2
π0-FAST            28.7          41.0       15.4
π0                 22.5          11.0       2.1
GR00T N1.6         Coming soon
paligemma-binning  Coming soon

Visual Competency

Recognition of visual traits of objects, such as color, semantics, and size.

Model              Color   Semantics   Size
π0.5               15.5    17.8        14.4
π0-FAST            5.9     10.3        7.8
π0                 1.4     3.0         1.7
GR00T N1.6         Coming soon
paligemma-binning  Coming soon

Procedural Competency

Action-oriented reasoning including affordances, reorientation, and stacking objects.

Model              Affordance   Reorientation   Stacking
π0.5               12.5         12.8            28.4
π0-FAST            2.0          2.8             1.7
π0                 0.8          1.1             0.0
GR00T N1.6         Coming soon
paligemma-binning  Coming soon

Language Specificity

As language instructions become more vague, policies struggle. Same scene, same goal: only the language changes.

“Take all the bananas out of the bin” (8.9s)

“Take the bananas out” (20.1s)

“Empty the grey bin” DNF

Bananas Out Of Bin Task

Instruction                                                          π0.5   π0-FAST
“Take all the bananas out of the grey bin and put it on the table”   50     30
“Take the bananas out”                                               40     10
“Empty the grey bin”                                                 10     70

White Mugs In Bin Task

Instruction                            π0.5   π0-FAST
“Put the white mugs in the grey bin”   80     20
“Put the mugs in the bin”              90     10
“Put away mugs”                        0      0

Example Episode Rollouts

Full episode rollouts of π0.5 executing tasks in RoboLab, showing both successes and errors encountered during execution.

Put all plastic bottles away in the bin

Make sure all the white mugs are upright so that the opening is facing upwards

Put the orange measuring cup and the blue measuring cup outside of the plate

Sensitivity Analysis

We use Neural Posterior Estimation (NPE) to quantify which environment parameters most influence policy success. Given an observation \(x\) (task outcome) and environment parameters \(\theta\), we estimate the posterior \(p(\theta \mid x) \propto p(x \mid \theta)\,p(\theta)\). The inferred posterior concentrates on the wrist-camera parameters, indicating that policy performance is most sensitive to the wrist camera.

Sensitivity analysis results showing the effect of environment parameters on policy success
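The Bayes-rule relation above can be illustrated without neural networks. The grid-based sketch below is not the paper's NPE pipeline: the scalar parameter (a wrist-camera offset) and the likelihood shape are invented for exposition.

```python
# Grid-based illustration of p(theta | x) ∝ p(x | theta) p(theta) for a
# single scalar parameter and a binary task outcome. NOT the paper's NPE
# method; the likelihood model below is an assumption for illustration.
import numpy as np

theta_grid = np.linspace(-10.0, 10.0, 201)   # candidate wrist-camera offsets (deg)
prior = np.ones_like(theta_grid)             # flat prior p(theta)
prior /= prior.sum()

def success_prob(theta):
    """Assumed likelihood: success probability decays as the camera drifts."""
    return 0.8 * np.exp(-(theta / 4.0) ** 2)

x = 1  # observed outcome: the task succeeded
likelihood = success_prob(theta_grid) if x else 1.0 - success_prob(theta_grid)
posterior = likelihood * prior               # Bayes rule, up to normalization
posterior /= posterior.sum()

map_theta = theta_grid[np.argmax(posterior)]
print(f"MAP camera offset: {map_theta:.1f} deg")
```

A sharply peaked posterior over a parameter (here, the camera offset) is exactly the signature of high sensitivity: small changes in that parameter strongly change the outcome distribution.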

Leaderboard

Coming Soon

The RoboLab leaderboard and results submission portal are under active development. We will support submitting policy checkpoints and client code for open-source sharing, as well as submission of result files for community validation and reproducibility.

BibTeX

@article{yang2025robolab,
  title     = {RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies},
  author    = {Yang, Xuning and Dagli, Rishit and Zook, Alex and Hadfield, Hugo and Goyal, Ankit and Birchfield, Stan and Ramos, Fabio and Tremblay, Jonathan},
  journal   = {arXiv preprint arXiv:XXXX.XXXXX},
  year      = {2025},
}