RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

Xuning Yang1, Rishit Dagli1,2, Alex Zook1, Hugo Hadfield1, Ankit Goyal1, Stan Birchfield1, Fabio Ramos1,3, Jonathan Tremblay1
1NVIDIA    2University of Toronto    3The University of Sydney

RoboLab is a robot- and policy-agnostic simulation benchmarking platform for evaluating real-world task-generalist robot policies. It evaluates policies trained on real-world data directly in simulation, without co-training on simulation data, in order to measure task generalization. RoboLab is built for fast scene and task generation, with evolving task libraries that prevent benchmark oversaturation.

RoboLab Framework

RoboLab allows users to generate scenes by physically arranging objects in simulation, and to generate tasks by adding a language instruction, mirroring the real-world evaluation setup. Scenes and tasks can be created by hand in minutes, or at scale with agentic AI workflows. Task libraries are robot- and policy-agnostic: users combine task libraries with their own robot and policy to create ready-to-run environments, so the same task library can be reused across different robots and policies.

1 Scene Generation

Physically arrange objects in simulation to build diverse scene libraries, manually or at scale with LLMs.

2 Task Generation

Create tasks by adding language instructions to scenes, manually or at scale with LLMs.

3 Environment Generation

Specify the robot, policy, action/observation configs, scene variations to turn tasks into runnable envs.
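The environment-generation step above can be sketched as a small config object. This is an illustrative sketch only: the names (`EnvConfig`, `make_env`, the robot and checkpoint strings) are assumptions for exposition, not RoboLab's actual API.

```python
# Hypothetical sketch of step 3 (environment generation); all identifiers
# here are illustrative, not RoboLab's real interface.
from dataclasses import dataclass, field

@dataclass
class EnvConfig:
    task_file: str                      # task from the task library (scene + instruction)
    robot: str                          # e.g. "franka_panda" (assumed name)
    policy_checkpoint: str              # path to an off-the-shelf checkpoint
    action_space: str = "joint_position"
    camera_names: tuple = ("base", "wrist")
    scene_variations: dict = field(default_factory=dict)  # lighting, textures, ...

def make_env(cfg: EnvConfig) -> str:
    """Stand-in factory: a real implementation would load the USD scene,
    spawn the robot, and wire up the policy's observation/action configs."""
    return f"env({cfg.task_file}, robot={cfg.robot}, actions={cfg.action_space})"

cfg = EnvConfig(
    task_file="robolab/tasks/demo/banana_in_bowl_task.py",
    robot="franka_panda",
    policy_checkpoint="checkpoints/pi05_droid.pt",
)
print(make_env(cfg))
```

Because the task file carries the scene and instruction while the config carries the robot and policy, swapping `robot` or `policy_checkpoint` reuses the same task library unchanged.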

Example Tasks

Default Put the onion in the wood bowl

Vague Put the vegetable in the wood bowl

Specific Pick up the round onion and place it inside the wooden bowl

Default Pack canned foods into the bin

Vague Pack cans

Specific Pick up the canned food item and place it inside the bin

Default Put all the green fruit on the plate

Vague Plate green fruits

Specific Pick up all the green limes on the table and place them on the flat, beige-colored plate

Default Put the red hammer and black hammer in the left bin

Vague Sort some of the tools

Specific Pick up the red-handled hammer and the black-handled hammer and place both in the left bin

Default Put the small red yogurt in the red bowl

Vague Put the yogurt in the bowl

Specific Pick up the small red yogurt container and place it inside the red bowl

Default Put the white mug in the center of the table

Vague Move the mug to the center

Specific Pick up the white ceramic mug and place it in the center of the table surface away from the other objects

Default Move an orange to the white bowl

Vague Move a orange to the white bowl

Specific Pick up one orange citrus fruit and place it inside the white bowl

Default Stack the blocks in the order from bottom to top: red, blue, green, yellow

Vague Stack in the order of red, blue, green, yellow

Specific Build a tower by placing the red block first, then the blue block on top, then the green, and finally the yellow block on top as a single tower

Default Unstack the Rubik's cube tower

Vague Unstack the tower

Specific Remove each Rubik's cube from the stacked tower one at a time and place them separately on the table

Scene Variations

RoboLab supports controlled perturbations across lighting, camera pose, backgrounds, textures, and shadows to systematically test policy robustness.
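One way to read "controlled perturbations" is sampling one value per variation axis while everything else stays fixed. The sketch below assumes hypothetical parameter names and ranges; RoboLab's real variation schema may differ.

```python
# Illustrative sketch of controlled scene perturbations; the parameter
# names and ranges are assumptions, not RoboLab's actual config schema.
import random

PERTURBATIONS = {
    "light_intensity": (0.5, 1.5),   # scale on nominal lighting
    "camera_yaw_deg":  (-5.0, 5.0),  # small camera-pose jitter
    "texture_id":      (0, 9),       # index into a background/texture set
}

def sample_variation(rng: random.Random) -> dict:
    """Draw one value per perturbation axis so each factor's effect on
    success rate can be isolated across repeated episodes."""
    out = {}
    for name, (lo, hi) in PERTURBATIONS.items():
        if isinstance(lo, int):
            out[name] = rng.randint(lo, hi)   # discrete axis
        else:
            out[name] = rng.uniform(lo, hi)   # continuous axis
    return out

rng = random.Random(0)
variation = sample_variation(rng)
print(variation)
```

Seeding the generator makes a variation sweep reproducible, which matters when comparing policies on identical perturbation sets.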

Agentic Workflows

RoboLab supports code agents for agentic scene and task generation directly from the terminal. Use natural language to describe what you need, and let the agent handle asset placement, constraint satisfaction, and instruction authoring.

Code Agent — /robolab-scenegen
> /robolab-scenegen
agent I'll help you generate a USD scene file. I need a few things:

1. Scene name — What should the file be called? (e.g., fruits_bowl.usda)
2. Scene description — What should be on the table?
3. Number of objects — How many? (default: 3–5)
4. Output directory — Where should I save it?

Objects I'll pick from: 312 objects — containers, kitchenware, fruits, tools, blocks, and more.
> kitchen counter with 3 mugs, 2 bowls, and a banana
agent Resolving placements via predicate solver… collision-free layout found.
Scene saved to assets/scenes/generated/kitchen_counter.usda

Running settle (physics simulation)… done.
Screenshot saved to assets/scenes/generated/_images/kitchen_counter.png
> |
Code Agent — /robolab-taskgen
> /robolab-taskgen
agent I'll help you generate a RoboLab task file. I need a few things:

1. Scene file — A filename or path to the .usda file
2. Task instruction — What should the robot do?
3. Episode length — Time in seconds (e.g., 50 for simple)
4. Output directory — Where should I save it?
> Scene: kitchen_counter.usda. Put the banana in the bowl. 50s.
agent Generated task with 3 instruction variants:
  default "Pick up the banana and place it in the bowl"
  vague   "Put the fruit in the bowl"
  specific "Grasp the yellow banana and place it inside the red bowl"
Task saved to robolab/tasks/demo/banana_in_bowl_task.py
> |

Benchmark

We introduce the RoboLab Benchmark, which evaluates policies across three competency axes, each spanning three difficulty levels.

120 tasks · Average subtasks per task: 2.02 · Average objects per task: 9.0 · Average difficulty score: 2.90

Category Breakdown

Distribution by Difficulty

Comparison to DROID Dataset

Our benchmark emphasizes multi-step and two-step tasks compared to the DROID training distribution. In addition, only 68.7% of benchmark objects appear in the DROID training vocabulary.

Task Length Comparison

Word-Level Object Vocabulary Overlap

Word-level overlap between the DROID vocabulary (2,760 words) and the benchmark vocabulary (99 words): 68 words appear in both, 2,692 are DROID-only, and 31 are benchmark-only. Benchmark coverage: 68.7% · DROID coverage: 2.5%
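The coverage percentages reduce to simple set arithmetic. The word lists below are toy stand-ins, not the real DROID or benchmark vocabularies:

```python
# Toy reconstruction of the word-level vocabulary-overlap metric;
# these word sets are made up for illustration only.
droid_vocab = {"banana", "mug", "bowl", "bin", "plate", "hammer", "block"}
benchmark_vocab = {"banana", "mug", "bowl", "onion", "lime"}

both = droid_vocab & benchmark_vocab
benchmark_coverage = len(both) / len(benchmark_vocab)  # share of benchmark words seen in DROID
droid_coverage = len(both) / len(droid_vocab)          # share of DROID words used by the benchmark

print(f"both={len(both)}, "
      f"benchmark coverage={benchmark_coverage:.1%}, "
      f"DROID coverage={droid_coverage:.1%}")
```

With the paper's real counts (68 shared words, 99 benchmark words, 2,760 DROID words), the same two ratios give 68.7% and 2.5%.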

Results

Overall

Task success rates (%) of SOTA VLA models on RoboLab-120, across various levels of language specificity.

Each task is evaluated over a minimum of 10 episodes to account for environment stochasticity. All policies use off-the-shelf checkpoints fine-tuned on the DROID dataset.

Model              Overall (V / D / S)   Simple (V / D / S)    Moderate (V / D / S)   Complex (V / D / S)
π0.5               17.1 / 25.0 / 25.6    18.4 / 26.7 / 28.8    20.0 / 27.7 / 26.3     5.3 / 12.4 / 11.1
π0-FAST            9.9 / 15.4 / 15.5     11.4 / 21.1 / 20.6    11.8 / 11.3 / 11.8     0.0 / 3.5 / 4.7
π0                 3.8 / 5.3 / 6.3       5.6 / 7.7 / 9.5       2.6 / 3.7 / 3.8        0.0 / 0.0 / 0.0
GR00T N1.6         Coming soon
paligemma-binning  Coming soon

V = Vague, D = Default, S = Specific.
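The evaluation protocol described above (at least 10 episodes per task, success rates averaged over the 120-task benchmark) can be sketched as follows; `run_episode` is a stand-in for an actual simulator rollout, with an invented per-episode success probability:

```python
# Minimal sketch of the evaluation protocol: >= 10 episodes per task,
# success rates averaged over all 120 benchmark tasks.
# run_episode is a placeholder for a real policy rollout in simulation.
import random

def run_episode(task_id: int, rng: random.Random) -> bool:
    """Placeholder rollout; a real one would step the simulator with the policy."""
    return rng.random() < 0.3  # assumed 30% per-episode success, for illustration

def task_success_rate(task_id: int, rng: random.Random, n_episodes: int = 10) -> float:
    assert n_episodes >= 10, "at least 10 episodes per task"
    return sum(run_episode(task_id, rng) for _ in range(n_episodes)) / n_episodes

rng = random.Random(0)
rates = [task_success_rate(t, rng) for t in range(120)]  # 120 benchmark tasks
overall = 100 * sum(rates) / len(rates)
print(f"overall success: {overall:.1f}%")
```

Averaging per-task rates (rather than pooling all episodes) weights every task equally regardless of episode count.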

Competency-Axes Breakdown

Relational Competency

Understanding of inter-object temporal, numerical, and spatial relationships.

Model              Conjunction   Counting   Spatial
π0.5               47.1          52.8       17.2
π0-FAST            28.7          41.0       15.4
π0                 22.5          11.0       2.1
GR00T N1.6         Coming soon
paligemma-binning  Coming soon

Visual Competency

Recognition of visual traits of objects, such as color, semantics, and size.

Model              Color   Semantics   Size
π0.5               15.5    17.8        14.4
π0-FAST            5.9     10.3        7.8
π0                 1.4     3.0         1.7
GR00T N1.6         Coming soon
paligemma-binning  Coming soon

Procedural Competency

Action-oriented reasoning including affordances, reorientation, and stacking objects.

Model              Affordance   Reorientation   Stacking
π0.5               12.5         12.8            28.4
π0-FAST            2.0          2.8             1.7
π0                 0.8          1.1             0.0
GR00T N1.6         Coming soon
paligemma-binning  Coming soon

Language Specificity

As language instructions become more vague, policies struggle. Same scene, same goal: only the language changes.

“Take all the bananas out of the bin” (8.9s)

“Take the bananas out” (20.1s)

“Empty the grey bin” DNF

Bananas Out Of Bin Task

Instruction                                                          π0.5   π0-FAST
“Take all the bananas out of the grey bin and put it on the table”   50     30
“Take the bananas out”                                               40     10
“Empty the grey bin”                                                 10     70

White Mugs In Bin Task

Instruction                            π0.5   π0-FAST
“Put the white mugs in the grey bin”   80     20
“Put the mugs in the bin”              90     10
“Put away mugs”                        0      0

Example Episode Rollouts

Full episode rollouts of π0.5 executing tasks in RoboLab, showing both successes and errors encountered during execution.

Put all plastic bottles away in the bin

Make sure all the white mugs are upright so that the opening is facing upwards

Put the orange measuring cup and the blue measuring cup outside of the plate

Sensitivity Analysis

We use Neural Posterior Estimation (NPE) to quantify which environment parameters most influence policy success. Given an observation \(x\) (task outcome) and environment parameters \(\theta\), we estimate the posterior \(p(\theta \mid x) \propto p(x \mid \theta)\,p(\theta)\). The inferred posterior concentrates on the wrist-camera parameters, indicating that policy performance is most sensitive to the wrist camera.

Sensitivity analysis results showing the effect of environment parameters on policy success
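The Bayes-rule relation above can be illustrated without neural networks. The grid-based sketch below is not the paper's NPE pipeline: the scalar parameter (a wrist-camera offset) and the likelihood shape are invented for exposition.

```python
# Grid-based illustration of p(theta | x) ∝ p(x | theta) p(theta) for a
# single scalar parameter and a binary task outcome. NOT the paper's NPE
# method; the likelihood model below is an assumption for illustration.
import numpy as np

theta_grid = np.linspace(-10.0, 10.0, 201)   # candidate wrist-camera offsets (deg)
prior = np.ones_like(theta_grid)             # flat prior p(theta)
prior /= prior.sum()

def success_prob(theta):
    """Assumed likelihood: success probability decays as the camera drifts."""
    return 0.8 * np.exp(-(theta / 4.0) ** 2)

x = 1  # observed outcome: the task succeeded
likelihood = success_prob(theta_grid) if x else 1.0 - success_prob(theta_grid)
posterior = likelihood * prior               # Bayes rule, up to normalization
posterior /= posterior.sum()

map_theta = theta_grid[np.argmax(posterior)]
print(f"MAP camera offset: {map_theta:.1f} deg")
```

A sharply peaked posterior over a parameter (here, the camera offset) is exactly the signature of high sensitivity: small changes in that parameter strongly change the outcome distribution.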

Leaderboard

Coming Soon

The RoboLab leaderboard and results submission portal are under active development. We will support submitting policy checkpoints and client code for open-source sharing, as well as submission of result files for community validation and reproducibility.

BibTeX

@article{yang2025robolab,
  title     = {RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies},
  author    = {Yang, Xuning and Dagli, Rishit and Zook, Alex and Hadfield, Hugo and Goyal, Ankit and Birchfield, Stan and Ramos, Fabio and Tremblay, Jonathan},
  journal   = {arXiv preprint arXiv:XXXX.XXXXX},
  year      = {2025},
}