Xuning Yang1,
Rishit Dagli1,2,
Alex Zook1,
Hugo Hadfield1,
Ankit Goyal1,
Stan Birchfield1,
Fabio Ramos1,3,
Jonathan Tremblay1
1NVIDIA 2University of Toronto 3The University of Sydney
RoboLab is a robot- and policy-agnostic simulation benchmarking platform for evaluating real-world task-generalist robot policies. RoboLab evaluates policies trained on real-world data directly in simulation, without co-training on simulation data, in order to measure task generalization. It is built for fast scene and task generation, with evolving task libraries that prevent benchmark oversaturation.
RoboLab allows users to generate scenes by physically arranging objects in simulation and to generate tasks by adding a language instruction, mirroring real-world evaluation setups. Scenes and tasks can be created by hand in minutes, or at scale with AI agentic workflows. Task libraries are robot- and policy-agnostic: users combine task libraries with their own robot and policy to create ready-to-run environments, and the same task libraries can be reused across different robots and policies.
Physically arrange objects in simulation to build diverse scene libraries, manually or at scale with LLMs.
Create tasks by adding language instructions to scenes, manually or at scale with LLMs.
Specify the robot, policy, action/observation configs, scene variations to turn tasks into runnable envs.
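The last step above, pairing a task library with a robot, policy, and action/observation config, can be sketched as follows. All names here (`EnvConfig`, `make_envs`, the checkpoint string) are illustrative assumptions, not RoboLab's actual API:

```python
# Hypothetical sketch of turning a task library into runnable environments.
# EnvConfig, make_envs, and all field names are illustrative, not RoboLab's API.
from dataclasses import dataclass, field

@dataclass
class EnvConfig:
    robot: str                      # e.g. a robot embodiment identifier
    policy_checkpoint: str          # path or name of the policy to evaluate
    action_space: str = "joint_position"
    cameras: list = field(default_factory=lambda: ["wrist", "exterior"])
    scene_variations: int = 1       # number of perturbed copies per scene

def make_envs(task_library, cfg):
    """Pair every (scene, instruction) task with one robot/policy config."""
    return [
        {"scene": task["scene"], "instruction": task["instruction"], "config": cfg}
        for task in task_library
    ]

# A task is just a scene plus a language instruction (example from the gallery).
tasks = [{"scene": "kitchen_01.usda", "instruction": "Put the onion in the wood bowl"}]
envs = make_envs(tasks, EnvConfig(robot="franka", policy_checkpoint="pi05_droid"))
print(len(envs))  # 1
```

Because the config is the only robot- or policy-specific piece, swapping robots or policies means swapping one `EnvConfig` while the task library stays untouched.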
Default Put the onion in the wood bowl
Vague Put the vegetable in the wood bowl
Specific Pick up the round onion and place it inside the wooden bowl
Default Pack canned foods into the bin
Vague Pack cans
Specific Pick up the canned food item and place it inside the bin
Default Put all the green fruit on the plate
Vague Plate green fruits
Specific Pick up all the green limes on the table and place them on the flat, beige-colored plate
Default Put the red hammer and black hammer in the left bin
Vague Sort some of the tools
Specific Pick up the red-handled hammer and the black-handled hammer and place both in the left bin
Default Put the small red yogurt in the red bowl
Vague Put the yogurt in the bowl
Specific Pick up the small red yogurt container and place it inside the red bowl
Default Put the white mug in the center of the table
Vague Move the mug to the center
Specific Pick up the white ceramic mug and place it in the center of the table surface away from the other objects
Default Move an orange to the white bowl
Vague Move a orange to the white bowl
Specific Pick up one orange citrus fruit and place it inside the white bowl
Default Stack the blocks in the order from bottom to top: red, blue, green, yellow
Vague Stack in the order of red, blue, green, yellow
Specific Build a tower by placing the red block first, then the blue block on top, then the green, and finally the yellow block on top as a single tower
Default Unstack the Rubik's cube tower
Vague Unstack the tower
Specific Remove each Rubik's cube from the stacked tower one at a time and place them separately on the table
RoboLab supports controlled perturbations across lighting, camera pose, backgrounds, textures, and shadows to systematically test policy robustness.
RoboLab supports code agents for agentic scene and task generation directly from the terminal. Use natural language to describe what you need, and let the agent handle asset placement, constraint satisfaction, and instruction authoring.
We introduce the RoboLab Benchmark which evaluates policies across three competency axes spanning three difficulty levels.
Compared to the DROID training distribution, our benchmark emphasizes two-step and multi-step tasks. In addition, only 68.7% of benchmark objects appear in the DROID training vocabulary.
Task success rates (%) of SOTA VLA models on RoboLab-120, across various levels of language specificity.
Each task is evaluated with a minimum of 10 episodes to account for environment stochasticity. All policies use off-the-shelf checkpoints fine-tuned on the DROID dataset.
| Model | Overall: Vague | Overall: Default | Overall: Specific | Simple: Vague | Simple: Default | Simple: Specific | Moderate: Vague | Moderate: Default | Moderate: Specific | Complex: Vague | Complex: Default | Complex: Specific |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| π0.5 | 17.1 | 25.0 | 25.6 | 18.4 | 26.7 | 28.8 | 20.0 | 27.7 | 26.3 | 5.3 | 12.4 | 11.1 |
| π0-FAST | 9.9 | 15.4 | 15.5 | 11.4 | 21.1 | 20.6 | 11.8 | 11.3 | 11.8 | 0.0 | 3.5 | 4.7 |
| π0 | 3.8 | 5.3 | 6.3 | 5.6 | 7.7 | 9.5 | 2.6 | 3.7 | 3.8 | 0.0 | 0.0 | 0.0 |
| GR00T N1.6 | Coming soon | | | | | | | | | | | |
| paligemma-binning | Coming soon | | | | | | | | | | | |
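The evaluation protocol (at least 10 episodes per task, success rate reported as the episode mean) can be sketched in a few lines. The function names and the stand-in stochastic rollout below are illustrative, not RoboLab's actual harness:

```python
# Minimal sketch of the per-task evaluation protocol: each task runs at
# least 10 episodes and success rate (%) is the mean over episodes.
# evaluate_task and the stand-in rollout are hypothetical, not RoboLab's API.
import random

MIN_EPISODES = 10

def evaluate_task(run_episode, n_episodes=MIN_EPISODES):
    """Return success rate (%) over n_episodes rollouts of one task."""
    assert n_episodes >= MIN_EPISODES, "need >= 10 episodes for stochasticity"
    successes = sum(run_episode() for _ in range(n_episodes))
    return 100.0 * successes / n_episodes

# Stand-in policy rollout: succeeds roughly 25% of the time, mimicking
# the environment stochasticity the minimum-episode count averages over.
random.seed(0)
rate = evaluate_task(lambda: random.random() < 0.25)
```

With only 10 episodes, rates are quantized to multiples of 10%, which is why repeated episodes per task matter before comparing policies.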
Understanding of inter-object temporal, numerical, and spatial relationships.
| Model | Conjunction | Counting | Spatial |
|---|---|---|---|
| π0.5 | 47.1 | 52.8 | 17.2 |
| π0-FAST | 28.7 | 41.0 | 15.4 |
| π0 | 22.5 | 11.0 | 2.1 |
| GR00T N1.6 | Coming soon | | |
| paligemma-binning | Coming soon | | |
Recognition of visual traits of objects, such as color, semantics, and size.
| Model | Color | Semantics | Size |
|---|---|---|---|
| π0.5 | 15.5 | 17.8 | 14.4 |
| π0-FAST | 5.9 | 10.3 | 7.8 |
| π0 | 1.4 | 3.0 | 1.7 |
| GR00T N1.6 | Coming soon | | |
| paligemma-binning | Coming soon | | |
Action-oriented reasoning including affordances, reorientation, and stacking objects.
| Model | Affordance | Reorientation | Stacking |
|---|---|---|---|
| π0.5 | 12.5 | 12.8 | 28.4 |
| π0-FAST | 2.0 | 2.8 | 1.7 |
| π0 | 0.8 | 1.1 | 0.0 |
| GR00T N1.6 | Coming soon | | |
| paligemma-binning | Coming soon | | |
As language instructions become more vague, policies struggle. Same scene, same goal: only the language changes.
“Take all the bananas out of the bin” (8.9s)
“Take the bananas out” (20.1s)
“Empty the grey bin” (DNF)
| Instruction | π0.5 (%) | π0-FAST (%) |
|---|---|---|
| “Take all the bananas out of the grey bin and put it on the table” | 50 | 30 |
| “Take the bananas out” | 40 | 10 |
| “Empty the grey bin” | 10 | 70 |
| Instruction | π0.5 (%) | π0-FAST (%) |
|---|---|---|
| “Put the white mugs in the grey bin” | 80 | 20 |
| “Put the mugs in the bin” | 90 | 10 |
| “Put away mugs” | 0 | 0 |
Full episode rollouts of π0.5 executing tasks in RoboLab, showing both successes and errors encountered during execution.
Put all plastic bottles away in the bin
Make sure all the white mugs are upright so that the opening is facing upwards
Put the orange measuring cup and the blue measuring cup outside of the plate
VLAs frequently grasp visually similar or geometrically biased objects instead of the language-specified target. Numbers in parentheses indicate total wrong-grasp occurrences across 10 episodes.
“Put the limes on the plate”
Policies confuse visually similar green/round fruit.
“Pack boxed foods into the bin” (3 boxes/cans)
Geometric shape bias: cylindrical objects (cans) override language-specified rectangular targets (boxes).
“Move an orange to the white bowl”
The pumpkin's similar color and round shape dominate over correct orange identification.
“Take the measuring spoons out”
Semantic confusion between “measuring spoons” and “measuring cups”; bowl proximity dominates.
We use Neural Posterior Estimation (NPE) to quantify which environment parameters most influence policy success. Given an observation \(x\) (task outcome) and environment parameters \(\theta\), we estimate the posterior \(p(\theta \mid x) \propto p(x \mid \theta)\,p(\theta)\). The resulting posterior concentrates sharply along the wrist-camera parameters, indicating that policy performance is highly sensitive to the wrist camera.
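The intuition behind this analysis can be illustrated with a simplified rejection-sampling stand-in for NPE: draw environment parameters from the prior, simulate task outcomes, and keep the parameters that produced the observed outcome. The parameter choice, prior, and toy simulator below are assumptions for illustration only; actual NPE fits a neural density estimator to \((\theta, x)\) pairs rather than rejecting samples:

```python
# Toy rejection-sampling stand-in for NPE sensitivity analysis.
# theta, the prior, and the simulator are hypothetical, not RoboLab's setup.
import random

random.seed(0)

def sample_prior():
    # Hypothetical environment parameter: wrist-camera pose offset (radians).
    return random.uniform(-0.3, 0.3)

def simulate_success(theta):
    # Toy simulator: success probability drops as the wrist camera drifts.
    p = max(0.05, 0.8 - 2.0 * abs(theta))
    return random.random() < p

# Approximate p(theta | x = success) by keeping prior draws whose rollout
# succeeded (rejection sampling, the simplest posterior approximation).
prior_draws = [sample_prior() for _ in range(50_000)]
posterior = [t for t in prior_draws if simulate_success(t)]

mean_abs_prior = sum(abs(t) for t in prior_draws) / len(prior_draws)
mean_abs_post = sum(abs(t) for t in posterior) / len(posterior)

# The posterior concentrates near a well-aligned wrist camera: successful
# episodes imply small offsets, i.e. performance is sensitive to this parameter.
print(mean_abs_post < mean_abs_prior)  # True
```

A parameter the policy is insensitive to would leave the posterior close to the prior; a sharp shift, as with the wrist camera here, flags high sensitivity.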
@article{yang2025robolab,
title = {RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies},
author = {Yang, Xuning and Dagli, Rishit and Zook, Alex and Hadfield, Hugo and Goyal, Ankit and Birchfield, Stan and Ramos, Fabio and Tremblay, Jonathan},
journal = {arXiv preprint arXiv:XXXX.XXXXX},
year = {2025},
}