RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

RoboLab-120 Overall

Performance of open-source task generalist models across three levels of language specificity.

All models were fine-tuned only on the real-world DROID dataset.

#	Policy	Type	N	SR%	Score	EE Speed	EE SPARC
1	Cosmos3-Nano-Policy	WAM	441 / 1200	36.8%	51.9	7.13 cm/s	−5.99
2	π_0.5	VLA	336 / 1200	28.0%	43.4	5.35 cm/s	−8.34
3	DreamZero	WAM	308 / 1200	25.7%	39.8	3.64 cm/s	−6.41
4	Cosmos3-Edge-Policy	WAM	275 / 1200	22.9%	—	7.04 cm/s	−5.46
5	π₀-FAST	VLA	186 / 1200	15.5%	26.9	4.60 cm/s	−9.63
6	GR00T N1.6	VLA	87 / 1200	7.2%	17.1	4.30 cm/s	−6.87
7	π₀	VLA	60 / 1200	5.0%	12.2	4.18 cm/s	−9.49
8	paligemma-binning	VLA	41 / 1200	3.4%	9.9	1.93 cm/s	−16.52
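
The SR% column appears to be the success count divided by the episode count in the N column. The sketch below recomputes it from the counts in the table above, assuming that definition; the function name `success_rate` is illustrative and not part of any RoboLab tooling.

```python
# Success counts copied from the overall table (successes out of 1200 episodes).
EPISODES = 1200
successes = {
    "Cosmos3-Nano-Policy": 441,
    "π_0.5": 336,
    "DreamZero": 308,
    "Cosmos3-Edge-Policy": 275,
    "π₀-FAST": 186,
    "GR00T N1.6": 87,
    "π₀": 60,
    "paligemma-binning": 41,
}


def success_rate(n_success: int, n_episodes: int) -> float:
    """Return the success rate as a percentage (assumed definition of SR%)."""
    return 100.0 * n_success / n_episodes


rates = {name: success_rate(n, EPISODES) for name, n in successes.items()}
for name, rate in rates.items():
    print(f"{name:22s} {rate:6.2f}%")
```

The Score, EE Speed, and EE SPARC columns cannot be recomputed this way, since the page does not give their underlying per-episode data.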

Results by Task Difficulty

Difficulty reflects the number of subtasks and reasoning complexity per task, grouped into 3 difficulty levels.

		Simple			Moderate			Complex
#	Policy	N	SR%	Score	N	SR%	Score	N	SR%	Score
1	Cosmos3-Nano-Policy	260 / 640	40.6%	51.0	138 / 390	35.4%	52.3	43 / 170	25.3%	54.4
2	π_0.5	190 / 640	29.7%	38.9	123 / 390	31.5%	50.5	23 / 170	13.5%	44.1
3	DreamZero	167 / 640	26.1%	35.5	117 / 390	30.0%	48.4	24 / 170	14.1%	35.9
4	Cosmos3-Edge-Policy	164 / 640	25.6%	—	91 / 390	23.3%	—	20 / 170	11.8%	—
5	π₀-FAST	129 / 640	20.2%	26.7	52 / 390	13.3%	29.2	5 / 170	2.9%	21.9
6	GR00T N1.6	56 / 640	8.8%	13.6	31 / 390	7.9%	25.1	0 / 170	0.0%	11.9
7	π₀	46 / 640	7.2%	9.6	14 / 390	3.6%	18.1	0 / 170	0.0%	8.7
8	paligemma-binning	22 / 640	3.4%	6.2	19 / 390	4.9%	16.2	0 / 170	0.0%	9.0
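
As a consistency check, the three difficulty buckets (640, 390, and 170 episodes) add up to the 1200 episodes of the overall table, so each policy's per-bucket successes should sum to its overall count. A minimal sketch of that check, using only numbers copied from the two tables; the variable names are illustrative:

```python
# Episodes per difficulty bucket: (simple, moderate, complex).
bucket_sizes = (640, 390, 170)

# Per-bucket success counts copied from the difficulty table.
by_difficulty = {
    "Cosmos3-Nano-Policy": (260, 138, 43),
    "π_0.5": (190, 123, 23),
    "DreamZero": (167, 117, 24),
    "Cosmos3-Edge-Policy": (164, 91, 20),
    "π₀-FAST": (129, 52, 5),
    "GR00T N1.6": (56, 31, 0),
    "π₀": (46, 14, 0),
    "paligemma-binning": (22, 19, 0),
}

# Overall success counts copied from the RoboLab-120 Overall table.
overall = {
    "Cosmos3-Nano-Policy": 441,
    "π_0.5": 336,
    "DreamZero": 308,
    "Cosmos3-Edge-Policy": 275,
    "π₀-FAST": 186,
    "GR00T N1.6": 87,
    "π₀": 60,
    "paligemma-binning": 41,
}

total_episodes = sum(bucket_sizes)

# Policies whose bucket counts do not add up to the overall count.
mismatches = [
    name for name, counts in by_difficulty.items()
    if sum(counts) != overall[name]
]
print(total_episodes, mismatches)
```

The same check does not apply to the competency axes, whose denominators (420, 840, and 220) sum to more than 1200.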

Results by Competency Axes

Performance across competency axes: Visual (color, semantics, size), Procedural (action-oriented reasoning across affordances, reorientation, and stacking), and Relational (interpreting conjunctions, counting, and spatial relationships in multi-object instructions).

		Relational			Visual			Procedural
#	Policy	N	SR%	Score	N	SR%	Score	N	SR%	Score
1	Cosmos3-Nano-Policy	207 / 420	49.3%	60.9	219 / 840	26.1%	42.9	60 / 220	27.3%	46.8
2	π_0.5	142 / 420	33.8%	46.8	200 / 840	23.8%	38.8	48 / 220	21.8%	41.6
3	DreamZero	140 / 420	33.3%	45.5	147 / 840	17.5%	31.9	58 / 220	26.4%	42.3
4	Cosmos3-Edge-Policy	112 / 420	26.7%	—	141 / 840	16.8%	—	33 / 220	15.0%	—
5	π₀-FAST	98 / 420	23.3%	36.7	74 / 840	8.8%	19.5	14 / 220	6.4%	16.4
6	GR00T N1.6	42 / 420	10.0%	23.2	42 / 840	5.0%	15.1	6 / 220	2.7%	11.1
7	π₀	27 / 420	6.4%	18.8	16 / 840	1.9%	8.2	1 / 220	0.5%	6.6
8	paligemma-binning	24 / 420	5.7%	15.6	23 / 840	2.7%	9.3	1 / 220	0.5%	4.0

Sim-Real Correlation

Benchmark-level Correlation with Real World Experiments

Performance rankings correlate strongly with Elo scores from RoboArena, an open-source real-world benchmark for generalist robot policies. At the benchmark level, policies therefore rank in RoboLab much as they do in the real world.

Spearman ρ —
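
For readers unfamiliar with the statistic, Spearman's ρ is the Pearson correlation computed on ranks rather than raw values. The sketch below is a plain-Python illustration of that definition, not the code used to produce the value on this page. The function names and the four-element toy inputs are hypothetical and do not correspond to any policy or to real RoboArena Elo scores.

```python
from statistics import mean


def average_ranks(values):
    """Rank values 1..n (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks of x and y.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy inputs only: placeholder simulation scores and placeholder Elo values.
toy_sim = [0.9, 0.7, 0.5, 0.3]
toy_elo = [1500, 1400, 1450, 1200]
rho = spearman_rho(toy_sim, toy_elo)
print(round(rho, 3))
```

Because only ranks enter the calculation, ρ is unaffected by the different scales of simulation success rates and Elo scores, which suits a benchmark-level ranking comparison.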

Adding Your Model

If you have a model that you'd like to see featured, fill out the form below:

Submit Your Model