RoboLab Results


RoboLab-120 Overall

Performance of open-source task-generalist models across three levels of language specificity.

Each task is evaluated over 10 episodes. All models were fine-tuned only on the real-world DROID dataset. End-effector (EE) speed does not account for inference latency.
N = successful episodes / total episodes; SR% = binary success rate; Score = subtask-progress score (×100).

#  Policy               Type  N           SR%    Score  EE Speed   EE SPARC
1  Cosmos3-Nano-Policy  WAM   442 / 1200  36.8%  51.9   7.13 cm/s  −5.99
2  π0.5                 VLA   336 / 1200  28.0%  43.4   5.35 cm/s  −8.34
3  DreamZero            WAM   308 / 1200  25.7%  39.8   3.64 cm/s  −6.41
4  π0-FAST              VLA   186 / 1200  15.5%  26.9   4.60 cm/s  −9.63
5  GR00T N1.6           VLA   86 / 1200   7.2%   17.1   4.30 cm/s  −6.87
6  π0                   VLA   60 / 1200   5.0%   12.2   4.18 cm/s  −9.49
7  paligemma-binning    VLA   41 / 1200   3.4%   9.9    1.93 cm/s  −16.52
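
As a quick sanity check on the table above, the SR% column can be reproduced from the N column. The sketch below is illustrative only: the helper name `success_rate` is hypothetical and not part of any RoboLab tooling, and it assumes SR% is simply successes divided by episodes, rounded to one decimal place. The 1200 episodes per policy are consistent with 120 tasks at 10 episodes each.

```python
def success_rate(successes: int, episodes: int) -> float:
    """Binary success rate in percent, rounded to one decimal place."""
    if episodes <= 0:
        raise ValueError("episodes must be positive")
    return round(100.0 * successes / episodes, 1)


# (successes, episodes) pairs copied from the N column of the overall table.
overall = {
    "Cosmos3-Nano-Policy": (442, 1200),
    "π0.5": (336, 1200),
    "DreamZero": (308, 1200),
    "π0-FAST": (186, 1200),
    "GR00T N1.6": (86, 1200),
    "π0": (60, 1200),
    "paligemma-binning": (41, 1200),
}

for name, (successes, episodes) in overall.items():
    print(f"{name}: {success_rate(successes, episodes)}%")
```

The Score column cannot be recomputed this way, since it depends on per-episode subtask progress that the table does not list.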

Results by Task Difficulty

Difficulty reflects the number of subtasks and reasoning complexity per task, grouped into 3 difficulty levels.

                        Simple                   Moderate                 Complex
#  Policy               N          SR%    Score  N          SR%    Score  N          SR%    Score
1  Cosmos3-Nano-Policy  260 / 640  40.6%  51.0   138 / 390  35.4%  52.3   43 / 170   25.3%  54.4
2  π0.5                 190 / 640  29.7%  38.9   123 / 390  31.5%  50.5   23 / 170   13.5%  44.1
3  DreamZero            167 / 640  26.1%  35.5   117 / 390  30.0%  48.4   24 / 170   14.1%  35.9
4  π0-FAST              129 / 640  20.2%  26.7   52 / 390   13.3%  29.2   5 / 170    2.9%   21.9
5  GR00T N1.6           56 / 640   8.8%   13.6   31 / 390   7.9%   25.1   0 / 170    0.0%   11.9
6  π0                   46 / 640   7.2%   9.6    14 / 390   3.6%   18.1   0 / 170    0.0%   8.7
7  paligemma-binning    22 / 640   3.4%   6.2    19 / 390   4.9%   16.2   0 / 170    0.0%   9.0

Results by Competency Axes

Performance across competency axes: Visual (color, semantics, size), Procedural (action-oriented reasoning across affordances, reorientation, and stacking), and Relational (interpreting conjunctions, counting, and spatial relationships in multi-object instructions).

                        Relational               Visual                   Procedural
#  Policy               N          SR%    Score  N          SR%    Score  N          SR%    Score
1  Cosmos3-Nano-Policy  207 / 420  49.3%  60.9   219 / 840  26.1%  42.9   60 / 220   27.3%  46.8
2  π0.5                 142 / 420  33.8%  46.8   200 / 840  23.8%  38.8   48 / 220   21.8%  41.6
3  DreamZero            140 / 420  33.3%  45.5   146 / 840  17.4%  31.9   58 / 220   26.4%  42.3
4  π0-FAST              98 / 420   23.3%  36.7   74 / 840   8.8%   19.5   14 / 220   6.4%   16.4
5  π0                   27 / 420   6.4%   18.8   16 / 840   1.9%   8.2    1 / 220    0.5%   6.6
6  GR00T N1.6           42 / 420   10.0%  23.2   42 / 840   5.0%   15.1   6 / 220    2.7%   11.1
7  paligemma-binning    24 / 420   5.7%   15.6   23 / 840   2.7%   9.3    1 / 220    0.5%   4.0

Sim-Real Correlation

Benchmark-level Correlation with Real World Experiments

Performance rankings correlate strongly with Elo scores from RoboArena, an open-source real-world benchmark for generalist robot policies. In other words, at the benchmark level, policies rank in our benchmark as they do in the real world.

[Figure: Spearman ρ between RoboLab performance and RoboArena Elo scores]
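
For readers who want to run the same kind of check on their own policies, Spearman's ρ is the Pearson correlation computed on ranks rather than raw values, so it measures only whether two benchmarks order the policies the same way. The following is a minimal sketch in pure Python, not the authors' analysis code. The function names are hypothetical, and the numbers in the usage example are toy values chosen for illustration, not RoboLab or RoboArena results.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Toy example: four hypothetical policies, one adjacent pair swapped.
sim_scores = [1.0, 2.0, 3.0, 4.0]
real_scores = [1.0, 3.0, 2.0, 4.0]
print(spearman_rho(sim_scores, real_scores))
```

With only a handful of policies, a single swapped pair moves ρ noticeably, so a value computed on seven entries should be read as a coarse indicator of agreement.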

Adding Your Model

We are actively building a leaderboard to which the community can submit open-source policies.
In the meantime, if you have a model that you'd like to see featured, get in touch!