Performance of open-source task-generalist models across three levels of language specificity.
Each task is run for 10 episodes. All models were fine-tuned only on the real-world DROID dataset. End-effector (EE) speed does not account for inference latency.
N = successful episodes / total episodes; SR% = binary success rate; Score = subtask-progress score (×100).
| # | Policy | Type | N | SR% | Score | EE Speed | EE SPARC |
|---|---|---|---|---|---|---|---|
| 1 | Cosmos3-Nano-Policy | WAM | 442 / 1200 | 36.8% | 51.9 | 7.13 cm/s | −5.99 |
| 2 | π0.5 | VLA | 336 / 1200 | 28.0% | 43.4 | 5.35 cm/s | −8.34 |
| 3 | DreamZero | WAM | 308 / 1200 | 25.7% | 39.8 | 3.64 cm/s | −6.41 |
| 4 | π0-FAST | VLA | 186 / 1200 | 15.5% | 26.9 | 4.60 cm/s | −9.63 |
| 5 | GR00T N1.6 | VLA | 86 / 1200 | 7.2% | 17.1 | 4.30 cm/s | −6.87 |
| 6 | π0 | VLA | 60 / 1200 | 5.0% | 12.2 | 4.18 cm/s | −9.49 |
| 7 | paligemma-binning | VLA | 41 / 1200 | 3.4% | 9.9 | 1.93 cm/s | −16.52 |
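The SR% column can be recomputed from the N column (successful episodes over total episodes). The following is a minimal Python sketch of that arithmetic; the counts are copied from the table above, while the names `RESULTS` and `success_rate_pct` are illustrative and not part of any released tooling.

```python
# Success counts and episode totals copied from the N column of the table above.
# The dictionary and function names are illustrative, not from the benchmark code.
RESULTS = {
    "Cosmos3-Nano-Policy": (442, 1200),
    "π0.5": (336, 1200),
    "DreamZero": (308, 1200),
    "π0-FAST": (186, 1200),
    "GR00T N1.6": (86, 1200),
    "π0": (60, 1200),
    "paligemma-binning": (41, 1200),
}


def success_rate_pct(successes: int, episodes: int) -> float:
    """Return the binary success rate as a percentage, rounded to one decimal."""
    if episodes <= 0:
        raise ValueError("episodes must be positive")
    if not 0 <= successes <= episodes:
        raise ValueError("successes must lie between 0 and episodes")
    return round(100.0 * successes / episodes, 1)


if __name__ == "__main__":
    for policy, (successes, episodes) in RESULTS.items():
        print(f"{policy}: {success_rate_pct(successes, episodes):.1f}%")
```

The same function applies to the per-difficulty and per-axis tables, since each N cell there is also a successes / episodes pair.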
Tasks are grouped into three difficulty levels based on the number of subtasks and the reasoning complexity of each task.
| # | Policy | Simple N | Simple SR% | Simple Score | Moderate N | Moderate SR% | Moderate Score | Complex N | Complex SR% | Complex Score |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Cosmos3-Nano-Policy | 260 / 640 | 40.6% | 51.0 | 138 / 390 | 35.4% | 52.3 | 43 / 170 | 25.3% | 54.4 |
| 2 | π0.5 | 190 / 640 | 29.7% | 38.9 | 123 / 390 | 31.5% | 50.5 | 23 / 170 | 13.5% | 44.1 |
| 3 | DreamZero | 167 / 640 | 26.1% | 35.5 | 117 / 390 | 30.0% | 48.4 | 24 / 170 | 14.1% | 35.9 |
| 4 | π0-FAST | 129 / 640 | 20.2% | 26.7 | 52 / 390 | 13.3% | 29.2 | 5 / 170 | 2.9% | 21.9 |
| 5 | GR00T N1.6 | 56 / 640 | 8.8% | 13.6 | 31 / 390 | 7.9% | 25.1 | 0 / 170 | 0.0% | 11.9 |
| 6 | π0 | 46 / 640 | 7.2% | 9.6 | 14 / 390 | 3.6% | 18.1 | 0 / 170 | 0.0% | 8.7 |
| 7 | paligemma-binning | 22 / 640 | 3.4% | 6.2 | 19 / 390 | 4.9% | 16.2 | 0 / 170 | 0.0% | 9.0 |
Performance across competency axes: Visual (color, semantics, size), Procedural (action-oriented reasoning across affordances, reorientation, and stacking), and Relational (interpreting conjunctions, counting, and spatial relationships in multi-object instructions).
| # | Policy | Relational N | Relational SR% | Relational Score | Visual N | Visual SR% | Visual Score | Procedural N | Procedural SR% | Procedural Score |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Cosmos3-Nano-Policy | 207 / 420 | 49.3% | 60.9 | 219 / 840 | 26.1% | 42.9 | 60 / 220 | 27.3% | 46.8 |
| 2 | π0.5 | 142 / 420 | 33.8% | 46.8 | 200 / 840 | 23.8% | 38.8 | 48 / 220 | 21.8% | 41.6 |
| 3 | DreamZero | 140 / 420 | 33.3% | 45.5 | 146 / 840 | 17.4% | 31.9 | 58 / 220 | 26.4% | 42.3 |
| 4 | π0-FAST | 98 / 420 | 23.3% | 36.7 | 74 / 840 | 8.8% | 19.5 | 14 / 220 | 6.4% | 16.4 |
| 5 | GR00T N1.6 | 42 / 420 | 10.0% | 23.2 | 42 / 840 | 5.0% | 15.1 | 6 / 220 | 2.7% | 11.1 |
| 6 | π0 | 27 / 420 | 6.4% | 18.8 | 16 / 840 | 1.9% | 8.2 | 1 / 220 | 0.5% | 6.6 |
| 7 | paligemma-binning | 24 / 420 | 5.7% | 15.6 | 23 / 840 | 2.7% | 9.3 | 1 / 220 | 0.5% | 4.0 |
Performance rankings correlate strongly with Elo scores from RoboArena, an open-source real-world benchmark for generalist robot policies. At the benchmark level, this means that policies rank in our benchmark much as they do in the real world.
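One way to quantify agreement between two rankings is Spearman's rank correlation. The text above does not say which statistic was used, so Spearman is an assumption here, and no RoboArena Elo values are reproduced. As a stand-in, this sketch compares the SR% and Score columns from the overall results table; to check the claim against RoboArena, one of the two lists would be replaced with Elo scores for the same policies in the same order. The function names are illustrative.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 (largest) to n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


# SR% and Score columns from the overall results table, in table order.
sr_pct = [36.8, 28.0, 25.7, 15.5, 7.2, 5.0, 3.4]
score = [51.9, 43.4, 39.8, 26.9, 17.1, 12.2, 9.9]
rho = spearman(sr_pct, score)

if __name__ == "__main__":
    print(f"Spearman rho between SR% and Score: {rho:.3f}")
```

Rank correlation only compares ordering, so it is insensitive to the different scales of success rates and Elo scores.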
We are actively building a leaderboard where the community can submit open-source policies.
In the meantime, if you have a model that you'd like to see featured, get in touch!