Vesta
A Generalist Embodied Reasoning Model

June 2026
Vesta interleaves acting, thinking, and tool calls to complete a long-horizon mail-sorting task

Figure 1. Vesta interleaves acting, thinking, and tool use to solve a long-horizon task: it reads a letter's address, searches the web to determine the destination is international, and places the mail in the correct mailbox.

Abstract

Robots operating in open-world environments must seamlessly integrate localization, spatial reasoning, navigation, and long-horizon planning. While specialist models excel at individual tasks, deploying a multi-model stack is computationally expensive and prone to cascading errors.

We present Vesta, a unified embodied generalist that consolidates these capabilities into a single foundation model. Our approach combines a diverse and massive curated corpus designed to induce spatial grounding with a simple multimodal memory harness that enables reasoning over extended time horizons. Across diverse benchmarks, Vesta on average beats individual SOTA baselines by >20% and beats an ensemble of per-category-best baselines by >10% — demonstrating that a generalist model can match or exceed specialists. On real-world robotic tasks requiring memory and reasoning, Vesta improves task success by >35%. Our work demonstrates that a single generalist is a feasible, scalable, and arguably preferable alternative to combining specialists.

  • +20% over the strongest single baseline (avg. across capabilities)
  • +10% over an oracle ensemble of per-category best baselines
  • +38.3% real-robot success on memory-heavy tasks vs. actor-only
  • 4-in-1 localization, navigation, reasoning & planning unified
Vesta overview

Figure 2. Vesta unifies localization, navigation, embodied reasoning, and action planning into a single generalist model. It scores over 20 points above the average prior baseline and >10 points above the strongest baseline in each individual category. On real robots, it improves success by 38.3% on memory-heavy tasks.

Overview

Modern embodied stacks decouple high-level reasoning from low-level execution: a planner Vision-Language Model (VLM) generates instructions that a specialized action model executes. The academic literature typically develops these planner capabilities in silos — navigation, memory, and reasoning specialists each tuned for their own benchmark. Such modularity introduces latency, complicates the inference stack, and is prone to cascading failures.

Vesta instead posits that these capabilities can — and should — be unified into a single generalist planner, built on three techniques: (1) a curated SFT corpus covering grounding, navigation, embodied reasoning, and real-robot data; (2) a simple multimodal memory harness interleaving history frames with a running textual cache of past subtasks; and (3) empirical validation on a real bimanual robot platform.

Vesta concept

Figure 3. Vesta supports multimodal inputs — multi-view images/videos, language instructions, and episodic memory — alongside hierarchical control, unifying localization, navigation, embodied question-answering, and real-world planning in one model.

Method

Vesta is fine-tuned from the Qwen3-VL-8B base model. Its supervised fine-tuning (SFT) strategy builds base capabilities across four axes, with significant effort invested in data curation.

  • 🎯 Localization — Grounding and pointing to predict contact/manipulation points. Built on large-scale detection data (Objects365, COCO, LVIS) plus an embodied tail with egocentric, manipulation-centric annotations. All boxes & points are decoded as text tokens.
  • 🧭 Navigation — R2R-style Vision-and-Language Navigation. At each high-level step the planner emits a pixel goal, turn sequence, or stop; low-level motion is handled by a navigation backend. Trained on R2R, RxR, and ScaleVLN in simulation.
  • 🧠 Embodied Reasoning — Extends spatial localization to action-conditioned scene understanding — affordance & placement prediction, trajectory generation, and task-progress estimation — for unified what, where, how, and when reasoning.
  • 🤖 Action Planning + Memory — Predicts the next subtask in text from egocentric video. A 4-phase output (Observation → Progress → Reasoning → Action) is written to an explicit memory harness that re-injects a curated history of frames and past subtasks.

The SFT mix spans six categories and is intentionally biased toward spatially grounded capabilities — Spatial Intelligence (27.1%), Navigation (21.8%), and Grounding (20.8%) form the bulk, with General VLM (16.2%), Embodied Reasoning (9.8%), and Real Robots (4.3%) rounding it out. The model is trained for 1 epoch over the full mixture with a learning rate of 1e-5 and weight decay 0.01, on 128 H100 GPUs with a batch size of 256.

Demo

Manipulation. We show Vesta's agentic planning capability by combining it with a GR00T N1.6 policy model, and conduct experiments on the YAM robot arms. We first show how the model puts a specific number of fruits into the picnic bowl and then closes it. GR00T N1.6 alone cannot complete this task as it does not have any memory. Secondly we show how the model packs candy into a box, closes the box, and then places it into the matching tray. At last we demonstrate how the model can make agentic tool calls to determine if a letter should be sorted as domestic or international.

Manipulation Demo Videos

Navigation. We further show a few snippets from navigation trajectories collected inside a humanoid G1 simulator. The Vesta model navigates by providing 2D way points which the low-level SONIC-based controller translates into robot movements.

Navigation Demo Videos

Results

Across embodied cognition, localization, action planning, and navigation, a single Vesta checkpoint matches or beats SOTA specialists of the same size. Vesta's column is highlighted; bold marks the best result in each row.

CognitionVestaRynnBrainRoboBrain 2.5Qwen3-VL
Open-X VQA89.374.052.959.8
SAT81.370.067.365.3
VSI-Bench64.571.042.960.3
MMSI-Bench40.839.629.430.8
ERQA44.946.844.044.8
MindCube-Tiny80.956.629.236.0
CV-Bench88.187.787.686.2
PAI-U57.956.655.057.9
EgoTaskQA81.972.585.057.8
RoboSpatial57.873.173.058.2
Average68.764.856.655.7

Table 1 — Embodied cognition (8B models). Vesta achieves the highest average, leading the majority of individual benchmarks.

LocalizationVestaRynnBrainRoboBrain 2.5Qwen3-VL
CrossPoint76.044.375.428.7
EmbSpatial81.979.375.878.5
Where2Place68.366.966.064.7
RefSpatial59.959.260.553.4
PointBench63.259.769.161.4
Average69.961.969.457.3

Table 2 — Localization (8B models). Vesta achieves the highest average across localization benchmarks.

ModelCDPFSPFSRSDiverseAvg.
RoboBrain-2.5-8B35.381.615.938.333.027.038.5
Qwen3-VL-8B36.767.818.122.130.226.733.6
RynnBrain-8B38.769.516.018.432.426.033.5
Vesta74.491.064.080.382.360.575.4

Table 3 — Real-world action planning. Zero-shot planning on AgiBot (CD = Clear Desk, PF = Place Fruit, SP = Sort Parts, FS = Fold Shirts, RS = Refill Shelf) and an internal Egocentric Human-Hand suite of diverse tasks. Vesta outperforms all baselines by a wide margin.

ModelSR ↑NE ↓OS ↑SPL ↑
RynnBrain-8B0.08.860.00.0
RoboBrain-2.5-8B0.09.030.00.0
Qwen3-VL-8B0.08.830.00.0
UniNaVid47.05.5853.342.7
InternVLA-N1-8B (specialist)55.44.8960.652.1
Vesta55.55.1661.450.8

Table 4 — Navigation in R2R-CE. On the R2R val-unseen split, Vesta ties the SOTA navigation specialist InternVLA-N1 (leading on SR and OS) while every generalist baseline fails out-of-domain. ↑ higher is better, ↓ lower is better.

Real-Robot Evaluation

Vesta is deployed as the planner on tabletop bimanual YAM grippers across three reasoning- and memory-heavy tasks — Find Object, Count Fruits, and Memorize Candy — using GR00T N1.6 as the low-level actor. Using Vesta as the planner improves average success by 38.3% over the actor-only baseline and 25% over a Qwen3-VL planner, with statistical significance over 4σ.

Real robot tasks

Figure 4. Real robot tasks on a bimanual platform: count fruits, locate object, and memorize candy. Each requires the planner to track progress and recall past observations across a long horizon.

Real robot evaluation

Figure 5. Across all tasks, Vesta-as-planner significantly beats the actor-only and Qwen3-VL planner baselines, lifting average success rate by 38.3% over actor-only.

Acknowledgements

We thank Cristaldo Campos, Curie Park, Yuhe Zhang, and the AutoModel team for their valuable support and contributions.

BibTeX

@techreport{bjorck2026vesta,
title   = {Vesta: A Generalist Embodied Reasoning Model},
author  = {Bjorck, Johan and Li, Zhiqi and Man, Yunze and Wang, Jing
           and Cheng, An-Chieh and Liu, Sifei and Wang, Shihao and Yu, Zhiding
           and Fan, Linxi and Zhu, Yuke and Kautz, Jan and others},
institution = {NVIDIA},
year    = {2026},
type    = {Technical Report}
}
Authors

Johan Bjorck*, Zhiqi Li*, Yunze Man*1, Jing Wang*, An-Chieh Cheng†,2, Sifei Liu, Shihao Wang†,1,3, Zhiding Yu, Abhishek Badki, Stan Birchfield, Valts Blukis, Yevgen Chebotar, Siyi Chen4, Sicong Leng5, Yu-Cheng Chou6, Tianli Ding, Boyi Li, Zhengyi Luo, Hang Su, Jonathan Tremblay, Tingwu Wang, Bowen Wen, Jimmy Wu, Xianghui Xie7, Hanrong Ye, Hongxu Yin, K.R. Zentner, Liangyan Gui1, Yu-Xiong Wang1, Yuke Zhu, Linxi "Jim" Fan, Jan Kautz

* Co-first authors (alphabetical order)  ·  † Core authors (alphabetical order)  ·  ‡ Joint advising  ·  Unmarked: NVIDIA
1 UIUC   2 UC San Diego   3 HK Polytechnic University   4 University of Michigan   5 Nanyang Technological University   6 Johns Hopkins University   7 University of Tübingen