Abstract
Cosmos-Reason1 is a suite of models, ontologies, and benchmarks that we develop with the goal of enabling multimodal LLMs to generate physically grounded responses. We release a multimodal LLM, Cosmos-Reason1-7B, which is trained in two stages: Physical AI supervised fine-tuning (SFT) and Physical AI reinforcement learning (RL). We also define ontologies for physical common sense and embodied reasoning, and build benchmarks to evaluate the Physical AI reasoning capabilities of multimodal LLMs.
Overview
An illustration of our multimodal large language model architecture. Given an input video and an input text prompt, the video is projected into the LLM's token embedding space as video tokens by a vision encoder followed by a projector. The text tokens are concatenated with the video tokens and fed into the LLM backbone, a hybrid Mamba-MLP-Transformer architecture. Our model can output responses with long chain-of-thought reasoning processes.
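To make the caption's token flow concrete, below is a minimal PyTorch sketch, not the released implementation: all module choices, dimensions, and names (`VideoLLMSketch`, `d_vision`, `d_model`) are illustrative assumptions, and plain Transformer layers stand in for the hybrid Mamba-MLP-Transformer backbone (causal masking and autoregressive decoding are omitted for brevity).

```python
import torch
import torch.nn as nn


class VideoLLMSketch(nn.Module):
    """Toy stand-in for the vision encoder -> projector -> LLM backbone pipeline."""

    def __init__(self, vocab_size=32000, d_vision=768, d_model=1024):
        super().__init__()
        # Stand-in vision encoder; the real model uses a pretrained video/image encoder.
        self.vision_encoder = nn.Linear(d_vision, d_vision)
        # Projector maps vision features into the LLM's token embedding space.
        self.projector = nn.Sequential(
            nn.Linear(d_vision, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Plain Transformer layers stand in for the hybrid Mamba-MLP-Transformer backbone.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, text_ids):
        # video_feats: (batch, num_video_patches, d_vision); text_ids: (batch, seq_len)
        video_tokens = self.projector(self.vision_encoder(video_feats))
        text_tokens = self.text_embed(text_ids)
        # Video tokens are concatenated with text tokens and fed to the backbone.
        hidden = self.backbone(torch.cat([video_tokens, text_tokens], dim=1))
        return self.lm_head(hidden)  # next-token logits over the combined sequence


model = VideoLLMSketch()
logits = model(torch.randn(1, 16, 768), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 32000])
```

The key design point illustrated here is that the projector output lives in the same embedding space as the text tokens, so video and text can be concatenated into a single sequence for the LLM backbone.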
Results
Physical Common Sense Benchmark
Physical common sense benchmark across Space, Time, and Other Physics categories. ↑ = higher is better. The gain in parentheses is relative to the same-size baseline (Qwen2.5-VL-7B for the 7B model, Nemotron-H-56B for the 56B model).
| Methods | Space | Time | Other Physics | Avg. |
|---|---|---|---|---|
| Gemini 2.0 Flash | 53.8 | 50.0 | 46.9 | 50.2 |
| GPT-4o | 61.3 | 54.7 | 50.9 | 55.6 |
| OpenAI o1 | 63.8 | 58.1 | 58.0 | 59.9 |
| Qwen2.5-VL-7B | 48.8 | 56.4 | 37.2 | 47.4 |
| Nemotron-H-56B | 61.3 | 68.1 | 45.1 | 58.2 |
| Cosmos-Reason1-7B | 54.2 | 58.7 | 50.0 | 54.3 (+6.9) |
| Cosmos-Reason1-56B | 61.3 | 65.5 | 53.9 | 60.2 (+2.0) |
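The "Avg." column and the parenthesized gains in this and the following tables can be reproduced with a few lines of arithmetic. A minimal sketch, assuming each Cosmos-Reason1 model is paired with its same-size baseline (an interpretation consistent with the reported numbers):

```python
# Scores copied from the table above. Treating Qwen2.5-VL-7B / Nemotron-H-56B as the
# comparison baselines is an assumption (it matches the reported +6.9 / +2.0 gains).
reported_avg = {
    "Qwen2.5-VL-7B": 47.4,
    "Nemotron-H-56B": 58.2,
    "Cosmos-Reason1-7B": 54.3,
    "Cosmos-Reason1-56B": 60.2,
}

# Unweighted mean of the three category scores for Cosmos-Reason1-7B. (The table's
# averages may be weighted by question counts, so small rounding differences can occur.)
space, time, other_physics = 54.2, 58.7, 50.0
print(round((space + time + other_physics) / 3, 1))  # 54.3

# Gains shown in parentheses: model average minus the same-size baseline average.
print(round(reported_avg["Cosmos-Reason1-7B"] - reported_avg["Qwen2.5-VL-7B"], 1))    # 6.9
print(round(reported_avg["Cosmos-Reason1-56B"] - reported_avg["Nemotron-H-56B"], 1))  # 2.0
```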
Embodied Reasoning Benchmark
Embodied reasoning benchmark across robotics and AV domains. ↑ = higher is better. Gains in parentheses follow the same convention as above (relative to Qwen2.5-VL-7B and Nemotron-H-56B, respectively).
| Models | BridgeData V2 | RoboVQA | Agibot | HoloAssist | AV | RoboFail | Avg. |
|---|---|---|---|---|---|---|---|
| Gemini 2.0 Flash | 25.0 | 78.2 | 29.0 | 44.0 | 37.0 | 67.0 | 46.7 |
| GPT-4o | 42.0 | 71.8 | 32.0 | 65.0 | 46.0 | 63.0 | 53.3 |
| OpenAI o1 | 42.0 | 80.0 | 44.0 | 63.0 | 37.0 | 61.0 | 54.5 |
| Qwen2.5-VL-7B | 38.0 | 82.5 | 40.4 | 50.0 | 36.0 | 57.6 | 50.7 |
| Nemotron-H-56B | 37.0 | 77.2 | 37.0 | 65.0 | 41.0 | 64.0 | 53.5 |
| Cosmos-Reason1-7B | 58.8 | 83.8 | 49.4 | 63.0 | 55.6 | 60.0 | 61.8 (+11.1) |
| Cosmos-Reason1-56B | 65.0 | 80.0 | 47.6 | 57.8 | 65.8 | 66.2 | 63.7 (+10.2) |
Physical Common Sense and Embodied Reasoning — Effect of Physical AI RL
Combined evaluation showing the effect of Physical AI RL training. The gain in parentheses is relative to the SFT-only Cosmos-Reason1-7B model.
| Models | Common Sense | BridgeData V2 | RoboVQA | Agibot | HoloAssist | AV | RoboFail | Avg. |
|---|---|---|---|---|---|---|---|---|
| Cosmos-Reason1-7B | 54.3 | 58.8 | 83.8 | 49.4 | 63.0 | 55.6 | 60.0 | 60.7 |
| + Physical AI RL | 56.2 | 73.5 | 86.8 | 54.2 | 60.0 | 67.0 | 62.0 | 65.7 (+5.0) |
Effect of Physical AI RL — Demonstrations
Intuitive Physics Benchmark
Intuitive physics benchmark. ↑ = higher is better. Gains in parentheses are relative to Qwen2.5-VL-7B for Cosmos-Reason1-7B and relative to the SFT-only model for the RL row.
| Models | Arrow of Time | Spatial Puzzle | Object Permanence | Avg. |
|---|---|---|---|---|
| Random Guess | 50.0 | 25.0 | 50.0 | 41.7 |
| Gemini 2.0 Flash | 50.0 | 31.0 | 48.0 | 43.0 |
| GPT-4o | 50.0 | 77.0 | 48.0 | 58.3 |
| OpenAI o1 | 51.0 | 64.0 | 49.0 | 54.7 |
| Qwen2.5-VL-7B | 50.2 | 27.2 | 48.8 | 42.1 |
| Cosmos-Reason1-7B | 56.0 | 85.4 | 82.0 | 74.5 (+32.4) |
| + Physical AI RL | 64.5 | 94.0 | 86.0 | 81.5 (+7.0) |
Demonstrations
Output Reasoning Traces
Below are examples of the model's answers and reasoning traces given an input video and question.
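For readers who want to generate traces like these themselves, here is a hypothetical inference sketch. It assumes the released 7B checkpoint loads through Hugging Face transformers with the standard Qwen2.5-VL interfaces of its base model; the model id, video path, and prompt are placeholders, and the official release may document a different workflow.

```python
# Hypothetical inference sketch. Assumptions: the checkpoint is usable with the standard
# Qwen2.5-VL classes in Hugging Face transformers (plus the qwen-vl-utils helper), and
# "nvidia/Cosmos-Reason1-7B" and the video path are placeholder identifiers.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "nvidia/Cosmos-Reason1-7B"  # placeholder model id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/clip.mp4"},
        {"type": "text", "text": "Is the robot's next action physically feasible? "
                                 "Explain your reasoning step by step."},
    ],
}]

# Build the prompt, preprocess the video, then generate a long reasoning trace.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```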
Citation
Please cite as NVIDIA et al. using the following BibTeX entry:
@article{nvidia2025cosmosreason1physicalcommonsense,
title={Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning},
author={NVIDIA and Azzolini, Alisson and Brandon, Hannah and Chattopadhyay, Prithvijit and Chen, Huayu and Chu, Jinju and Cui, Yin and Diamond, Jenna and Ding, Yifan and Ferroni, Francesco and Govindaraju, Rama and Gu, Jinwei and Gururani, Siddharth and El Hanafi, Imad and Hao, Zekun and Huffman, Jacob and Jin, Jingyi and Johnson, Brendan and Khan, Rizwan and Kurian, George and Lantz, Elena and Lee, Nayeon and Li, Zhaoshuo and Li, Xuan and Lin, Tsung-Yi and Lin, Yen-Chen and Liu, Ming-Yu and Mathau, Andrew and Ni, Yun and Pavao, Lindsey and Ping, Wei and Romero, David W. and Smelyanskiy, Misha and Song, Shuran and Tchapmi, Lyne and Wang, Andrew Z. and Wang, Boxin and Wang, Haoxiang and Wei, Fangyin and Xu, Jiashu and Xu, Yao and Yang, Xiaodong and Yang, Zhuolin and Zeng, Xiaohui and Zhang, Zhe},
journal={arXiv preprint arXiv:2503.15558},
year={2025}
}