Abstract
Cosmos-Reason1 is a suite of models, ontologies, and benchmarks that we develop with the goal of enabling multimodal LLMs to generate physically grounded responses. We release a multimodal LLM, Cosmos-Reason1-7B, which is trained in two stages: Physical AI supervised fine-tuning (SFT) and Physical AI reinforcement learning (RL). We also define ontologies for physical common sense and embodied reasoning, and build benchmarks to evaluate the Physical AI reasoning capabilities of multimodal LLMs.
Overview
An illustration of our multimodal large language model architecture. Given an input video and an input text prompt, the video is projected into the LLM's token embedding space as video tokens by a vision encoder followed by a projector. The text tokens are concatenated with the video tokens and fed into the LLM backbone, a hybrid Mamba-MLP-Transformer architecture. Our model can output responses with long chain-of-thought reasoning processes.
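To make the caption's token flow concrete, below is a minimal PyTorch sketch, not the released implementation: all module choices, dimensions, and names (`VideoLLMSketch`, `d_vision`, `d_model`) are illustrative assumptions, and plain Transformer layers stand in for the hybrid Mamba-MLP-Transformer backbone (causal masking and autoregressive decoding are omitted for brevity).

```python
import torch
import torch.nn as nn


class VideoLLMSketch(nn.Module):
    """Toy stand-in for the vision encoder -> projector -> LLM backbone pipeline."""

    def __init__(self, vocab_size=32000, d_vision=768, d_model=1024):
        super().__init__()
        # Stand-in vision encoder; the real model uses a pretrained video/image encoder.
        self.vision_encoder = nn.Linear(d_vision, d_vision)
        # Projector maps vision features into the LLM's token embedding space.
        self.projector = nn.Sequential(
            nn.Linear(d_vision, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Plain Transformer layers stand in for the hybrid Mamba-MLP-Transformer backbone.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, text_ids):
        # video_feats: (batch, num_video_patches, d_vision); text_ids: (batch, seq_len)
        video_tokens = self.projector(self.vision_encoder(video_feats))
        text_tokens = self.text_embed(text_ids)
        # Video tokens are concatenated with text tokens and fed to the backbone.
        hidden = self.backbone(torch.cat([video_tokens, text_tokens], dim=1))
        return self.lm_head(hidden)  # next-token logits over the combined sequence


model = VideoLLMSketch()
logits = model(torch.randn(1, 16, 768), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 32000])
```

The key design point illustrated here is that the projector output lives in the same embedding space as the text tokens, so video and text can be concatenated into a single sequence for the LLM backbone.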
Results
Physical Common Sense Benchmark
Physical common sense benchmark across Space, Time, and Other Physics categories. ↑ = higher is better. The gain in parentheses is relative to the same-size baseline (Qwen2.5-VL-7B for the 7B model, Nemotron-H-56B for the 56B model).
| Methods | Space | Time | Other Physics | Avg. |
|---|---|---|---|---|
| Gemini 2.0 Flash | 53.8 | 50.0 | 46.9 | 50.2 |
| GPT-4o | 61.3 | 54.7 | 50.9 | 55.6 |
| OpenAI o1 | 63.8 | 58.1 | 58.0 | 59.9 |
| Qwen2.5-VL-7B | 48.8 | 56.4 | 37.2 | 47.4 |
| Nemotron-H-56B | 61.3 | 68.1 | 45.1 | 58.2 |
| Cosmos-Reason1-7B | 54.2 | 58.7 | 50.0 | 54.3 (+6.9) |
| Cosmos-Reason1-56B | 61.3 | 65.5 | 53.9 | 60.2 (+2.0) |
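The "Avg." column and the parenthesized gains in this and the following tables can be reproduced with a few lines of arithmetic. A minimal sketch, assuming each Cosmos-Reason1 model is paired with its same-size baseline (an interpretation consistent with the reported numbers):

```python
# Scores copied from the table above. Treating Qwen2.5-VL-7B / Nemotron-H-56B as the
# comparison baselines is an assumption (it matches the reported +6.9 / +2.0 gains).
reported_avg = {
    "Qwen2.5-VL-7B": 47.4,
    "Nemotron-H-56B": 58.2,
    "Cosmos-Reason1-7B": 54.3,
    "Cosmos-Reason1-56B": 60.2,
}

# Unweighted mean of the three category scores for Cosmos-Reason1-7B. (The table's
# averages may be weighted by question counts, so small rounding differences can occur.)
space, time, other_physics = 54.2, 58.7, 50.0
print(round((space + time + other_physics) / 3, 1))  # 54.3

# Gains shown in parentheses: model average minus the same-size baseline average.
print(round(reported_avg["Cosmos-Reason1-7B"] - reported_avg["Qwen2.5-VL-7B"], 1))    # 6.9
print(round(reported_avg["Cosmos-Reason1-56B"] - reported_avg["Nemotron-H-56B"], 1))  # 2.0
```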
Embodied Reasoning Benchmark
Embodied reasoning benchmark across robotics and AV domains. ↑ = higher is better. Gains in parentheses follow the same convention as above (relative to Qwen2.5-VL-7B and Nemotron-H-56B, respectively).
| Models | BridgeData V2 | RoboVQA | Agibot | HoloAssist | AV | RoboFail | Avg. |
|---|---|---|---|---|---|---|---|
| Gemini 2.0 Flash | 25.0 | 78.2 | 29.0 | 44.0 | 37.0 | 67.0 | 46.7 |
| GPT-4o | 42.0 | 71.8 | 32.0 | 65.0 | 46.0 | 63.0 | 53.3 |
| OpenAI o1 | 42.0 | 80.0 | 44.0 | 63.0 | 37.0 | 61.0 | 54.5 |
| Qwen2.5-VL-7B | 38.0 | 82.5 | 40.4 | 50.0 | 36.0 | 57.6 | 50.7 |
| Nemotron-H-56B | 37.0 | 77.2 | 37.0 | 65.0 | 41.0 | 64.0 | 53.5 |
| Cosmos-Reason1-7B | 58.8 | 83.8 | 49.4 | 63.0 | 55.6 | 60.0 | 61.8 (+11.1) |
| Cosmos-Reason1-56B | 65.0 | 80.0 | 47.6 | 57.8 | 65.8 | 66.2 | 63.7 (+10.2) |
Physical Common Sense and Embodied Reasoning — Effect of Physical AI RL
Combined evaluation showing the effect of Physical AI RL training. The gain in parentheses is relative to the SFT-only Cosmos-Reason1-7B model.
| Models | Common Sense | BridgeData V2 | RoboVQA | Agibot | HoloAssist | AV | RoboFail | Avg. |
|---|---|---|---|---|---|---|---|---|
| Cosmos-Reason1-7B | 54.3 | 58.8 | 83.8 | 49.4 | 63.0 | 55.6 | 60.0 | 60.7 |
| + Physical AI RL | 56.2 | 73.5 | 86.8 | 54.2 | 60.0 | 67.0 | 62.0 | 65.7 (+5.0) |
Effect of Physical AI RL — Demonstrations
Intuitive Physics Benchmark
Intuitive physics benchmark. ↑ = higher is better. Gains in parentheses are relative to Qwen2.5-VL-7B for Cosmos-Reason1-7B and relative to the SFT-only model for the RL row.
| Models | Arrow of Time | Spatial Puzzle | Object Permanence | Avg. |
|---|---|---|---|---|
| Random Guess | 50.0 | 25.0 | 50.0 | 41.7 |
| Gemini 2.0 Flash | 50.0 | 31.0 | 48.0 | 43.0 |
| GPT-4o | 50.0 | 77.0 | 48.0 | 58.3 |
| OpenAI o1 | 51.0 | 64.0 | 49.0 | 54.7 |
| Qwen2.5-VL-7B | 50.2 | 27.2 | 48.8 | 42.1 |
| Cosmos-Reason1-7B | 56.0 | 85.4 | 82.0 | 74.5 (+32.4) |
| + Physical AI RL | 64.5 | 94.0 | 86.0 | 81.5 (+7.0) |
Demonstrations
Output Reasoning Traces
Below are examples of the model's answers and reasoning traces given an input video and question.
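For readers who want to generate traces like these themselves, here is a hypothetical inference sketch. It assumes the released 7B checkpoint loads through Hugging Face transformers with the standard Qwen2.5-VL interfaces of its base model; the model id, video path, and prompt are placeholders, and the official release may document a different workflow.

```python
# Hypothetical inference sketch. Assumptions: the checkpoint is usable with the standard
# Qwen2.5-VL classes in Hugging Face transformers (plus the qwen-vl-utils helper), and
# "nvidia/Cosmos-Reason1-7B" and the video path are placeholder identifiers.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "nvidia/Cosmos-Reason1-7B"  # placeholder model id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/clip.mp4"},
        {"type": "text", "text": "Is the robot's next action physically feasible? "
                                 "Explain your reasoning step by step."},
    ],
}]

# Build the prompt, preprocess the video, then generate a long reasoning trace.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```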
Citation
Please cite as NVIDIA et al. using the following BibTeX entry:
@article{nvidia2025cosmosreason1physicalcommonsense,
title={Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning},
author={NVIDIA and Azzolini, Alisson and Brandon, Hannah and Chattopadhyay, Prithvijit and Chen, Huayu and Chu, Jinju and Cui, Yin and Diamond, Jenna and Ding, Yifan and Ferroni, Francesco and Govindaraju, Rama and Gu, Jinwei and Gururani, Siddharth and El Hanafi, Imad and Hao, Zekun and Huffman, Jacob and Jin, Jingyi and Johnson, Brendan and Khan, Rizwan and Kurian, George and Lantz, Elena and Lee, Nayeon and Li, Zhaoshuo and Li, Xuan and Lin, Tsung-Yi and Lin, Yen-Chen and Liu, Ming-Yu and Mathau, Andrew and Ni, Yun and Pavao, Lindsey and Ping, Wei and Romero, David W. and Smelyanskiy, Misha and Song, Shuran and Tchapmi, Lyne and Wang, Andrew Z. and Wang, Boxin and Wang, Haoxiang and Wei, Fangyin and Xu, Jiashu and Xu, Yao and Yang, Xiaodong and Yang, Zhuolin and Zeng, Xiaohui and Zhang, Zhe},
journal={arXiv preprint arXiv:2503.15558},
year={2025}
}