Abstract

Cosmos-Reason1 is a suite of models, ontologies, and benchmarks that we develop with the goal of enabling multimodal LLMs to generate physically grounded responses. We release a multimodal LLM: Cosmos-Reason1-7B which is trained in two stages: Physical AI SFT and Physical AI reinforcement learning. We define ontologies for physical common sense and embodied reasoning, and also build benchmarks to evaluate Physical AI reasoning capabilities of multimodal LLMs.

Overview

Cosmos-Reason1 overview

Multimodal architecture

An illustration of our multimodal large language model architecture. Given an input video and an input text prompt, the video is projected into the LLM's token embedding space as video tokens by a vision encoder followed by a projector. The text tokens are concatenated with the video tokens and fed into the LLM backbone, a hybrid Mamba-MLP-Transformer architecture. Our model can output responses with long chain-of-thought reasoning processes.
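The data flow described above can be sketched in a few lines of plain Python. This is an illustrative toy, not the actual implementation: every function here is a hypothetical stand-in (the real vision encoder, projector, and hybrid Mamba-MLP-Transformer backbone are large neural networks), and the embedding width is an arbitrary placeholder.

```python
import random

EMBED_DIM = 8  # hypothetical token-embedding width, for illustration only

def vision_encoder(video_frames):
    """Stand-in vision encoder: one feature vector per video frame."""
    return [[random.random() for _ in range(EMBED_DIM * 2)] for _ in video_frames]

def projector(frame_features):
    """Stand-in projector: maps vision features into the LLM embedding space."""
    return [feat[:EMBED_DIM] for feat in frame_features]  # toy "projection"

def embed_text(prompt_tokens):
    """Stand-in text-embedding lookup for the prompt tokens."""
    return [[random.random() for _ in range(EMBED_DIM)] for _ in prompt_tokens]

def llm_backbone(token_embeddings):
    """Stand-in for the hybrid Mamba-MLP-Transformer backbone: in the real
    model this autoregressively emits a long chain-of-thought, then an answer."""
    return f"<think>...reasoning over {len(token_embeddings)} tokens...</think> answer"

def generate(video_frames, prompt_tokens):
    video_tokens = projector(vision_encoder(video_frames))  # video -> LLM token space
    text_tokens = embed_text(prompt_tokens)
    sequence = video_tokens + text_tokens                   # concatenate both modalities
    return llm_backbone(sequence)

print(generate(video_frames=range(4), prompt_tokens=["What", "happens", "next", "?"]))
```

The key structural point the sketch captures is that video tokens and text tokens share one embedding space, so the backbone processes a single concatenated sequence.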

Results

Physical Common Sense Benchmark

Physical common sense benchmark across Space, Time, and Other Physics categories. ↑ = higher is better. Improvement over the corresponding same-size baseline (Qwen2.5-VL-7B for the 7B model, Nemotron-H-56B for the 56B model) is shown in parentheses.

| Methods | Space | Time | Other Physics | Avg. |
|---|---|---|---|---|
| Gemini 2.0 Flash | 53.8 | 50.0 | 46.9 | 50.2 |
| GPT-4o | 61.3 | 54.7 | 50.9 | 55.6 |
| OpenAI o1 | 63.8 | 58.1 | 58.0 | 59.9 |
| Qwen2.5-VL-7B | 48.8 | 56.4 | 37.2 | 47.4 |
| Nemotron-H-56B | 61.3 | 68.1 | 45.1 | 58.2 |
| Cosmos-Reason1-7B | 54.2 | 58.7 | 50.0 | 54.3 (+6.9) |
| Cosmos-Reason1-56B | 61.3 | 65.5 | 53.9 | 60.2 (+2.0) |

Embodied Reasoning Benchmark

Embodied reasoning benchmark across robotics and autonomous vehicle (AV) domains. ↑ = higher is better. Improvement over the corresponding same-size baseline is shown in parentheses.

| Models | BridgeData V2 | RoboVQA | Agibot | HoloAssist | AV | RoboFail | Avg. |
|---|---|---|---|---|---|---|---|
| Gemini 2.0 Flash | 25.0 | 78.2 | 29.0 | 44.0 | 37.0 | 67.0 | 46.7 |
| GPT-4o | 42.0 | 71.8 | 32.0 | 65.0 | 46.0 | 63.0 | 53.3 |
| OpenAI o1 | 42.0 | 80.0 | 44.0 | 63.0 | 37.0 | 61.0 | 54.5 |
| Qwen2.5-VL-7B | 38.0 | 82.5 | 40.4 | 50.0 | 36.0 | 57.6 | 50.7 |
| Nemotron-H-56B | 37.0 | 77.2 | 37.0 | 65.0 | 41.0 | 64.0 | 53.5 |
| Cosmos-Reason1-7B | 58.8 | 83.8 | 49.4 | 63.0 | 55.6 | 60.0 | 61.8 (+11.1) |
| Cosmos-Reason1-56B | 65.0 | 80.0 | 47.6 | 57.8 | 65.8 | 66.2 | 63.7 (+10.2) |

Physical Common Sense and Embodied Reasoning — Effect of Physical AI RL

Combined evaluation showing the effect of Physical AI RL post-training. ↑ = higher is better.

| Models | Common Sense | BridgeData V2 | RoboVQA | Agibot | HoloAssist | AV | RoboFail | Avg. |
|---|---|---|---|---|---|---|---|---|
| Cosmos-Reason1-7B | 54.3 | 58.8 | 83.8 | 49.4 | 63.0 | 55.6 | 60.0 | 60.7 |
| + Physical AI RL | 56.2 | 73.5 | 86.8 | 54.2 | 60.0 | 67.0 | 62.0 | 65.7 (+5.0) |
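As a quick sanity check, the Avg. column of this table is consistent with a simple unweighted mean over the seven benchmark columns. The snippet below recomputes it from the rows above (an illustrative check only; other tables on this page may use per-question weighting):

```python
# Seven benchmark columns from the table: Common Sense, BridgeData V2,
# RoboVQA, Agibot, HoloAssist, AV, RoboFail.
pre_rl  = [54.3, 58.8, 83.8, 49.4, 63.0, 55.6, 60.0]  # Cosmos-Reason1-7B row
post_rl = [56.2, 73.5, 86.8, 54.2, 60.0, 67.0, 62.0]  # + Physical AI RL row

def mean(xs):
    return round(sum(xs) / len(xs), 1)

print(mean(pre_rl), mean(post_rl), round(mean(post_rl) - mean(pre_rl), 1))
# 60.7 65.7 5.0
```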

Effect of Physical AI RL — Demonstrations

Intuitive Physics Benchmark

Intuitive physics benchmark. ↑ = higher is better. Improvements shown in parentheses are over Qwen2.5-VL-7B (for Cosmos-Reason1-7B) and over the SFT-only model (for + Physical AI RL).

| Models | Arrow of Time | Spatial Puzzle | Object Permanence | Avg. |
|---|---|---|---|---|
| Random Guess | 50.0 | 25.0 | 50.0 | 41.7 |
| Gemini 2.0 Flash | 50.0 | 31.0 | 48.0 | 43.0 |
| GPT-4o | 50.0 | 77.0 | 48.0 | 58.3 |
| OpenAI o1 | 51.0 | 64.0 | 49.0 | 54.7 |
| Qwen2.5-VL-7B | 50.2 | 27.2 | 48.8 | 42.1 |
| Cosmos-Reason1-7B | 56.0 | 85.4 | 82.0 | 74.5 (+32.4) |
| + Physical AI RL | 64.5 | 94.0 | 86.0 | 81.5 (+7.0) |
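The Random Guess row is consistent with multiple-choice tasks that offer 2, 4, and 2 answer options respectively; the choice counts below are inferred from that row, not stated explicitly on this page:

```python
# Answer choices per task, inferred from the Random Guess row above.
choices = {"Arrow of Time": 2, "Spatial Puzzle": 4, "Object Permanence": 2}

# Random-guess accuracy per task = 100% / number of choices.
acc = {task: 100.0 / n for task, n in choices.items()}
avg = round(sum(acc.values()) / len(acc), 1)

print(acc, avg)
# {'Arrow of Time': 50.0, 'Spatial Puzzle': 25.0, 'Object Permanence': 50.0} 41.7
```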

Demonstrations

Output Reasoning Traces

Below are sample answers and reasoning traces produced by the model, given a video and a question.

Citation

Please cite as NVIDIA et al. using the following BibTeX entry:

@article{nvidia2025cosmosreason1physicalcommonsense,
  title={Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning},
  author={NVIDIA and Azzolini, Alisson and Brandon, Hannah and Chattopadhyay, Prithvijit and Chen, Huayu and Chu, Jinju and Cui, Yin and Diamond, Jenna and Ding, Yifan and Ferroni, Francesco and Govindaraju, Rama and Gu, Jinwei and Gururani, Siddharth and El Hanafi, Imad and Hao, Zekun and Huffman, Jacob and Jin, Jingyi and Johnson, Brendan and Khan, Rizwan and Kurian, George and Lantz, Elena and Lee, Nayeon and Li, Zhaoshuo and Li, Xuan and Lin, Tsung-Yi and Lin, Yen-Chen and Liu, Ming-Yu and Mathau, Andrew and Ni, Yun and Pavao, Lindsey and Ping, Wei and Romero, David W. and Smelyanskiy, Misha and Song, Shuran and Tchapmi, Lyne and Wang, Andrew Z. and Wang, Boxin and Wang, Haoxiang and Wei, Fangyin and Xu, Jiashu and Xu, Yao and Yang, Xiaodong and Yang, Zhuolin and Zeng, Xiaohui and Zhang, Zhe},
  journal={arXiv preprint arXiv:2503.15558},
  year={2025}
}