Cosmos 3

Multiple modalities, many applications.
One single model.

Cosmos 3 connects understanding, generation, simulation, and action through a shared omnimodal world model that moves fluidly across text, images, video, audio, and actions.

Language

Image

Video

Audio

Action

Explore how Cosmos 3 couples different modalities with each capability.

Vision-Language Reasoning

Reason through the physical world.

Cosmos 3 grounds language in images and video, reading spatial relationships, temporal cues, object states, and actions as shared context for deeper physical reasoning.

Robot gripper trajectory reasoning input image

2D grounding input image with IMPOSSIBLE labels on stacked boxes

Image Generation

Create scenes with physical detail.

Cosmos 3 turns language descriptions into realistic images, preserving object detail, spatial layout, and physical cues for downstream generation.

Fixed-wing survey drone banking over a mountain forest

Laser cutting gantry tracing pattern lines on black fabric

Gantry robot applying adhesive to an automotive windshield

Two industrial robot arms lifting an engine block

Surgical robot arm holding a scalpel above a phantom model

Autonomous forklift moving a wrapped pallet in cold storage

Automated guided cart carrying totes through a factory lane

Robotic grippers lacing a brown leather sneaker

Audio-Visual Generation

Generate worlds with sight and sound.

Cosmos 3 creates physically aware video from text, images, or clips, and can pair motion with audio that follows visible events, source movement, and scene context.

Robot Policy

Turn perception into action.

Cosmos 3 adapts the same omnimodal backbone to follow instructions and translate visual context into purposeful planning and manipulation.

Forward Dynamics

Simulate the future world.

Cosmos 3 conditions on observations and controls to roll out future videos, helping agents preview outcomes for planning, evaluation, and data generation.

Input camera

Input hand pose

Inverse Dynamics

Infer the action behind change.

Cosmos 3 reads observed states and recovers the trajectory or control that explains how the scene changed, turning video evidence into actions.

Input video

Reasoning + Generation

Plan first, then bring it to life.

Cosmos 3 can reason over an input scene, produce an actionable trajectory, and use that structured intent to generate the resulting physical interaction.

You are given the task "pick the Cosmos 3 Omni block from the bottom drawer and place it on the counter". First, identify all relevant objects and free spaces in the scene, and then specify the 2D trajectory your end-effector should follow in pixel space. Return the trajectory as (x, y) coordinates. Answer the question using the following format: <think> Your reasoning. </think> Write the final 2D end-effector trajectory immediately after the </think> tag.

<think>
I can see the Cosmos 3 Omni block inside the bottom drawer, which is the object I need to pick up. My gripper is positioned to reach into the drawer and grasp the block. The bottom drawer provides the storage space where the block is located, and I need to access it to complete the task. Once I retrieve the block, I will place it on the counter surface, which is the designated target area. The open space inside the bottom drawer allows me to reach and grasp the block, while the clear surface area on the counter near the drawer opening provides the necessary space to place the block after retrieval. The other objects on the counter are not directly involved in the task but help define the workspace layout.

Based on this analysis, I need to locate the robot gripper, the Cosmos 3 Omni block, the bottom drawer, the counter surface, the yellow-and-black forklift, the white SUV, the white truck, and the robot with the "Physical AI Builder" sign. Their coordinates are as follows: robot gripper (514, 769), Cosmos 3 Omni block (471, 780), bottom drawer (400, 760), counter surface (460, 310), yellow-and-black forklift (235, 304), white SUV (605, 105), white truck (720, 163), robot with "Physical AI Builder" sign (790, 254). The open space inside the bottom drawer is at (380, 760), and the clear surface area on the counter near the drawer opening is at (400, 760). These free-space locations are the relevant points for reaching and placing the block.

I will move my gripper from its current position at (514, 769) to the Cosmos 3 Omni block at (471, 780) to grasp it. After securing the block, I will lift it and move it upward and slightly forward to clear the bottom drawer, which is located at (400, 760). Then I will carry the block to the counter surface at (460, 310), making sure to place it on the clear area near the drawer opening at (400, 760) to complete the task.

The planned trajectory is as follows: [514, 769], [507, 783], [500, 471], [464, 278], [460, 275]</think>

(514, 769) Move the gripper to the Cosmos 3 Omni block
(507, 783) Grasp the Cosmos 3 Omni block
(500, 471) Lift the block out of the bottom drawer
(464, 278) Move the block toward the counter surface
(460, 275) Place the block on the counter surface
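The prompt above asks for reasoning inside <think> tags, followed by the final 2D trajectory right after the closing tag. A minimal sketch of how a caller might split such a response and read the waypoints back as integer pixel coordinates is shown below. The function names (split_response, parse_trajectory) and the regular expression are illustrative assumptions, not part of any Cosmos 3 API, and the sketch only assumes the "(x, y)" answer layout seen in the example.

```python
import re
from typing import List, Tuple

Waypoint = Tuple[int, int]

# Matches "(514, 769)" as well as "[514, 769]". Hypothetical helper pattern,
# chosen to fit the example output shown above.
_PAIR = re.compile(r"[\(\[]\s*(\d+)\s*,\s*(\d+)\s*[\)\]]")


def split_response(response: str) -> Tuple[str, str]:
    """Split a response into (reasoning, answer) at the closing </think> tag."""
    reasoning, sep, answer = response.partition("</think>")
    if not sep:
        raise ValueError("response has no </think> tag")
    reasoning = reasoning.replace("<think>", "", 1).strip()
    return reasoning, answer.strip()


def parse_trajectory(response: str) -> List[Waypoint]:
    """Return the (x, y) pixel waypoints listed after the </think> tag."""
    _, answer = split_response(response)
    return [(int(x), int(y)) for x, y in _PAIR.findall(answer)]


# Shortened copy of the example response from this page.
example = (
    "<think>The planned trajectory is as follows: [514, 769], [507, 783], "
    "[500, 471], [464, 278], [460, 275]</think>\n"
    "(514, 769) Move the gripper to the Cosmos 3 Omni block\n"
    "(507, 783) Grasp the Cosmos 3 Omni block\n"
    "(500, 471) Lift the block out of the bottom drawer\n"
    "(464, 278) Move the block toward the counter surface\n"
    "(460, 275) Place the block on the counter surface\n"
)

if __name__ == "__main__":
    print(parse_trajectory(example))
    # [(514, 769), (507, 783), (500, 471), (464, 278), (460, 275)]
```

Only the text after the tag is parsed, so coordinates mentioned while reasoning (object locations, free-space points) are not mistaken for waypoints.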

Top open foundation for Physical AI.

Cosmos 3 brings leading reasoning, generation, and action performance into open models researchers and builders can inspect, adapt, and deploy.

Reasoning

Leading Open Reasoner for Physical AI

Cosmos 3 ranks #1 among open models on Robotics, Smart Space, and Driving benchmark averages, showing strong physical-world understanding.

Generation

Leading Open Generator for Physical AI

Cosmos 3 ranks #1 among open models for text-to-image, image-to-video, and robot policy across R-Bench, Artificial Analysis, RoboLab, and RoboArena benchmarks.

Learn more and get started with Cosmos 3.

Technical Report, Model Cards, GitHub Code, Product Website

Please cite as NVIDIA et al. using the following BibTeX:

@article{nvidia2026cosmos3,
  title = {Cosmos 3: Omnimodal World Models for Physical AI},
  author = {{NVIDIA}},
  journal = {arXiv preprint arXiv:2606.02800},
  year = {2026},
  url = {https://research.nvidia.com/labs/cosmos-lab/cosmos3/technical-report.pdf}
}
