[Cosmos-Embed1 teaser animation]

Abstract

Multi-modal embeddings, particularly joint video-text embeddings, are critical for physical AI development pipelines. They enable essential data curation tasks, including text-to-video search, inverse video search, semantic deduplication, and targeted filtering, and can also serve as conditioning representations for downstream physical AI models. While existing video-text embedders perform well in general domains, they underperform substantially on physical AI tasks. To bridge this gap, we introduce Cosmos-Embed1, a joint video-text embedder specifically tailored for physical AI applications.

The Cosmos-Embed1 architecture is based on QFormer, with modifications for processing video inputs. The video embedder processes frames individually using a ViT backbone. Per-frame ViT features are concatenated along the temporal dimension and augmented with temporal embeddings. The resulting sequence is passed to a QFormer module, which uses cross-attention to extract a compact set of visual query tokens from the input frames; these query tokens are pooled to produce a single video embedding. The text embedder processes tokenized text through the self-attention branch of the QFormer to generate a text embedding. Text and video embeddings are L2-normalized and aligned using a contrastive video-text loss, supplemented by video-text matching and video captioning losses. We release three versions of Cosmos-Embed1, optimized to process 8 frames per video at 224p, 336p, and 448p resolution, respectively.
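
To make the dataflow concrete, here is a minimal PyTorch sketch of the forward pass described above. Module names, the query count, and dimensions are illustrative assumptions, not the released implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CosmosEmbed1Sketch(nn.Module):
    # Hypothetical wrapper: vit, qformer, and text_encoder are stand-ins for
    # the ViT backbone, the QFormer, and its self-attention text branch.
    def __init__(self, vit, qformer, text_encoder, dim=768, num_queries=32, num_frames=8):
        super().__init__()
        self.vit = vit
        self.qformer = qformer
        self.text_encoder = text_encoder
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.temporal_emb = nn.Parameter(torch.zeros(num_frames, dim))

    def embed_video(self, frames):                        # frames: (B, T, C, H, W)
        B, T = frames.shape[:2]
        feats = self.vit(frames.flatten(0, 1))            # (B*T, P, D) per-frame patch tokens
        feats = feats.unflatten(0, (B, T))                # (B, T, P, D)
        feats = feats + self.temporal_emb[None, :, None]  # add temporal embeddings
        feats = feats.flatten(1, 2)                       # concatenate along time: (B, T*P, D)
        q = self.queries[None].expand(B, -1, -1)          # learned visual query tokens
        q = self.qformer(q, context=feats)                # cross-attend queries to frame features
        return F.normalize(q.mean(dim=1), dim=-1)         # pool + L2-normalize -> (B, D)

    def embed_text(self, token_ids):                      # token_ids: (B, L)
        t = self.text_encoder(token_ids)[:, 0]            # e.g. take the [CLS] token
        return F.normalize(t, dim=-1)                     # L2-normalize -> (B, D)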

Cosmos-Embed1 was trained on a curated dataset of nearly 8M paired video-text samples spanning diverse domains: general-purpose videos, human actions, robotic tasks, and ego-centric vehicle scenarios. The model supports text-to-video retrieval, inverse video search, semantic deduplication, and kNN classification, and serves as a base for video curation tasks. Cosmos-Embed1 achieves state-of-the-art performance on autonomous vehicle (AV) and robotics datasets while maintaining competitive performance in general domains.
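
Because both modalities share one L2-normalized embedding space, the curation tasks above reduce to dot products over precomputed embeddings. A small NumPy sketch, with random vectors standing in for real model outputs:

import numpy as np

rng = np.random.default_rng(0)

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

video_emb = l2norm(rng.standard_normal((1000, 768)))  # stand-in video embeddings
text_emb = l2norm(rng.standard_normal((1, 768)))      # stand-in text query embedding

# Text-to-video search: rank clips by cosine similarity to the text query.
scores = video_emb @ text_emb[0]
top10 = np.argsort(-scores)[:10]

# Semantic deduplication: flag clip pairs above a similarity threshold.
sim = video_emb @ video_emb.T
np.fill_diagonal(sim, -1.0)
duplicate_pairs = np.argwhere(sim > 0.95)

# kNN classification: label a clip by majority vote over its nearest neighbors.
k = 5
neighbors = np.argsort(-sim[0])[:k]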

[Cosmos-Embed1 architecture diagram]

Intermediate Feature Visualizations

Analysis of intermediate video features (via robust PCA) suggests that Cosmos-Embed1 (1B) captures dense semantic information more effectively than video-text counterparts such as InternVideo2-s2 (6B) and Perception Encoder Core-G (2B). This apparent advantage shows in the model's ability to maintain consistent feature representations across frames, likely a result of the training-time embedding distillation objectives.

[Intermediate feature visualizations: Cosmos-Embed1 (1B) | InternVideo2-s2 (6B) | Perception Encoder Core-G (2B)]
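
As a rough illustration of how such visualizations can be produced, the sketch below projects per-patch features onto their top three principal components and renders them as RGB. It uses plain PCA for brevity, whereas the analysis above uses a robust PCA variant:

import numpy as np

def pca_rgb(patch_feats, h, w):
    """patch_feats: (T, h*w, D) intermediate features for T frames."""
    T, P, D = patch_feats.shape
    X = patch_feats.reshape(-1, D).astype(np.float64)
    X -= X.mean(axis=0)                                # center features
    _, _, Vt = np.linalg.svd(X, full_matrices=False)   # principal directions
    rgb = X @ Vt[:3].T                                 # top-3 components per patch
    lo, hi = rgb.min(axis=0), rgb.max(axis=0)
    rgb = (rgb - lo) / (hi - lo + 1e-8)                # rescale to [0, 1] for display
    return rgb.reshape(T, h, w, 3)                     # one RGB map per frame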

Quantitative Results

General & Actions Domain

Cosmos-Embed1 is competitive with contemporary state-of-the-art embedding models in general video retrieval and outperforms the state of the art in action recognition.

| Embedding Model | MSR-VTT T2V R@1 | MSR-VTT V2T R@1 | Kinetics-400 (val) Accuracy | Kinetics-400 (val) F1 | Kinetics-600 (val) Accuracy | Kinetics-600 (val) F1 | Kinetics-700 (val) Accuracy | Kinetics-700 (val) F1 |
|---|---|---|---|---|---|---|---|---|
| InternVideo2-s2 (6B) | 48.00 | 46.40 | 62.64 | 60.58 | 59.75 | 57.14 | 52.38 | 49.81 |
| Perception Encoder Core-448p-G (2B) | 51.00 | 49.90 | 76.71 | 76.00 | 76.02 | 75.14 | 68.84 | 68.28 |
| Cosmos-Embed1-336p (1B) | 47.60 | 47.60 | 87.77 | 87.66 | 88.19 | 88.06 | 74.74 | 74.57 |

Retrieval metrics with DSL (dual softmax) re-ranking, linearly spaced frames with square resizing for all methods, at bfloat16 precision. For action recognition on Kinetics, we use the standard prompt templates from PE and take the max similarity over all templates per class.
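
For reference, a NumPy sketch of DSL re-ranking as commonly implemented: the similarity matrix is re-weighted elementwise by a softmax over the opposite axis before ranking. The temperature value here is an assumption (a factor around 100 is typical in published implementations):

import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dsl_rerank(sim, temp=100.0):
    """sim: (num_texts, num_videos) cosine-similarity matrix."""
    # Re-weight each column by how strongly each video prefers each text.
    return sim * softmax(temp * sim, axis=0)

def t2v_r1(sim):
    # Text-to-video R@1: fraction of texts whose ground-truth video ranks
    # first, assuming text i matches video i.
    ranked = dsl_rerank(sim)
    return float((ranked.argmax(axis=1) == np.arange(sim.shape[0])).mean())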

Robotics Domain

Cosmos-Embed1 outperforms contemporary state-of-the-art embedding models in robotics scenario retrieval across different robot embodiments and viewpoints.

| Embedding Model | 1x Robotics T2V R@1 | 1x Robotics V2T R@1 | AgiBot T2V R@1 | AgiBot V2T R@1 | Bridge T2V R@1 | Bridge V2T R@1 | RT-1 T2V R@1 | RT-1 V2T R@1 |
|---|---|---|---|---|---|---|---|---|
| InternVideo2-s2 (6B) | 36.50 | 30.00 | 1.20 | 7.82 | 8.17 | 6.16 | 3.78 | 2.45 |
| Perception Encoder Core-448p-G (2B) | 36.50 | 24.50 | 1.16 | 0.83 | 7.19 | 5.24 | 2.63 | 2.58 |
| Cosmos-Embed1-336p (1B) | 61.50 | 65.50 | 7.04 | 6.33 | 24.51 | 22.90 | 19.94 | 20.21 |

Retrieval metrics with DSL re-ranking, linearly spaced frames with square resizing for all methods, at bfloat16 precision.

AV Domain

Cosmos-Embed1 outperforms contemporary state-of-the-art embedding models in AV action recognition (Internal AV Actions dataset with 22 action classes) and scenario retrieval (OpenDV).

| Embedding Model | AV Actions (Internal, 22 classes) Accuracy | AV Actions (Internal, 22 classes) F1-Score | OpenDV T2V R@1 | OpenDV V2T R@1 |
|---|---|---|---|---|
| InternVideo2-s2 (6B) | 12.71 | 10.54 | 7.33 | 6.60 |
| Perception Encoder Core-448p-G (2B) | 23.01 | 21.89 | 9.58 | 9.30 |
| Cosmos-Embed1-336p (1B) | 72.23 | 72.28 | 34.42 | 34.67 |

Retrieval metrics with DSL re-ranking, linearly spaced frames with square resizing for all methods, at bfloat16 precision. For action recognition on the internal AV Actions dataset, we use prompt templates specific to ego-vehicle actions.
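
The template-based classification protocol can be sketched as follows; the templates and class names here are hypothetical placeholders, not the ones used for the reported numbers:

import numpy as np

# Hypothetical ego-vehicle templates and classes, for illustration only.
templates = ["a video of the ego vehicle {}.", "the car is {} in traffic."]
classes = ["turning left", "turning right", "braking"]
prompts = [[t.format(c) for t in templates] for c in classes]

def classify(video_emb, text_emb):
    """video_emb: (D,), text_emb: (num_classes, num_templates, D), L2-normalized."""
    sims = text_emb @ video_emb          # (num_classes, num_templates)
    per_class = sims.max(axis=1)         # max similarity over templates per class
    return int(per_class.argmax())       # predicted class index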

Cosmos-Embed1 Search Demo

Cosmos-Embed1 in action with Text-to-Video and Video-to-Video search on real-world examples.

Citation

Please cite this work as NVIDIA et al. using the following BibTeX entry:

@misc{nvidia2025cosmosembed1,
  title={Cosmos-Embed1: A Joint Video-Text Embedder for Physical AI},
  author={NVIDIA and Ferroni, Francesco and Chattopadhyay, Prithvijit and Heinrich, Greg and Ranzinger, Mike and Amoroso, Roberto and Luo, Alice and Wang, Andrew and Liu, Ming-Yu},
  url={https://research.nvidia.com/labs/dir/cosmos-embed1/},
  year={2025}
}