[Cosmos-Embed1 teaser animation]

Abstract

Multi-modal embeddings, particularly joint video-text embeddings, are critical for physical AI development pipelines. They enable essential data curation tasks, including text-to-video search, inverse video search, semantic deduplication, and targeted filtering, and can also serve as conditioning representations for downstream physical AI models. While existing video-text embedders perform well in general domains, they underperform substantially on physical AI tasks. To bridge this gap, we introduce Cosmos-Embed1, a joint video-text embedder specifically tailored for physical AI applications.

The Cosmos-Embed1 architecture is based on QFormer, with modifications for processing video inputs. The video embedder processes frames individually using a ViT backbone. Per-frame ViT features are concatenated along the temporal dimension and augmented with temporal embeddings. The resulting sequence is passed to a QFormer module, which uses cross-attention to extract a compact set of visual query tokens from the input frames; these query tokens are pooled to produce a single video embedding. The text embedder processes tokenized text through the self-attention branch of the QFormer to generate a text embedding. Text and video embeddings are L2-normalized and aligned using a contrastive video-text loss, supplemented by video-text matching and video captioning losses. We release three versions of Cosmos-Embed1, optimized to process 8 frames per video at 224p, 336p, and 448p resolution, respectively.
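
To make the dataflow concrete, here is a minimal PyTorch sketch of the forward pass described above. Module names, the query count, and dimensions are illustrative assumptions, not the released implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CosmosEmbed1Sketch(nn.Module):
    # Hypothetical wrapper: vit, qformer, and text_encoder are stand-ins for
    # the ViT backbone, the QFormer, and its self-attention text branch.
    def __init__(self, vit, qformer, text_encoder, dim=768, num_queries=32, num_frames=8):
        super().__init__()
        self.vit = vit
        self.qformer = qformer
        self.text_encoder = text_encoder
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.temporal_emb = nn.Parameter(torch.zeros(num_frames, dim))

    def embed_video(self, frames):                        # frames: (B, T, C, H, W)
        B, T = frames.shape[:2]
        feats = self.vit(frames.flatten(0, 1))            # (B*T, P, D) per-frame patch tokens
        feats = feats.unflatten(0, (B, T))                # (B, T, P, D)
        feats = feats + self.temporal_emb[None, :, None]  # add temporal embeddings
        feats = feats.flatten(1, 2)                       # concatenate along time: (B, T*P, D)
        q = self.queries[None].expand(B, -1, -1)          # learned visual query tokens
        q = self.qformer(q, context=feats)                # cross-attend queries to frame features
        return F.normalize(q.mean(dim=1), dim=-1)         # pool + L2-normalize -> (B, D)

    def embed_text(self, token_ids):                      # token_ids: (B, L)
        t = self.text_encoder(token_ids)[:, 0]            # e.g. take the [CLS] token
        return F.normalize(t, dim=-1)                     # L2-normalize -> (B, D)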

Cosmos-Embed1 was trained on a curated dataset of nearly 8M paired video-text samples spanning diverse domains: general-purpose videos, human actions, robotic tasks, and ego-centric vehicle scenarios. The model supports text-to-video retrieval, inverse video search, semantic deduplication, and kNN classification, and serves as a base for video curation tasks. Cosmos-Embed1 achieves state-of-the-art performance on autonomous vehicle (AV) and robotics datasets while maintaining competitive performance in general domains.
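
Because both modalities share one L2-normalized embedding space, the curation tasks above reduce to dot products over precomputed embeddings. A small NumPy sketch, with random vectors standing in for real model outputs:

import numpy as np

rng = np.random.default_rng(0)

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

video_emb = l2norm(rng.standard_normal((1000, 768)))  # stand-in video embeddings
text_emb = l2norm(rng.standard_normal((1, 768)))      # stand-in text query embedding

# Text-to-video search: rank clips by cosine similarity to the text query.
scores = video_emb @ text_emb[0]
top10 = np.argsort(-scores)[:10]

# Semantic deduplication: flag clip pairs above a similarity threshold.
sim = video_emb @ video_emb.T
np.fill_diagonal(sim, -1.0)
duplicate_pairs = np.argwhere(sim > 0.95)

# kNN classification: label a clip by majority vote over its nearest neighbors.
k = 5
neighbors = np.argsort(-sim[0])[:k]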

[Cosmos-Embed1 architecture diagram]

Intermediate Feature Visualizations

Analysis of intermediate video features (via robust PCA) suggests that Cosmos-Embed1 (1B) captures dense semantic information more effectively than video-text counterparts such as InternVideo2-s2 (6B) and Perception Encoder Core-G (2B). This apparent advantage shows in the model's ability to maintain consistent feature representations across frames, likely a result of the training-time embedding distillation objectives.

[Intermediate feature visualizations: Cosmos-Embed1 (1B) | InternVideo2-s2 (6B) | Perception Encoder Core-G (2B)]
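
As a rough illustration of how such visualizations can be produced, the sketch below projects per-patch features onto their top three principal components and renders them as RGB. It uses plain PCA for brevity, whereas the analysis above uses a robust PCA variant:

import numpy as np

def pca_rgb(patch_feats, h, w):
    """patch_feats: (T, h*w, D) intermediate features for T frames."""
    T, P, D = patch_feats.shape
    X = patch_feats.reshape(-1, D).astype(np.float64)
    X -= X.mean(axis=0)                                # center features
    _, _, Vt = np.linalg.svd(X, full_matrices=False)   # principal directions
    rgb = X @ Vt[:3].T                                 # top-3 components per patch
    lo, hi = rgb.min(axis=0), rgb.max(axis=0)
    rgb = (rgb - lo) / (hi - lo + 1e-8)                # rescale to [0, 1] for display
    return rgb.reshape(T, h, w, 3)                     # one RGB map per frame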

Quantitative Results

General & Actions Domain

Cosmos-Embed1 is competitive with contemporary state-of-the-art embedding models in general video retrieval and outperforms the state of the art in action recognition.

| Embedding Model | MSR-VTT T2V R@1 | MSR-VTT V2T R@1 | Kinetics-400 (val) Accuracy | Kinetics-400 (val) F1 | Kinetics-600 (val) Accuracy | Kinetics-600 (val) F1 | Kinetics-700 (val) Accuracy | Kinetics-700 (val) F1 |
|---|---|---|---|---|---|---|---|---|
| InternVideo2-s2 (6B) | 48.00 | 46.40 | 62.64 | 60.58 | 59.75 | 57.14 | 52.38 | 49.81 |
| Perception Encoder Core-448p-G (2B) | 51.00 | 49.90 | 76.71 | 76.00 | 76.02 | 75.14 | 68.84 | 68.28 |
| Cosmos-Embed1-336p (1B) | 47.60 | 47.60 | 87.77 | 87.66 | 88.19 | 88.06 | 74.74 | 74.57 |

Retrieval metrics with DSL (dual softmax) re-ranking, linearly spaced frames with square resizing for all methods, at bfloat16 precision. For action recognition on Kinetics, we use the standard prompt templates from PE and take the max similarity over all templates per class.
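
For reference, a NumPy sketch of DSL re-ranking as commonly implemented: the similarity matrix is re-weighted elementwise by a softmax over the opposite axis before ranking. The temperature value here is an assumption (a factor around 100 is typical in published implementations):

import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dsl_rerank(sim, temp=100.0):
    """sim: (num_texts, num_videos) cosine-similarity matrix."""
    # Re-weight each column by how strongly each video prefers each text.
    return sim * softmax(temp * sim, axis=0)

def t2v_r1(sim):
    # Text-to-video R@1: fraction of texts whose ground-truth video ranks
    # first, assuming text i matches video i.
    ranked = dsl_rerank(sim)
    return float((ranked.argmax(axis=1) == np.arange(sim.shape[0])).mean())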

Robotics Domain

Cosmos-Embed1 outperforms contemporary state-of-the-art embedding models in robotics scenario retrieval across different robot embodiments and viewpoints.

| Embedding Model | 1x Robotics T2V R@1 | 1x Robotics V2T R@1 | AgiBot T2V R@1 | AgiBot V2T R@1 | Bridge T2V R@1 | Bridge V2T R@1 | RT-1 T2V R@1 | RT-1 V2T R@1 |
|---|---|---|---|---|---|---|---|---|
| InternVideo2-s2 (6B) | 36.50 | 30.00 | 1.20 | 7.82 | 8.17 | 6.16 | 3.78 | 2.45 |
| Perception Encoder Core-448p-G (2B) | 36.50 | 24.50 | 1.16 | 0.83 | 7.19 | 5.24 | 2.63 | 2.58 |
| Cosmos-Embed1-336p (1B) | 61.50 | 65.50 | 7.04 | 6.33 | 24.51 | 22.90 | 19.94 | 20.21 |

Retrieval metrics with DSL re-ranking, linearly spaced frames with square resizing for all methods, at bfloat16 precision.

AV Domain

Cosmos-Embed1 outperforms contemporary state-of-the-art embedding models in AV action recognition (Internal AV Actions dataset with 22 action classes) and scenario retrieval (OpenDV).

| Embedding Model | AV Actions (Internal, 22 classes) Accuracy | AV Actions (Internal, 22 classes) F1-Score | OpenDV T2V R@1 | OpenDV V2T R@1 |
|---|---|---|---|---|
| InternVideo2-s2 (6B) | 12.71 | 10.54 | 7.33 | 6.60 |
| Perception Encoder Core-448p-G (2B) | 23.01 | 21.89 | 9.58 | 9.30 |
| Cosmos-Embed1-336p (1B) | 72.23 | 72.28 | 34.42 | 34.67 |

Retrieval metrics with DSL re-ranking, linearly spaced frames with square resizing for all methods, at bfloat16 precision. For action recognition on the internal AV Actions dataset, we use prompt templates specific to ego-vehicle actions.
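
The template-based classification protocol can be sketched as follows; the templates and class names here are hypothetical placeholders, not the ones used for the reported numbers:

import numpy as np

# Hypothetical ego-vehicle templates and classes, for illustration only.
templates = ["a video of the ego vehicle {}.", "the car is {} in traffic."]
classes = ["turning left", "turning right", "braking"]
prompts = [[t.format(c) for t in templates] for c in classes]

def classify(video_emb, text_emb):
    """video_emb: (D,), text_emb: (num_classes, num_templates, D), L2-normalized."""
    sims = text_emb @ video_emb          # (num_classes, num_templates)
    per_class = sims.max(axis=1)         # max similarity over templates per class
    return int(per_class.argmax())       # predicted class index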

Cosmos-Embed1 Search Demo

Cosmos-Embed1 in action with Text-to-Video and Video-to-Video search on real-world examples.

Citation

Please cite this work as NVIDIA et al. using the following BibTeX entry:

@misc{nvidia2025cosmosembed1,
  title={Cosmos-Embed1: A Joint Video-Text Embedder for Physical AI},
  author={NVIDIA and Ferroni, Francesco and Chattopadhyay, Prithvijit and Heinrich, Greg and Ranzinger, Mike and Amoroso, Roberto and Luo, Alice and Wang, Andrew and Liu, Ming-Yu},
  url={https://research.nvidia.com/labs/dir/cosmos-embed1/},
  year={2025}
}