Cosmos Tokenizer — Cosmos Lab

Introduction

(a) Continuous tokenizers

(b) Discrete tokenizers

Figure 1: Comparison of continuous (left) and discrete (right) tokenizers on spatio-temporal compression rate (log scale) vs. reconstruction quality (PSNR). Each point shows a tokenizer configuration's compression-quality trade-off. Our tokenizer achieves superior quality even at higher compression rates compared to other methods.

Tokenizers are key components in generative AI, converting raw data into efficient, compressed representations by discovering latent spaces through unsupervised learning. Visual tokenizers specifically transform high-dimensional visual data, like images and videos, into compact semantic tokens, enabling efficient large-scale model training and reducing computational demands for inference. Figure 2 illustrates a tokenization pipeline for videos.

Figure 2: Video tokenization pipeline. An input video is encoded into tokens, which are usually much more compact than the input video. The decoder then reconstructs the input video from the tokens. Tokenizer training is about learning the encoder and decoder to preserve the visual information in the tokens maximally.

There are two types of tokenizers: continuous and discrete. Continuous tokenizers map visual data to continuous embeddings, suitable for models sampling from continuous distributions, such as Stable Diffusion (Rombach et al., 2022). Discrete tokenizers map visual data to quantized indices, used in models like VideoPoet (Kondratyuk et al., 2024) that rely on cross-entropy loss for training, like the GPT models. Figure 3 compares these token types.

(a) Continuous tokenizers

(b) Discrete tokenizers

Figure 3: Tokens along spatial (\(\frac{H}{s_{HW}} \times \frac{W}{s_{HW}}\)) and temporal (\(1+\frac{T}{s_{T}}\)) dimensions, with a spatial compression factor of \(s_{HW}\) and a temporal compression factor of \(s_{T}\). The first temporal token represents the first input frame, enabling joint image (\(T=0\)) and video (\(T>0\)) tokenization in a shared latent space. (a) Continuous latent embeddings with an embedding size of \(C\), and (b) Quantized indices, each color representing a discrete latent code.

Tokenizers must balance high compression with quality, preserving visual details in the latent space. We introduce Cosmos Tokenizer, a comprehensive suite of continuous and discrete visual tokenizers for images and videos that delivers remarkable compression, high-quality reconstruction, and speeds up to 12X faster than previous methods. It supports various image and video types, with flexible compression rates to accommodate diverse computational constraints, as shown in Table 1.

Table 1: Comparison of different visual tokenizers and their capabilities.
Models	Causal	Image	Video	Joint	Discrete	Continuous
FLUX (FLUX, 2024)	—	✓	✗	✗	✗	✓
Open-MAGVIT2 (Luo et al., 2024)	—	✓	✗	✗	✓	✗
LlamaGen (Sun et al., 2024)	—	✓	✗	✗	✓	✗
VideoGPT (Yan et al., 2021)	✗	✗	✓	✗	✓	✗
Omni (Wang et al., 2024)	✗	✓	✓	✓	✓	✓
CogX-VAE (Yang et al., 2024)	✓	✓	✓	✓	✗	✓
Cosmos (Ours)	✓	✓	✓	✓	✓	✓

Cosmos Tokenizer is built on a lightweight, temporally causal architecture, using causal temporal convolution and attention layers to maintain the order of video frames. This unified design allows seamless tokenization for both images and videos.

We train Cosmos Tokenizer on high-resolution images and long videos, covering a wide range of aspect ratios (including 1:1, 3:4, 4:3, 9:16, and 16:9) across diverse categories of data. During inference, it is agnostic to temporal length, allowing it to tokenize beyond its training length.

We evaluated Cosmos Tokenizer on standard datasets, including MS-COCO 2017, ImageNet-1K, FFHQ, CelebA-HQ, and DAVIS. To standardize video tokenizer evaluation, we also curated a new dataset, called TokenBench, covering categories like robotics, driving, and sports, which we have made publicly available at github.com/NVlabs/TokenBench. Results (Figure 1) show Cosmos Tokenizer significantly outperforms existing methods, with a +4 dB PSNR gain on DAVIS videos. It tokenizes up to 12× faster and can encode up to 8-second 1080p and 10-second 720p videos on NVIDIA A100 GPUs with 80GB memory. Pretrained models, with spatial compressions of 8× and 16× and temporal compressions of 4× and 8×, are available at github.com/NVIDIA/Cosmos-Tokenizer.

Architecture

The Cosmos Tokenizer is designed as an encoder-decoder architecture. Given an input video \(x_{0:T} \in \mathbb{R}^{(1+T) \times H \times W \times 3}\) with height \(H\), width \(W\), and frames \(T\), the encoder (\(\mathcal{E}\)) compresses it into tokens \(z_{0:T'} \in \mathbb{R}^{(1+T') \times H' \times W' \times C}\) with spatial and temporal compression factors \(s_{HW}\) and \(s_T\), respectively. The decoder (\(\mathcal{D}\)) reconstructs the video as \(x'_{0:T} \in \mathbb{R}^{(1+T) \times H \times W \times 3}\), mathematically as:

\[x'_{0:T} = \mathcal{D}\Big(\mathcal{E}\big(x_{0:T}\big)\Big).\]

Our architecture employs a temporally causal design, processing only current and past frames at each stage. The tokenizer operates in the wavelet transform space, where a 2-level wavelet transform downsamples inputs by a factor of four, grouping frames as \(\{g_0, g_1, \dots\} = \{x_0, x_{1:4}, x_{5:8}, \dots\}\). Subsequent stages encode these groups causally, ultimately outputting tokens \(z_{0:T'}\). This causal design is essential for physical AI applications, which often operate in a similar causal setting, and the wavelet transform enables effective compression by emphasizing semantic content.

(a) Temporal Causality: Illustration of the temporal causality mechanism, where inputs \(x_0, x_1, \dots, x_{12}\) are processed through grouped intermediate outputs via wavelet transform \(g_0, g_1, \dots\), and further refined by causal spatio-temporal convolution and attention operations.

(b) Network Architecture: The encoder-decoder network structure includes a 3D Haar wavelet, causal residual, downsampling, and spatio-temporal attention blocks. The decoder mirrors the encoder's structure.

Figure 4: Cosmos Tokenizer architecture.

Our encoder stages (after the wavelet transform) consist of residual blocks interleaved with down-sampling blocks. Each encoder block applies a spatio-temporal factorized 3D convolution: a 2D convolution of \(1 \times k \times k\) captures spatial information, followed by a temporal convolution of \(k \times 1 \times 1\) to capture dynamics, with left padding of \(k-1\) ensuring causality. Long-range dependencies are captured through a spatio-temporal factorized causal self-attention with a global support region, e.g., \(1+T'\) in the last encoder block. The decoder mirrors the encoder but replaces downsampling with upsampling. Figure 4 shows the Cosmos Tokenizer architecture.

We use a vanilla autoencoder (AE) to model continuous tokenizers' latent space and Finite-Scalar-Quantization (FSQ) (Mentzer et al., 2023) for discrete tokenizers. The latent dimension for continuous tokenizers is 16, while for discrete tokenizers it is 6, with FSQ levels of (8, 8, 8, 5, 5, 5), resulting in a vocabulary size of 64,000.

Training

We use a joint training strategy, alternating mini-batches of images and videos at a preset frequency. We supervise only the final output of the tokenizer's decoder, without auxiliary losses in the latent spaces (e.g., commitment or KL prior losses). For instance, a VAE for continuous tokenizers would require a KL prior loss, and a VQ-VAE for discrete quantization would require a commitment loss, unlike our vanilla AE and FSQ approach.

We employ a two-stage training scheme. In the first stage, we optimize with the L1 loss that minimizes the pixel-wise RGB difference between the input and reconstructed video (\(\hat{x}_{0:T}\)), given by \[\mathcal{L}_1 = \left\lVert \hat{x}_{0:T} - x_{0:T} \right\rVert_1,\] and the perceptual loss based on the VGG-19 features (Simonyan and Zisserman, 2014), given by, \[\mathcal{L}_{\text{Perceptual}} = \frac{1}{L} \sum_{l=1}^{L} \sum_{t} \alpha_l \left\lVert \texttt{VGG}_l (\hat{x}_t) - \texttt{VGG}_l (x_t) \right\rVert_1,\] where \(\texttt{VGG}_l (\cdot) \in \mathbb{R}^{H \times W \times C}\) is the features from the \(l\)-th layer of a pre-trained VGG-19 network, \(L\) is the number of layers considered, and \(\alpha_{l}\) is the weight of the \(l\)-th layer. When \(x\) is a video (i.e. \(T>0\)), we compute the perceptual loss as the average of the per-frame perceptual losses.

In the second stage, we use the optical flow (\(\texttt{OF}\)) loss (Teed and Deng, 2020) to handle the temporal smoothness of reconstructed videos, \[\mathcal{L}_{\text{Flow}} = \frac{1}{T} \sum_{t=2}^{T+1} \left\lVert \texttt{OF}(\hat{x}_{t}, \hat{x}_{t-1}) - \texttt{OF}(x_{t}, x_{t-1}) \right\rVert_1 + \frac{1}{T} \sum_{t=1}^{T} \left\lVert \texttt{OF}(\hat{x}_{t}, \hat{x}_{t+1}) - \texttt{OF}(x_{t}, x_{t+1}) \right\rVert_1,\] and the Gram-matrix (\(\texttt{GM}\)) loss (Gatys et al., 2016) to enhance the sharpness of reconstructed images, \[\mathcal{L}_{\text{Gram}} = \frac{1}{L} \sum_{l=1}^{L} \sum_t \alpha_l \left\lVert \texttt{GM}_l (\hat{x}_t) - \texttt{GM}_l (x_t) \right\rVert_1.\]

Additionally, we use adversarial loss in the fine-tuning stage to further enhance reconstruction details, particularly at large compression rates.

Evaluation

We extensively evaluated our Cosmos Tokenizer suite on diverse image and video benchmark datasets. For image tokenizers, we followed previous work and evaluated on MS-COCO 2017 (Lin et al., 2015), ImageNet-1K (Deng et al., 2009), FFHQ (Karras et al., 2019), and CelebA-HQ (Karras, 2017).

Figure 5: Example videos from TokenBench, including egocentric, driving, robotic manipulation, and web videos.

TokenBench To address the lack of standardized benchmarks for high-resolution, long-duration videos, we introduced TokenBench, covering various domains such as robotics, driving, egocentric, and web videos. TokenBench includes 100 randomly sampled, 10-second clips from each of BDD100K (Yu et al., 2020), EgoExo-4D (Grauman et al., 2024) BridgeData V2 (Walke et al., 2023), and Panda-70M (Chen et al., 2024). We manually filtered low-quality videos in Panda-70M and selected 100 scenes from EgoExo-4D, with one egocentric and one exocentric video per scene, totaling 500 videos. Examples of TokenBench are shown in Figure 5, and we have made it publicly available for the community at github.com/NVlabs/TokenBench. We also evaluated our video tokenizers on the DAVIS (Perazzi et al., 2016) dataset at 1080p.

Baselines and Evaluation Metrics We evaluated our tokenizers across different compression rates, comparing them with state-of-the-art (SOTA) image and video tokenizers (see Table 1 for details). The evaluation metrics include PSNR, SSIM, reconstruction Fréchet Inception Distance (rFID) for images, and Fréchet Video Distance (rFVD) for videos.

Qualitative Results

CV-1 continuous video tokenizer reconstruction

Figure 6: Comparison of video reconstruction with continuous video tokenizer with 4×8×8 compression (CV4×8×8).

Figure 6 and 7 show the reconstructed video frames using continuous video tokenizers, covering diverse scenarios such as natural scenes, autonomous driving, robotic manipulation, and egocentric videos. Figure 9 shows reconstructed images with different discrete image tokenizers, while Figure 8 presents error maps for continuous image tokenizers to highlight reconstruction differences. Compared to previous methods, Cosmos Tokenizer more effectively preserves structural and high-frequency details (e.g., grass, tree branches, text) with minimal visual distortion (e.g., faces, text) and ghosting.

Cosmos

Omni

Cosmos

Omni

Cosmos

Omni

Cosmos

Omni

Figure 7: Comparison of video reconstruction with discrete video tokenizers with 4×8×8 compression (DV4×8×8).

These qualitative results show that Cosmos Tokenizer is capable of encoding and decoding a wide range of visual contents and has the capability to preserve the highest visual quality of images and videos.

Figure 8: Comparison of image reconstruction error maps with continuous image tokenizers with 8×8 compression (CI8×8).

Figure 9: Comparison of image reconstruction with discrete image tokenizers with 16×16 compression (DI16×16).

Quantitative Results

Tables 2 and 3 present average quantitative metrics for continuous and discrete video tokenizers on various benchmarks. Cosmos Tokenizer achieves state-of-the-art performance on both the DAVIS and TokenBench datasets at a 4×8×8 compression ratio. Even at higher compression rates (8×8×8 and 8×16×16), Cosmos Tokenizer outperforms previous methods, demonstrating an excellent compression-quality trade-off.

Table 2: Evaluation of continuous video tokenizers on DAVIS and TokenBench
Tokenizer	Compression Ratio	Formulation	DAVIS			TokenBench
Tokenizer	Compression Ratio	Formulation	PSNR	SSIM	rFVD	PSNR	SSIM	rFVD
CogVideoX	4 × 8 × 8	VAE	29.29	0.864	19.5800	32.06	0.909	6.9700
Omnitokenizer	4 × 8 × 8	VAE	22.23	0.713	117.6600	24.48	0.830	35.8600
Cosmos-CV	4 × 8 × 8	AE	32.80	0.900	15.9300	35.45	0.928	6.8490
Cosmos-CV	8 × 8 × 8	AE	30.61	0.856	30.1600	34.44	0.917	11.6200
Cosmos-CV	8 × 16 × 16	AE	27.60	0.779	93.8200	31.61	0.875	43.0800

Table 3: Evaluation of discrete video tokenizers on DAVIS and TokenBench
Tokenizer	Compression Ratio	Quantization	DAVIS			TokenBench
Tokenizer	Compression Ratio	Quantization	PSNR	SSIM	rFVD	PSNR	SSIM	rFVD
VideoGPT	4 × 4 × 4	VQ	28.17	0.850	72.3300	33.66	0.914	13.8500
Omnitokenizer	4 × 8 × 8	VQ	20.02	0.703	188.6000	25.31	0.827	53.5500
Cosmos-DV	4 × 8 × 8	FSQ	28.81	0.818	37.3600	31.97	0.888	19.6700
Cosmos-DV	8 × 8 × 8	FSQ	27.51	0.789	100.1500	30.95	0.873	43.8600
Cosmos-DV	8 × 16 × 16	FSQ	25.09	0.714	241.5200	28.91	0.829	113.4800

Similarly, Tables 4 and 5 show that Cosmos Tokenizer consistently achieves superior results on various image benchmarks at 8×8 compression rate. Notably, at 4× larger compression rate of 16×16, the image quality of Cosmos Tokenizer often matches or surpasses prior art at 8×8. These results confirm Cosmos Tokenizer's capability to represent visual content effectively with high spatio-temporal compression.

Table 4: Evaluation of continuous image tokenizers on (a) MS-COCO 2017, (b) ImageNet-1K, (c) FFHQ, and (d) CelebA-HQ
Tokenizer	Compression Ratio	Formulation	(a) MS-COCO 2017			(b) ImageNet-1K
			PSNR	SSIM	rFID	PSNR	SSIM	rFID
FLUX	8 × 8	VAE	24.01	0.692	2.5007	20.09	0.518	1.2293
Cosmos-CI	8 × 8	AE	28.66	0.836	1.7601	28.83	0.837	0.6890
Cosmos-CI	16 × 16	AE	23.63	0.663	3.8229	23.72	0.655	1.0307
			(c) FFHQ			(d) CelebA-HQ
FLUX	8 × 8	VAE	38.14	0.960	0.0428	41.01	0.987	0.0401
Cosmos-CI	8 × 8	AE	39.10	0.957	0.0421	46.26	0.988	0.0158
Cosmos-CI	16 × 16	AE	30.84	0.829	0.5035	33.84	0.905	0.3818

Table 5: Evaluation of discrete image tokenizers on (a) MS-COCO 2017, (b) ImageNet-1K, (c) FFHQ, and (d) CelebA-HQ
Tokenizer	Compression Ratio	Quantization	(a) MS-COCO 2017			(b) ImageNet-1K
			PSNR	SSIM	rFID	PSNR	SSIM	rFID
Open-MAGVIT2	16 × 16	LFQ	19.50	0.502	6.6488	17.00	0.398	2.7005
LlamaGen	8 × 8	VQ	21.99	0.617	4.1231	19.64	0.498	1.4034
LlamaGen	16 × 16	VQ	19.11	0.491	6.0767	18.38	0.448	1.6573
Cosmos-DI	8 × 8	FSQ	24.40	0.704	3.7102	24.48	0.701	1.2647
Cosmos-DI	16 × 16	FSQ	20.45	0.529	7.2338	20.49	0.518	2.5178
			(c) FFHQ			(d) CelebA-HQ
Open-MAGVIT2	16 × 16	LFQ	28.20	0.774	1.9943	29.46	0.844	2.8649
LlamaGen	8 × 8	VQ	31.24	0.868	0.7011	33.39	0.936	0.5020
LlamaGen	16 × 16	VQ	27.70	0.772	1.3656	28.86	0.837	1.1128
Cosmos-DI	8 × 8	FSQ	32.50	0.875	0.4588	35.55	0.939	0.2647
Cosmos-DI	16 × 16	FSQ	27.60	0.750	1.9813	28.91	0.808	1.6911

Extended Visual Results

To further demonstrate the effectiveness of Cosmos Tokenizer, we provide comparison videos across driving, robotics, and sport videos. Pause the videos at desired temporal locations to closely inspect image quality differences.

Previous SOTA Continuous Tokenizer

Our Cosmos Tokenizer

Previous SOTA Discrete Tokenizer

Our Cosmos Tokenizer

Previous SOTA Discrete Tokenizer

Our Cosmos Tokenizer

Previous SOTA Discrete Tokenizer

Our Cosmos Tokenizer

Previous SOTA Continuous Tokenizer

Our Cosmos Tokenizer

Runtime Performance

Table 6 compares the parameters and average encoding and decoding times per image or video frame on a single A100 80GB GPU. Cosmos Tokenizer achieves 2× to 12× faster speeds than prior state-of-the-art tokenizers while maintaining the smallest model size, demonstrating high efficiency in encoding and decoding visual content.

Together with the qualitative results, quantitative metric evaluations, and the efficiency evaluation, these results demonstrate that Cosmos Tokenizer is well placed to serve as an effective and efficient building block in both diffusion-based and autoregressive models for image and video generation and the development of a world model. You can try out Cosmos Tokenizer on your projects using the source code and pre-trained models, which are now publicly available.

Table 6: Runtime performance comparison of tokenizers.
Tokenizer	Type	Resolution	Compression Ratio	Parameters	Time (ms)
FLUX	Continuous-Image	1024×1024	8 × 8	84M	242.0
Cosmos-CI	Continuous-Image	1024×1024	8 × 8	77M	52.7
LlamaGen	Discrete-Image	1024×1024	8 × 8	70M	475.0
Cosmos-DI	Discrete-Image	1024×1024	8 × 8	79M	64.2
CogVideoX	Continuous-Video	720×1280	4 × 8 × 8	216M	414.0
OmniTokenizer	Continuous-Video	720×1280	4 × 8 × 8	54M	82.9
Cosmos-CV	Continuous-Video	720×1280	4 × 8 × 8	105M	34.8
OmniTokenizer	Discrete-Video	720×1280	4 × 8 × 8	54M	53.2
Cosmos-DV	Discrete-Video	720×1280	4 × 8 × 8	105M	51.5

Contributors

Fitsum Reda, Jinwei Gu, Xian Liu, Songwei Ge, Ting-Chun Wang, Haoxiang Wang, Ming-Yu Liu

Citation

Please cite as NVIDIA et al. using the following BibTex:

@article{nvidia2025cosmostokenizer,
  title={Cosmos Tokenizer: A Suite of Image and Video Neural Tokenizers},
  author={NVIDIA and Reda, Fitsum and Gu, Jinwei and Liu, Xian and Ge, Songwei and Wang, Ting-Chun and Wang, Haoxiang and Liu, Ming-Yu},
  journal={arXiv preprint arXiv:2501.03575},
  year={2025}
}