Each video shows the original video (left), the reconstruction with the token usage mask overlaid (middle), and the token usage mask itself (right): brighter regions indicate tokens kept by InfoTok, black regions indicate tokens dropped (the bar shows the average token usage in each frame). Notice how the tokenizer automatically spends more tokens on dynamic, information-rich areas and fewer on static or predictable regions.

Motivation

The recent wave of foundational generative models, from large language models to video generators, has converged on a shared recipe: compress raw data into a compact latent representation, then model that representation with powerful generators such as autoregressive models and diffusion models. For visual content, this first step is tokenization: turning pixels into a sequence of tokens (discrete or continuous) that downstream models can consume and generate.

Arguably, what makes a good tokenizer, or whether we even need one for visual content, remains an open question. We believe that a good tokenizer should at least meet the following criteria:

Today's visual tokenizers fall short of these criteria. They chop frames into uniform grids and compress every video at a fixed rate: a static surveillance feed and a fast-paced action scene receive exactly the same number of tokens. But is a mostly still landscape really as complex as a crowded street with pedestrians, cars, and changing traffic lights? Obviously not.

Intuitively, a good tokenizer should spend more tokens where the content is complex and fewer where it is simple. This naturally raises the question: how much compression is right for a given video, and can we let the content itself decide? In this work, we turn to information theory for an answer.

TL;DR: InfoTok is an information-theoretic video tokenizer that adaptively allocates token lengths based on video complexity. Grounded in Shannon's theory, it uses an ELBO-based router for near-optimal compression, achieving 2.3× compression while outperforming prior heuristic adaptive approaches with 11× faster inference.

Theory

Let us formalize this intuition. In the second teaser video, both sides of the frame are completely white throughout, yet a fixed-rate tokenizer still spends the same number of tokens on those empty regions as on the actual content in the center. This is pure waste.

Shannon's information theory tells us: the more predictable something is, the fewer bits (tokens) it should take to represent it. Conversely, rare and surprising content deserves more bits. When this principle is followed, the total representation cost is minimized. Formally:

Optimal tokenization theorem

T: any tokenizer with codebook size C; Nx: number of tokens assigned to video x; p(x): probability of video x under the data distribution; HC(D): entropy of the data distribution (base C). The average token count is lower-bounded by the entropy, and an adaptive tokenizer can get within one token of it.
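Spelling out the bound described in the caption (a sketch in our notation; the paper states the precise conditions), the classic source-coding argument gives:

```latex
\mathbb{E}_{x \sim \mathcal{D}}\!\left[N_x\right] \;\ge\; H_C(\mathcal{D})
\qquad\text{and}\qquad
N_x = \left\lceil -\log_C p(x) \right\rceil
\;\Rightarrow\;
\mathbb{E}_{x \sim \mathcal{D}}\!\left[N_x\right] \;<\; H_C(\mathcal{D}) + 1 .
```

That is, no tokenizer can beat the entropy on average, and assigning each video a token count proportional to its negative log-likelihood lands within one token of the optimum.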

The key insight is that the optimal token count for each video depends on its likelihood p(x). More probable (predictable) videos should receive fewer tokens, following Nx ≈ −logC p(x). This is the same idea behind Huffman coding in language: frequent characters like "e" get short codes, while rare ones like "z" get long codes.

Huffman tree illustration

In letter encoding, Huffman coding assigns shorter codes to more frequent letters. The same principle should apply to visual tokens.
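To make the analogy concrete, here is a minimal Huffman construction over a toy alphabet. The letter frequencies are illustrative (chosen dyadic so code lengths match −log2 p exactly), not real English statistics:

```python
import heapq
import math

def huffman_code_lengths(freqs):
    """Build a Huffman tree and return the code length (bits) for each symbol."""
    # Heap entries: (weight, tiebreak id, {symbol: depth-so-far})
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

# Hypothetical letter frequencies (illustrative, not real English statistics)
freqs = {"e": 0.5, "t": 0.25, "z": 0.125, "q": 0.125}

lengths = huffman_code_lengths(freqs)
for s, p in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(f"{s}: p={p:.3f}  -log2(p)={-math.log2(p):.1f} bits  huffman={lengths[s]} bits")
```

Frequent "e" gets a 1-bit code while rare "z" and "q" get 3 bits each, exactly the −log2 p ideal. InfoTok applies the same allocation logic at the level of whole videos rather than letters.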

Method: From Theory to Practice

So we know that optimal tokenization should adapt to each video's information content. Two concrete questions follow:

  1. How many tokens should each video get? The theorem tells us the answer depends on p(x), but we can't compute the true likelihood of a video. How do we estimate the right token count in practice?
  2. Given that number, how do we actually encode the tokens? Standard tokenizers produce fixed-length sequences. We need an architecture that can compress into a variable number of tokens and still reconstruct well.

InfoTok addresses these two challenges with two corresponding components: an ELBO-based router (for deciding how many tokens) and an adaptive compressor (for encoding them). Both sit on top of any existing fixed-rate base tokenizer (we use Cosmos) as plug-in modules.

InfoTok method overview

Overview of the InfoTok framework. A router decides the token count Nx based on video complexity; the adaptive compressor converts fixed-length embeddings into Nx discrete tokens.

How many tokens? Let ELBO decide

We can't compute p(x) directly, but we can compute its evidence lower bound (ELBO), a tractable proxy that measures how "predictable" a video is under the base tokenizer. This leads to a key theoretical result:
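For readers unfamiliar with the ELBO: viewing the base tokenizer as a latent-variable model with encoder $q(z \mid x)$ and decoder $p(x \mid z)$ (the standard formulation; the paper specifies the exact variational family used), the bound reads:

```latex
\log p(x) \;\ge\; \mathrm{ELBO}(x)
  \;=\; \mathbb{E}_{q(z \mid x)}\!\left[\log p(x \mid z)\right]
  \;-\; \mathrm{KL}\!\left(q(z \mid x) \,\|\, p(z)\right)
```

The reconstruction term measures how well the latents explain the video, and the KL term penalizes latents that stray from the prior; both can be evaluated with a single encoder-decoder pass.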

InfoTok ELBO theorem

Tadaptive = (T, r, Mψ): an adaptive tokenizer consisting of base tokenizer T, router r, and adaptive compressor Mψ; β: parameter controlling average compression level; Lrecon: reconstruction loss. The theorem says: if we use the ELBO-based router and train to minimize reconstruction loss, the expected token count is bounded by the entropy plus a gap term that vanishes when the ELBO is tight.

In plain language: using ELBO to decide the token count achieves near-optimal compression. Higher ELBO (more predictable content) → fewer tokens; lower ELBO (more complex content) → more tokens. Concretely, the router computes:

ELBO equation
Router equation

where β controls the average compression level. The ELBO can be computed cheaply from any pre-trained tokenizer, with no additional model needed.
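As a rough sketch of how such a router could work (the function name, clipping range, and codebook size below are our illustrative assumptions, not the paper's exact formula): since $N_x \approx -\log_C p(x)$ and the ELBO lower-bounds $\log p(x)$, dividing the negative ELBO (in nats) by $\ln C$ gives a token-count surrogate, which $\beta$ rescales to a target average compression level:

```python
import math

def route_token_count(elbo_nats, codebook_size=64_000, beta=1.0,
                      n_min=8, n_max=256):
    """Map an ELBO estimate (in nats) to a token budget.

    Illustrative sketch: -ELBO / ln(C) approximates -log_C p(x), the
    near-optimal token count; beta rescales it, and the result is
    clipped to the compressor's supported range (hypothetical bounds).
    """
    n = beta * (-elbo_nats) / math.log(codebook_size)
    return max(n_min, min(n_max, math.ceil(n)))

# Predictable (high-ELBO) videos get fewer tokens than complex (low-ELBO) ones.
static_clip = route_token_count(elbo_nats=-200.0)    # near-static content
busy_clip = route_token_count(elbo_nats=-2000.0)     # dense motion
print(static_clip, busy_clip)
```

The monotone mapping is the essential property: a higher ELBO (more predictable video) always yields a smaller token budget.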

How to encode? Adaptive compression

Once the router decides the token count Nx, we need to compress fixed-length embeddings into a variable-length token sequence. We design a transformer-based adaptive (de-)compressor that learns to allocate information intelligently across a variable number of tokens.
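One simple way to realize variable-length compression, sketched below with a single cross-attention layer in NumPy (the paper's compressor is a full transformer with quantization; the random weights and query-bank truncation here are our illustrative assumptions): a bank of learned queries attends over the fixed-length embeddings, and only the first Nx queries are kept.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def compress(embeddings, n_tokens, queries, w_k, w_v):
    """Cross-attend a bank of learned queries onto fixed-length embeddings,
    keeping only the first n_tokens queries (the router's budget)."""
    q = queries[:n_tokens]                       # (n_tokens, d)
    k = embeddings @ w_k                         # (n, d)
    v = embeddings @ w_v                         # (n, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v                              # (n_tokens, d)

d, n_fixed, n_queries = 16, 64, 32
embeddings = rng.normal(size=(n_fixed, d))       # base tokenizer output
queries = rng.normal(size=(n_queries, d))        # learned query bank (random here)
w_k = rng.normal(size=(d, d))
w_v = rng.normal(size=(d, d))

short = compress(embeddings, 8, queries, w_k, w_v)    # aggressive compression
long = compress(embeddings, 24, queries, w_k, w_v)    # more tokens for complex clips
print(short.shape, long.shape)
```

Because the same query bank serves every budget, one model can emit 8 tokens for a static clip and 24 for a busy one without retraining; decompression mirrors this with queries attending back from the compressed tokens.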

InfoTok training algorithm

Algorithm 1: Training procedure of InfoTok.

Loss equation

More architectural details can be found in the paper.

Visualization: Seeing Is Believing

Adaptive tokenization in action

Below, we visualize the token masks generated by InfoTok. The left shows the original video; the right shows the reconstruction with its token usage overlaid. Brighter regions indicate where more tokens are allocated. Notice how InfoTok automatically concentrates tokens on moving objects and fine textures, while spending almost nothing on static backgrounds.

Driving scene (BDD): Tokens concentrate on moving vehicles and road edges, while the static sky and pavement receive almost none.

Robotic manipulation (Bridge): The robot arm and the object being manipulated light up, while the fixed tabletop background stays dark.

Egocentric biking (EgoExo4D): The entire scene is in motion due to camera movement, so tokens are spread more evenly; the black border outside the camera's field of view, however, is fully predictable and can be masked out almost entirely.

Reconstruction across compression levels

A unique property of InfoTok is that it can tokenize at any token length, gracefully trading off compression for quality. Each row below shows the same video reconstructed at five compression levels — from the original through progressively higher compression.

Driving scenes (BDD): Highway scenes with mostly static backgrounds degrade gracefully under heavy compression, while complex intersections need more tokens to preserve detail.

Robotic manipulation (Bridge): Fixed camera and simple backgrounds mean quality holds up well at low token counts; fine-grained hand-object interactions are the last to degrade.

Ego/exo activities (EgoExo4D): Fast camera motion and diverse activities make these information-dense; quality drops more at extreme compression, confirming complex content genuinely needs more tokens.

Internet videos (Panda-70M): A diverse mix of content. InfoTok handles all of them with a single model, adapting token counts to each video's complexity.

Results: The Numbers Back It Up

We evaluate InfoTok on standard video reconstruction benchmarks (TokenBench and DAVIS), using the Cosmos tokenizer as our base. We compare two InfoTok variants — InfoTok (fixed ELBO router) and InfoTok-Flex (flexible router) — against fixed-rate baselines and ElasticTok, the leading heuristic adaptive approach.

InfoTok main results table

Evaluation of fixed-length and adaptive tokenizers on TokenBench and DAVIS. We compare InfoTok with ElasticTok at two compression levels (0.81, 0.56) by setting our compression rates to theirs. The best results at each level are shown in bold.

The takeaway: InfoTok can save 20% of tokens with no quality loss, and at 2.3× compression it still outperforms ElasticTok across all metrics. The ELBO-based router consistently outperforms heuristic approaches at every compression level:

InfoTok vs ElasticTok comparison

Quality metrics (PSNR↑, LPIPS↓, FVD↓) vs. compression rate (BPP16) on TokenBench (a-c) and DAVIS (d-f). Inference efficiency is shown in (g). InfoTok dominates across all levels while being significantly more efficient.

Conclusion

InfoTok demonstrates that information theory provides not just a conceptual framework, but a practical and provably near-optimal recipe for adaptive visual tokenization. By replacing heuristic compression strategies with an ELBO-based router grounded in Shannon's theory, we achieve better reconstruction quality with significantly fewer tokens, and with minimal overhead on top of existing tokenizers.

We believe this is just the beginning. Several exciting directions lie ahead:

We hope InfoTok and its adaptive tokenization framework can be useful to the community, whether as a drop-in replacement for fixed-rate tokenizers, or as a starting point for building more efficient visual generation systems.

Citation

@inproceedings{ye2026infotok,
  title={InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression},
  author={Ye, Haotian and He, Qiyuan and Han, Jiaqi and Li, Puheng and Fan, Jiaojiao and Hao, Zekun and Reda, Fitsum and Balaji, Yogesh and Chen, Huayu and Liu, Sheng and Yao, Angela and Zou, James and Ermon, Stefano and Wang, Haoxiang and Liu, Ming-Yu},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}