TokenGS:
Decoupling 3D Gaussian Prediction
from Pixels with Learnable Tokens
TL;DR: TokenGS predicts 3D Gaussians with a self-supervised rendering objective. An encoder–decoder feeds learnable Gaussian tokens in as decoder queries, so the number of primitives is not tied to image resolution or view count.
Gallery
DL3DV Results
Interactive viewer for 6-view reconstruction on DL3DV (448×256 resolution).
RE10K Results
Comparison between our method and GS-LRM on 2-view reconstruction on RE10K (256×256 resolution). Note the GS-LRM artifacts visible from a bird’s-eye view.
Test-Time Training
Comparison of three test-time training methods.
Scene Extrapolation
Comparison between our method and GS-LRM on scene extrapolation. Both methods are finetuned with extrapolation view sampling.
Left: GS-LRM. Middle: Ours. Right: GT.
Dynamic Reconstruction
Comparison between BTimer and our method on dynamic reconstruction.
Left: BTimer. Right: Ours.
Emergent Scene Flow
Trajectories of the dynamic Gaussians across time.
Abstract
In this work, we revisit several key design choices of modern Transformer-based approaches for feed-forward 3D Gaussian Splatting (3DGS) prediction. We argue that the common practice of regressing Gaussian means as depths along camera rays is suboptimal, and instead propose to directly regress 3D mean coordinates using only a self-supervised rendering loss. This formulation allows us to move from the standard encoder-only design to an encoder-decoder architecture with learnable Gaussian tokens, thereby unbinding the number of predicted primitives from the input image resolution and the number of views.
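The decoupling above follows from a basic property of cross-attention: the output has one row per query, regardless of how many key/value features the encoder produces. A minimal numpy sketch illustrates this; the token count, feature dimension, and single-head attention here are illustrative placeholders, not the actual TokenGS architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_attention(queries, keys_values, d):
    # Scaled dot-product attention: tokens (queries) attend to image features.
    scores = queries @ keys_values.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values

d = 32            # feature dimension (illustrative)
num_tokens = 128  # learnable Gaussian tokens (illustrative count)
tokens = rng.normal(size=(num_tokens, d))  # learned decoder queries

# Two different resolutions / view counts -> same number of output tokens.
for h, w, views in [(16, 16, 2), (28, 16, 6)]:
    feats = rng.normal(size=(h * w * views, d))  # stand-in encoder output
    out = cross_attention(tokens, feats, d)
    print(out.shape)  # (128, 32) in both cases
```

Because the primitive count is set by `num_tokens` rather than by the pixel grid, the same decoder handles 2-view 256×256 inputs and 6-view 448×256 inputs without changing the number of predicted Gaussians.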
Our resulting method, TokenGS, demonstrates improved robustness to pose noise and multiview inconsistencies, while naturally supporting efficient test-time optimization in token space without degrading learned priors. TokenGS achieves state-of-the-art feed-forward reconstruction performance on both static and dynamic scenes, producing more regularized geometry and a more balanced 3DGS distribution, while seamlessly recovering emergent scene attributes such as static-dynamic decomposition and scene flow.
Method
Model Architecture. The model follows an encoder-decoder structure. In the decoder, learnable 3DGS tokens are fed in as queries and decoded into the final Gaussian attributes. After the base model is trained, we allow test-time token tuning on the input images to further improve reconstruction quality.
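Test-time token tuning optimizes only the tokens against the rendering loss while the trained network stays frozen, which is why the learned priors are preserved. The toy sketch below is an assumption-laden stand-in: a fixed linear map plays the role of the frozen decoder/renderer, a target vector plays the role of the input images, and the gradient is the analytic MSE gradient rather than a differentiable rasterizer.

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen "decoder": a fixed linear map from token space to rendered output
# (a stand-in for the trained network, whose weights are never updated).
W = rng.normal(size=(8, 4)) * 0.5

target = rng.normal(size=8)  # stand-in for the rendering supervision
tokens = rng.normal(size=4)  # only the tokens are optimized

def loss(t):
    r = W @ t - target
    return float(r @ r)

lr = 0.05
before = loss(tokens)
for _ in range(200):
    # Analytic gradient of ||W t - target||^2 w.r.t. the tokens only.
    grad = 2.0 * W.T @ (W @ tokens - target)
    tokens -= lr * grad
after = loss(tokens)
print(after < before)  # True: the fit improves while W stays frozen
```

Restricting the optimization to token space keeps the update cheap (the token vector is far smaller than the network) and cannot overwrite the learned weights, unlike full-model test-time training.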