Déjà View: Looping Transformers for Multi-View 3D Reconstruction

¹NVIDIA ²University of Modena and Reggio Emilia, AImageLab ³University of Toronto, Vector Institute ⁴ETH Zürich ^*Equal contribution ^†Equal supervision


TL;DR

Déjà View reconstructs camera poses and dense geometry from any number of views by applying the same transformer block in a loop. A single checkpoint covers a range of step counts, so users can dial compute up or down at inference time. At only 117M parameters, Déjà View beats much larger feed-forward baselines (π³, Depth Anything 3-G) on five reconstruction benchmarks spanning indoor and outdoor scenes, while using 8–10× fewer parameters and 1.9–2.3× less compute.


Abstract

Recent feed-forward 3D reconstruction transformers have scaled to over a billion parameters, following the broader trend of increasing model capacity in computer vision. Yet emerging evidence suggests that contiguous transformer layers often behave like repeated applications of similar operations, and that multi-view reconstruction transformers refine their predictions progressively across decoder depth. We posit that model depth partially buys iteration, paid for inefficiently in unique parameters, and we instead make that iteration explicit in the architecture.

Our model, Déjà View, applies a single looped transformer block recurrently to per-view features for $K$ refinement steps. Trained once, it exposes $K$ as an inference-time compute knob, matching or outperforming substantially larger feed-forward baselines across five reconstruction benchmarks spanning indoor, outdoor, object-centric, and driving scenes, while using a fraction of their parameters and comparable or lower compute. Importantly, the same looped block formulation outperforms an otherwise identical variant with independent per-step parameters under matched training data and compute, suggesting that explicit iteration is not merely a compute-efficient substitute for capacity but a stronger inductive bias for multi-view 3D reconstruction.

Figure: Déjà View teaser, showing iterative refinement of camera poses and depth across K = 2, 4, 8, 16 steps. Given multiple input views (top-left), Déjà View reconstructs camera poses and consistent depth by repeatedly applying the same transformer block, with the number of refinement steps K exposed as an inference-time compute knob. Decoding the intermediate state of a single K = 16 forward pass at iterations k ∈ {2, 4, 8, 16} shows progressively sharper geometry and more accurate camera poses (right; frustums are colored by per-camera error). Across five benchmarks (bottom-left), Déjà View matches or surpasses much larger feed-forward baselines at a small fraction of their parameter count (dot area).

Interactive viewer

Slide the iteration control inside the viewer to watch the same forward pass refine itself. Frustums are colored by per-camera pose error after Sim(3) alignment to ground-truth poses.

Method

Déjà View initializes per-view features from a pretrained DINOv2 encoder and applies a single transformer block (with frame and global attention sub-blocks) recurrently $K$ times to refine the state, with each application conditioned on its continuous time interval $(t_k, t_{k+1})$. Because $K$ is sampled per batch from $[K_\text{min}, K_\text{max}]$ during training, one trained checkpoint covers any step count in that range at inference. Two lightweight heads then decode the final state into per-view depth and ray maps.
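The looping scheme described above can be illustrated with a toy sketch. This is not the authors' implementation: `shared_block` is a hypothetical scalar stand-in for the transformer block, the uniform time grid on $[0, 1]$ and the uniform draw of $K$ are assumptions, and the `rate` and `target` parameters exist only to give the toy something to converge to. What it does mirror is the structure: one set of parameters reused at every step, each step told which interval $(t_k, t_{k+1})$ it covers, and $K$ chosen freely at call time.

```python
import math
import random

def time_grid(num_steps):
    """Uniform grid 0 = t_0 < ... < t_K = 1 (uniform spacing is an assumption)."""
    return [k / num_steps for k in range(num_steps + 1)]

def shared_block(state, t_start, t_end, rate=4.0, target=1.0):
    """Toy stand-in for the looped block (hypothetical, not the real model).

    `rate` plays the role of the shared weights: it is identical at every step.
    The step receives its interval (t_start, t_end) and moves each feature
    toward `target` by an amount that depends on the interval length.
    """
    decay = math.exp(-rate * (t_end - t_start))
    return [target + (x - target) * decay for x in state]

def refine(state, num_steps):
    """Apply the same block num_steps times and keep every intermediate state."""
    ts = time_grid(num_steps)
    states = [list(state)]
    for k in range(num_steps):
        state = shared_block(state, ts[k], ts[k + 1])
        states.append(state)
    return states

def sample_num_steps(k_min, k_max, rng):
    """Draw one step count per training batch (uniform sampling is an assumption)."""
    return rng.randint(k_min, k_max)

rng = random.Random(0)
K = sample_num_steps(2, 16, rng)      # training: K changes from batch to batch
states = refine([0.0, 0.5, 2.0], 16)  # inference: choose any K in the trained range
errors = [max(abs(x - 1.0) for x in s) for s in states]
```

In this toy, `errors` shrinks at every iteration, which loosely mimics decoding intermediate states of a single forward pass. Whether and how fast the real model improves with $K$ is an empirical question answered by the benchmarks, not by this sketch.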

Results

At 117M parameters, Déjà View achieves the best inlier ratio and pose AUC@30°, averaged over the five benchmarks, with the smallest parameter count of any compared method. Bubble area is proportional to parameter count.

Plot labels (method, parameter count): Pi3 (959M), VGGT (1257M), DA3-L (356M), DA3-G (1201M), Déjà View (117M).
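For readers unfamiliar with the pose AUC@30° metric reported above, the sketch below shows one common way to compute it. The exact evaluation protocol is not stated on this page, so the details are assumptions: each input value is taken to be a per-pair angular error in degrees (often the larger of the rotation and translation angular errors), and accuracy is accumulated over 1° bins up to the threshold. The function name `pose_auc` is hypothetical.

```python
def pose_auc(errors_deg, max_threshold=30):
    """Area under the accuracy-vs-threshold curve, normalised to [0, 1].

    errors_deg: per-pair angular errors in degrees (assumed input format).
    max_threshold: upper threshold in whole degrees, 30 for AUC@30.
    """
    n = len(errors_deg)
    if n == 0:
        raise ValueError("need at least one error value")
    # Histogram with 1-degree bins over [0, max_threshold).
    counts = [0] * max_threshold
    for e in errors_deg:
        if 0 <= e < max_threshold:
            counts[int(e)] += 1
    # Sum the cumulative fraction of pairs whose error falls in or below each bin.
    auc, cumulative = 0.0, 0
    for c in counts:
        cumulative += c
        auc += cumulative / n
    return auc / max_threshold

example = pose_auc([2.5, 12.0, 40.0])  # one pair exceeds 30 degrees and never counts
```

Under this formulation, a method whose errors are all below 1° scores 1.0, and errors at or above the threshold contribute nothing.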

Qualitative comparison

The left viewer shows Déjà View. Use the dropdown on the right to compare against recent feed-forward baselines on the same sequence, and pick a different example using the strip below. All sequences are visualized after discarding the bottom 25% of points by predicted confidence (if available).
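The confidence filtering mentioned above can be sketched as follows. This is a minimal illustration, not the viewer's actual code: the function name `filter_by_confidence`, the list-of-tuples point format, and the behaviour of keeping everything when no confidence is available are all assumptions.

```python
def filter_by_confidence(points, confidences=None, drop_fraction=0.25):
    """Drop the least-confident `drop_fraction` of points, preserving order.

    If `confidences` is None (the method predicts no confidence),
    all points are kept unchanged.
    """
    if confidences is None:
        return list(points)
    if len(points) != len(confidences):
        raise ValueError("points and confidences must have the same length")
    num_drop = int(len(points) * drop_fraction)
    # Indices ordered from least to most confident.
    order = sorted(range(len(points)), key=lambda i: confidences[i])
    # Keep the most confident ones and restore the original ordering.
    keep = sorted(order[num_drop:])
    return [points[i] for i in keep]

points = [(0, 0, 1), (1, 0, 2), (0, 1, 3), (1, 1, 4)]
confidences = [0.9, 0.1, 0.6, 0.8]
kept = filter_by_confidence(points, confidences)
```

With four points and a 25% drop fraction, exactly one point (the one with confidence 0.1) is removed.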


Citation

@misc{burzio2026dejaview,
  title         = {D\'ej\`a View: Looping Transformers for Multi-View 3D Reconstruction},
  author        = {Burzio, Alessandro and Fischer, Tobias and Elflein, Sven and Zhou, Qunjie and de Lutio, Riccardo and Ren, Jiawei and Huang, Jiahui and Huang, Shengyu and Pollefeys, Marc and Leal-Taix\'e, Laura and Gojcic, Zan and Turki, Haithem},
  year          = {2026},
  eprint        = {2605.30215},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2605.30215},
}

Acknowledgements

We thank our colleagues at NVIDIA for valuable discussions and feedback.