Déjà View: Looping Transformers for Multi-View 3D Reconstruction

1 NVIDIA; 2 University of Modena and Reggio Emilia, AImageLab; 3 University of Toronto, Vector Institute; 4 ETH Zürich. *Equal contribution. Equal supervision.

TL;DR

Déjà View reconstructs camera poses and dense geometry from any number of views by applying the same transformer block in a loop. A single checkpoint covers a range of step counts, so users can dial compute up or down at inference time. At only 117M parameters, Déjà View beats much larger feed-forward baselines (π³, Depth Anything 3-G) on five reconstruction benchmarks spanning indoor and outdoor scenes, with 8–10× fewer parameters and 1.9–2.3× less compute.


Abstract

Recent feed-forward 3D reconstruction transformers have scaled to over a billion parameters, following the broader trend of increasing model capacity in computer vision. Yet emerging evidence suggests that contiguous transformer layers often behave like repeated applications of similar operations, and that multi-view reconstruction transformers refine their predictions progressively across decoder depth. We posit that model depth partially buys iteration, paid for inefficiently in unique parameters, and instead make that iteration explicit in the architecture.

Our model, Déjà View, applies a single looped transformer block recurrently to per-view features for K refinement steps. Trained once, it exposes K as an inference-time compute knob, matching or outperforming substantially larger feed-forward baselines across five reconstruction benchmarks spanning indoor, outdoor, object-centric, and driving scenes, while using a fraction of their parameters and comparable or lower compute. Importantly, the same looped block formulation outperforms an otherwise identical variant with independent per-step parameters under matched training data and compute, suggesting that explicit iteration is not merely a compute-efficient substitute for capacity but a stronger inductive bias for multi-view 3D reconstruction.

Déjà View. Given multiple input views (top-left), Déjà View reconstructs camera poses and consistent depth by repeatedly applying the same transformer block, with the number of refinement steps K exposed as an inference-time compute knob. Decoding the intermediate state of a single K=16 forward pass at iterations k ∈ {2, 4, 8, 16} shows progressively sharper geometry and more accurate camera poses (right; frustums are colored by per-camera error). Across five benchmarks (bottom-left), Déjà View matches or surpasses much larger feed-forward baselines at a small fraction of their parameter count (dot area).

Déjà View teaser: iterative refinement of camera poses and depth across K=2,4,8,16 steps.

Interactive viewer

Slide the iteration control inside the viewer to watch the same forward pass refine itself. Frustums are colored by per-camera pose error after Sim(3) alignment to ground-truth poses.

Method

Déjà View initializes per-view features from a pretrained DINOv2 encoder and applies a single transformer block, composed of frame and global attention sub-blocks, recurrently K times to refine the state, with each application conditioned on its continuous time interval (t_k, t_{k+1}). Because K is sampled per batch from [K_min, K_max] during training, one trained checkpoint covers any step count in that range at inference. Two lightweight heads then decode the final state into per-view depth and ray maps.
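The control flow described above can be sketched in plain Python. This is a toy illustration, not the authors' implementation: `toy_block`, `run_looped`, `sample_num_steps`, the uniform time grid t_k = k/K, the scalar update rule, and the range 2 to 16 are all assumptions made for the example. In the real model the block is a transformer operating on per-view features.

```python
import math
import random


def time_grid(num_steps):
    """Assumed uniform grid on [0, 1]: t_k = k / K (the grid is not specified here)."""
    return [k / num_steps for k in range(num_steps + 1)]


def toy_block(state, t_start, t_end):
    """Hypothetical stand-in for the shared block.

    Takes one explicit Euler step of ds/dt = -s over (t_start, t_end), so the
    update depends on the time interval it is conditioned on. Only the
    interface (state plus interval in, refined state out) is meant to carry over.
    """
    dt = t_end - t_start
    return [s - dt * s for s in state]


def run_looped(block, state, num_steps):
    """Apply the SAME block num_steps times, passing each step its interval."""
    grid = time_grid(num_steps)
    for k in range(num_steps):
        state = block(state, grid[k], grid[k + 1])
    return state


def sample_num_steps(k_min, k_max, rng):
    """Training-time choice of K: one draw per batch from [k_min, k_max]."""
    return rng.randint(k_min, k_max)


if __name__ == "__main__":
    rng = random.Random(0)
    print("K sampled for this batch:", sample_num_steps(2, 16, rng))

    # Inference: the same block is reused for every step count.
    exact = math.exp(-1.0)  # exact solution of the toy dynamics at t = 1
    for num_steps in (2, 4, 8, 16):
        final = run_looped(toy_block, [1.0], num_steps)[0]
        print(num_steps, "steps, error:", round(abs(final - exact), 4))
```

In this toy, running more steps brings the result closer to the exact value without changing any parameters, which loosely mirrors the idea of K as an inference-time compute knob.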

Method overview: V images encoded by a shared DINOv2 backbone, then a single looped transformer block applied K times, decoded into depth and ray maps.

Results

At 117M parameters, the smallest count of any compared model, Déjà View achieves the highest inlier ratio and pose AUC@30° averaged over the five benchmarks. Bubble area is proportional to parameter count.
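This page does not define pose AUC@30°. As a rough guide, one convention commonly used in relative-pose evaluation computes, for each 1° threshold up to 30°, the fraction of image pairs whose angular pose error falls below it, and then averages those fractions. The sketch below follows that convention. The function name, the 1° bin width, and the input format (one combined error per pair) are assumptions, and the paper's exact protocol may differ.

```python
def pose_auc(errors_deg, max_threshold=30):
    """Area under the accuracy-vs-threshold curve, using 1-degree bins.

    errors_deg: one angular pose error per image pair, in degrees
    (assumed to already combine rotation and translation error).
    """
    if not errors_deg:
        raise ValueError("need at least one pose error")
    # Histogram over bins [0, 1), [1, 2), ..., [max_threshold - 1, max_threshold).
    counts = [0] * max_threshold
    for err in errors_deg:
        bin_index = int(err)
        if err >= 0 and bin_index < max_threshold:
            counts[bin_index] += 1
    # Cumulative fraction of pairs below each threshold, averaged over thresholds.
    total = len(errors_deg)
    running = 0
    accuracy_sum = 0.0
    for count in counts:
        running += count
        accuracy_sum += running / total
    return accuracy_sum / max_threshold
```

Under this convention a method whose errors are all below 1° scores 1.0, and errors above 30° contribute nothing.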

Results plots: parameters (log scale) versus average inlier ratio (%) and versus average pose AUC@30° (%), with bubble area proportional to parameters. Models: Pi3 (959M), VGGT (1257M), DA3-L (356M), DA3-G (1201M), Déjà View (117M).
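As a quick check of the parameter-efficiency claim, the parameter counts listed for the plot can be divided by Déjà View's 117M. The snippet is only arithmetic on numbers shown on this page; the dictionary and variable names are illustrative.

```python
# Parameter counts in millions, as listed in the plot labels on this page.
PARAMS_M = {
    "Pi3": 959,
    "VGGT": 1257,
    "DA3-L": 356,
    "DA3-G": 1201,
    "Déjà View": 117,
}

ours = PARAMS_M["Déjà View"]
ratios = {
    name: count / ours
    for name, count in PARAMS_M.items()
    if name != "Déjà View"
}

# Print baselines from smallest to largest relative size.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name}: {ratio:.1f}x the parameters of Déjà View")
```

Pi3 and DA3-G come out at roughly 8.2× and 10.3×, consistent with the 8–10× range quoted in the TL;DR.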

Qualitative comparison

The left viewer shows Déjà View. Use the dropdown on the right to compare against recent feed-forward baselines on the same sequence, and pick a different example using the strip below. All sequences are visualized after discarding the bottom 25% of points by predicted confidence (where available).
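The confidence filter mentioned above amounts to dropping the least confident quarter of the points before display. A minimal sketch, assuming points and confidences arrive as parallel Python lists; the function name and the rank-based tie handling are choices made for this example, not details taken from the viewer.

```python
def drop_low_confidence(points, confidences, drop_fraction=0.25):
    """Return the points that remain after removing the least confident ones."""
    if len(points) != len(confidences):
        raise ValueError("points and confidences must have the same length")
    num_drop = int(len(points) * drop_fraction)
    # Indices ordered from least to most confident.
    order = sorted(range(len(points)), key=lambda i: confidences[i])
    dropped = set(order[:num_drop])
    # Preserve the original point order among the survivors.
    return [p for i, p in enumerate(points) if i not in dropped]


if __name__ == "__main__":
    pts = [(0, 0, 0), (1, 1, 1), (2, 2, 2), (3, 3, 3)]
    conf = [0.9, 0.1, 0.5, 0.7]
    print(drop_low_confidence(pts, conf))  # the point with confidence 0.1 is removed
```

Ranking rather than thresholding guarantees that a fixed share of points is removed even when many confidences are tied.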


Citation

@misc{burzio2026dejaview,
title = {D\'ej\`a View: Looping Transformers for Multi-View 3D Reconstruction},
author = {Burzio, Alessandro and Fischer, Tobias and Elflein, Sven and Zhou, Qunjie and de Lutio, Riccardo and Ren, Jiawei and Huang, Jiahui and Huang, Shengyu and Pollefeys, Marc and Leal-Taix\'e, Laura and Gojcic, Zan and Turki, Haithem},
year = {2026},
eprint = {2605.30215},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2605.30215},
}

Acknowledgements

We thank our colleagues at NVIDIA for valuable discussions and feedback.