TL;DR
Déjà View reconstructs camera poses and dense geometry from any number of views by applying the same transformer block in a loop. A single checkpoint covers a range of step counts, so users can dial compute up or down at inference time. At only 117M parameters, Déjà View beats much larger feed-forward baselines (π³, Depth Anything 3 — G) on five reconstruction benchmarks spanning indoor and outdoor scenes, while using 8–10× fewer parameters and 1.9–2.3× less compute.
Abstract
Recent feed-forward 3D reconstruction transformers have scaled to over a billion parameters, following the broader trend of increasing model capacity in computer vision. Yet emerging evidence suggests that contiguous transformer layers often behave like repeated applications of similar operations, and multi-view reconstruction transformers refine their predictions progressively across decoder depth. We posit that model depth partially buys iteration, paid for inefficiently in unique parameters, and instead make that iteration explicit in architecture.
Our model, Déjà View, applies a single looped transformer block recurrently to per-view features for K refinement steps. Trained once, it exposes K as an inference-time compute knob, matching or outperforming substantially larger feed-forward baselines across five reconstruction benchmarks spanning indoor, outdoor, object-centric, and driving scenes, while using a fraction of their parameters and comparable or lower compute. Importantly, the same looped block formulation outperforms an otherwise identical variant with independent per-step parameters under matched training data and compute, suggesting that explicit iteration is not merely a compute-efficient substitute for capacity but a stronger inductive bias for multi-view 3D reconstruction.
Déjà View. Given multiple input views (top-left), Déjà View reconstructs camera poses and consistent depth by repeatedly applying the same transformer block, with the number of refinement steps K exposed as an inference-time compute knob. Decoding the intermediate state of a single K=16 forward pass at selected iterations k ≤ 16 shows progressively sharper geometry and more accurate camera poses (right; frustums are colored by per-camera error). Across five benchmarks (bottom-left), Déjà View matches or surpasses much larger feed-forward baselines at a small fraction of their parameter count (dot area).
Interactive viewer
Slide the iteration control inside the viewer to watch the same forward pass refine itself. Frustums are colored by per-camera pose error after Sim(3) alignment to ground-truth poses.
Method
Déjà View initializes per-view features from a pretrained DINOv2 encoder and applies a single transformer block, composed of frame and global attention sub-blocks, recurrently K times to refine the state, with each application conditioned on its continuous time interval. Because K is sampled per batch from a range of step counts during training, one trained checkpoint covers any step count in that range at inference. Two lightweight heads then decode the final state into per-view depth and ray maps.
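The loop described above can be sketched in a few lines. What follows is a hypothetical toy in plain Python, not the authors' implementation: a tanh residual update stands in for the frame and global attention sub-blocks, and the function names (make_block, apply_block, refine, sample_num_steps), the interval convention [k/K, (k+1)/K], and the 2 to 16 sampling range in the usage lines are all assumptions made for illustration.

```python
import math
import random


def make_block(dim, seed=0):
    """Create ONE set of toy weights; the same weights serve every step."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]


def apply_block(weights, state, t_start, t_end):
    """One refinement step over all tokens.

    Assumption: the step is conditioned on its interval [t_start, t_end]
    by shifting the pre-activation with t_start and scaling the residual
    update by the interval length. The real block uses attention instead.
    """
    dt = t_end - t_start
    new_state = []
    for token in state:
        update = [
            math.tanh(sum(w * x for w, x in zip(row, token)) + t_start)
            for row in weights
        ]
        new_state.append([x + dt * u for x, u in zip(token, update)])
    return new_state


def refine(weights, state, num_steps):
    """Apply the shared block num_steps times over the unit time interval."""
    for k in range(num_steps):
        state = apply_block(weights, state, k / num_steps, (k + 1) / num_steps)
    return state


def sample_num_steps(rng, k_min, k_max):
    """Draw a step count for one training batch (range is hypothetical)."""
    return rng.randint(k_min, k_max)


rng = random.Random(0)
weights = make_block(dim=4)
state = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.2, 0.5]]  # two toy tokens
train_steps = sample_num_steps(rng, 2, 16)  # drawn once per training batch
coarse = refine(weights, state, 4)    # cheaper inference
fine = refine(weights, state, 16)     # more compute, same weights
```

The point of the sketch is that the parameter count is fixed by make_block and does not grow with the step count, which is what lets the step count act as a compute knob at inference.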
Results
At 117M parameters, Déjà View achieves the best average inlier ratio and pose AUC@30° across all five benchmarks with the smallest parameter count of any compared method. Bubble area is proportional to parameter count.
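This page does not define pose AUC@30°, so the sketch below shows one common convention rather than the exact evaluation protocol used here. It assumes that each camera pair is scored by the larger of its rotation and translation angular errors, and that accuracy is averaged over integer thresholds from 1° to 30°. The function names are hypothetical.

```python
def pair_error(rotation_error_deg, translation_error_deg):
    """Assumed per-pair error: the worse of the two angular errors."""
    return max(rotation_error_deg, translation_error_deg)


def pose_auc(errors_deg, max_threshold_deg=30):
    """Mean accuracy over integer thresholds 1..max_threshold_deg degrees.

    For each threshold, accuracy is the fraction of pairs whose error is
    strictly below it; the AUC is the average of those accuracies.
    """
    if not errors_deg:
        raise ValueError("need at least one pose error")
    accuracies = []
    for threshold in range(1, max_threshold_deg + 1):
        correct = sum(1 for e in errors_deg if e < threshold)
        accuracies.append(correct / len(errors_deg))
    return sum(accuracies) / len(accuracies)


# Example: two camera pairs, one accurate and one far off.
errors = [pair_error(0.3, 0.5), pair_error(80.0, 100.0)]
auc_at_30 = pose_auc(errors, max_threshold_deg=30)
```

Other evaluation codebases integrate the exact recall curve instead of averaging over fixed thresholds, so the numbers from this sketch are not guaranteed to match reported values.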
Qualitative comparison
Left viewer shows Déjà View. Use the dropdown on the right to compare against recent feed-forward baselines on the same sequence. Pick a different example using the strip below. All sequences are visualized discarding the bottom 25% of points by predicted confidence (if available).
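The confidence filter mentioned above can be sketched as follows. This is a minimal illustration in plain Python, assuming points and confidences are parallel lists; the function name drop_low_confidence and the tie-breaking behavior are assumptions, not details taken from the viewer's code.

```python
def drop_low_confidence(points, confidences=None, drop_fraction=0.25):
    """Discard the lowest-confidence fraction of points.

    If no confidences are available, every point is kept, mirroring the
    "if available" caveat. The surviving points keep their original order.
    """
    if confidences is None:
        return list(points)
    if len(points) != len(confidences):
        raise ValueError("points and confidences must have equal length")
    num_drop = int(len(points) * drop_fraction)
    # Indices sorted from least to most confident; the first num_drop go.
    order = sorted(range(len(points)), key=lambda i: confidences[i])
    dropped = set(order[:num_drop])
    return [p for i, p in enumerate(points) if i not in dropped]


# Example: 8 points, so the 2 least confident are removed.
points = [(float(i), 0.0, 1.0) for i in range(8)]
confidences = [0.9, 0.1, 0.8, 0.2, 0.7, 0.6, 0.5, 0.4]
kept = drop_low_confidence(points, confidences)
```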
Citation
@misc{burzio2026dejaview,
  title = {D\'ej\`a View: Looping Transformers for Multi-View 3D Reconstruction},
  author = {Burzio, Alessandro and Fischer, Tobias and Elflein, Sven and Zhou, Qunjie and de Lutio, Riccardo and Ren, Jiawei and Huang, Jiahui and Huang, Shengyu and Pollefeys, Marc and Leal-Taix\'e, Laura and Gojcic, Zan and Turki, Haithem},
  year = {2026},
  eprint = {2605.30215},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url = {https://arxiv.org/abs/2605.30215},
}
Acknowledgements
We thank our colleagues at NVIDIA for valuable discussions and feedback.