Abstract
We present a scalable 3D reconstruction model that addresses a critical limitation of offline feed-forward methods: their computational and memory requirements grow quadratically with the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training (TTT). Our method (Visual Geometry Grounded Test-Time Training) scales linearly with the number of input views, similar to online models, and achieves a speed-up over baselines that rely on softmax attention, reconstructing an image collection in just seconds. Because our method retains global scene aggregation capability, its point-map reconstruction error is comparable to VGGT's.
Qualitative comparisons
We provide an interactive comparison of reconstructions obtained on an image sequence of 1,000 images using VGGT, TTT3R, and our method.
While TTT3R fails to reconstruct the scene completely, our method provides a complete reconstruction. VGGT's quality is slightly higher; however, it takes more than 11x longer to reconstruct the scene.
This highlights the effectiveness of our linear-time global-attention replacement based on test-time training (TTT).
Feed-forward visual localization
Unlike VGGT, our network can be queried with new observations after processing a set of images representing a scene: it outputs the scene geometry and the camera pose of the new image relative to the existing reconstruction.
To do so, we keep the test-time-optimized weights frozen and run a standard forward pass for the new query image, with one key modification: in the global attention layers, we only apply the frozen MLPs to the query features to retrieve information from the scene representation, without updating the MLP parameters.
This effectively transforms the model into a single-image transformer for query processing.
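A minimal sketch of this frozen-readout query, assuming a generic fast-weight MLP standing in for the test-time-optimized scene representation (all names and dimensions here are illustrative, not the actual implementation):

```python
import torch
import torch.nn as nn

def localize_query(query_tokens: torch.Tensor,
                   q_proj: nn.Module,
                   fast_mlp: nn.Module) -> torch.Tensor:
    """Retrieve scene information for a new image without updating state.

    `fast_mlp` holds the test-time-optimized weights (the compressed KV
    space). We only read from it, so the stored reconstruction stays
    fixed and the query behaves like a single-image forward pass.
    """
    with torch.no_grad():          # frozen weights: no TTT step for queries
        q = q_proj(query_tokens)   # project the new image's features to queries
        return fast_mlp(q)         # read out from the scene representation

# Illustrative usage with placeholder modules and feature sizes:
tokens = torch.randn(8, 64)  # features of one query image
q_proj = nn.Linear(64, 64)
fast_mlp = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
out = localize_query(tokens, q_proj, fast_mlp)
```

Because the readout never touches the MLP parameters, repeated queries leave the scene representation bit-for-bit unchanged, which is what makes streaming localization against a fixed reconstruction possible.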
Here we show that this mechanism enables real-time visual localization (10 fps) for unseen query images, shown in the top-left corner.
Method
Our method replaces the global attention block in VGGT [1] (left) with a linear-time alternative based on test-time training (TTT) (right), which compresses the KV space into a fixed-size MLP following [2, 3].
We linearize the existing VGGT checkpoint, loading pre-trained weights for almost all layers and fine-tuning only the global attention layers. We find that this linearization is crucial for achieving good performance.
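To make the mechanism concrete, here is a rough PyTorch sketch of a TTT-style replacement for global attention, in the spirit of [2, 3]. All names, dimensions, the two-layer fast-weight MLP, and the plain gradient-descent update on an L2 reconstruction loss are our simplifying assumptions, not the actual implementation:

```python
import torch
import torch.nn as nn

class TTTAttention(nn.Module):
    """Linear-time stand-in for softmax global attention (illustrative sketch).

    The KV cache is compressed into the weights of a small two-layer MLP
    (the "fast weights"): each TTT step takes one gradient step on the
    reconstruction loss ||MLP(k) - v||^2, then the output is read out by
    applying the updated MLP to the queries. Cost is linear in the number
    of tokens, since no NxN attention matrix is ever formed.
    """

    def __init__(self, dim: int, hidden: int = 64, lr: float = 0.1):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.lr = lr
        # Initial fast weights (learned during training, copied at test time).
        self.W1 = nn.Parameter(torch.randn(dim, hidden) / dim ** 0.5)
        self.W2 = nn.Parameter(torch.randn(hidden, dim) / hidden ** 0.5)

    @staticmethod
    def fast_mlp(x, W1, W2):
        return torch.relu(x @ W1) @ W2

    def forward(self, tokens: torch.Tensor, steps: int = 1):
        q, k, v = self.q_proj(tokens), self.k_proj(tokens), self.v_proj(tokens)
        W1, W2 = self.W1.clone(), self.W2.clone()
        for _ in range(steps):  # test-time training: fit MLP(k) to v
            W1 = W1.detach().requires_grad_(True)
            W2 = W2.detach().requires_grad_(True)
            loss = ((self.fast_mlp(k, W1, W2) - v) ** 2).mean()
            g1, g2 = torch.autograd.grad(loss, (W1, W2))
            W1, W2 = W1 - self.lr * g1, W2 - self.lr * g2
        # Read out: apply the updated fast-weight MLP to the queries.
        return self.fast_mlp(q, W1, W2), (W1.detach(), W2.detach())
```

Note that the updated fast weights `(W1, W2)` are returned alongside the output; they are the fixed-size scene representation that later queries can read from without further updates.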
Test-time scaling for length generalization
When using TTT, we observe a large degradation in reconstruction performance when processing out-of-distribution sequence lengths (Fig. (b), blue curve).
Our investigation shows that one optimizer step is sufficient for image-collection sizes seen during training, while for 1k images it is beneficial to increase the number of optimizer steps. By simply performing two steps (vs. one at training time), we achieve almost perfect length generalization (Fig. (b), orange curve).
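The resulting test-time scaling rule fits in a few lines; the threshold below is a placeholder, not the training sequence length actually used:

```python
def num_ttt_steps(num_images: int, trained_max_len: int = 256) -> int:
    """Choose the number of TTT optimizer steps (illustrative rule).

    For collection sizes seen during training, one step suffices; for
    out-of-distribution lengths, a second step restores reconstruction
    quality. `trained_max_len` is a hypothetical placeholder value.
    """
    return 1 if num_images <= trained_max_len else 2
```

Since each extra step only re-runs the fast-weight update, the overall cost remains linear in the number of input views.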
[1] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual Geometry Grounded Transformer. In CVPR, 2025.
[2] Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (Learn at Test Time): RNNs with Expressive Hidden States. arXiv preprint arXiv:2407.04620, 2024.
[3] Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T. Freeman, and Hao Tan. Test-Time Training Done Right. arXiv preprint arXiv:2505.23884, 2025.