Spatial Intelligence Lab NVIDIA Research

ArtiFixer: Enhancing and Extending 3D Reconstruction
with Auto-Regressive Diffusion Models

1NVIDIA · 2ETH Zurich · 3Cornell University · 4University of Toronto · 5Vector Institute
* Equal Contribution
SIGGRAPH 2026

Abstract


Per-scene optimization methods such as 3D Gaussian Splatting provide state-of-the-art novel view synthesis quality but extrapolate poorly to under-observed areas. Methods that leverage generative priors to correct artifacts in these areas hold promise but currently suffer from two shortcomings. The first is scalability, as existing methods use image diffusion models or bidirectional video models that are limited in the number of views they can generate in a single pass (and thus require a costly iterative distillation process for consistency). The second is quality itself, as generators used in prior work tend to produce outputs that are inconsistent with existing scene content and fail entirely in completely unobserved regions.

To address these shortcomings, we propose a two-stage pipeline built on two key insights. First, we train a powerful bidirectional generative model with a novel opacity mixing strategy that encourages consistency with existing observations while retaining the model's ability to extrapolate novel content in unseen areas. Second, we distill it into a causal auto-regressive model that generates hundreds of frames in a single pass. This model can directly produce novel views or serve as pseudo-supervision to improve the underlying 3D representation in a simple and highly efficient manner. We evaluate our method extensively and demonstrate that it can generate plausible reconstructions in scenarios where existing approaches fail completely. When measured on commonly benchmarked datasets, we outperform all existing baselines by a wide margin, exceeding prior state-of-the-art methods by 1–3 dB PSNR.

Method


ArtiFixer is a two-stage pipeline. In Phase I, we finetune a bidirectional video diffusion model using an opacity mixing strategy: rather than starting from pure noise or directly from degraded renderings, we encode the input RGB into latent space and mix with Gaussian noise using the rendered opacity maps. This encourages the model to remain consistent with existing scene content while retaining full generative capability in unseen regions. We additionally inject fine-grained opacity information and camera control signals, along with clean reference views and an optional text prompt.
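The opacity mixing step can be sketched as a per-pixel blend between the encoded rendering and Gaussian noise, weighted by the rendered opacity. This is a minimal sketch under our own assumptions (a simple linear blend with the opacity map broadcast over latent channels; the paper's exact noising schedule may differ), with `opacity_mix` as a hypothetical helper name:

```python
import numpy as np

def opacity_mix(latent, opacity, rng):
    """Blend an encoded rendering with Gaussian noise per pixel.

    latent:  (C, H, W) latent encoding of the degraded input rendering
    opacity: (H, W) rendered opacity in [0, 1]; ~1 where the scene is
             well observed, ~0 in empty or unseen regions
    """
    noise = rng.standard_normal(latent.shape)
    alpha = opacity[None, :, :]  # broadcast opacity over latent channels
    # High opacity -> keep the rendering latent (stay consistent with
    # observed content); low opacity -> pure noise, leaving the model
    # free to generate novel content in unseen regions.
    return alpha * latent + (1.0 - alpha) * noise
```

The intent is that fully observed regions pass through unchanged, so the model has no incentive to alter them, while fully unobserved regions look like an ordinary diffusion starting point.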

In Phase II, we distill the bidirectional teacher into a causal autoregressive model via Self-Forcing-style DMD distillation. The resulting model generates hundreds of frames in a single pass, which can be used directly for novel view synthesis or as pseudo-supervision to improve the underlying 3D representation.
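The causal rollout can be illustrated with a schematic chunked generation loop: each chunk is conditioned only on previously generated frames, so arbitrarily long trajectories come from a single forward sweep rather than bidirectional joint denoising. This is a structural sketch only; `generate_chunk` is a hypothetical stand-in for the distilled model, and the real system operates on latents with camera conditioning:

```python
import numpy as np

def rollout(generate_chunk, first_frame, n_frames, chunk=8, context=4):
    """Autoregressively generate `n_frames` frames in fixed-size chunks,
    conditioning each chunk on the tail of what was generated so far."""
    frames = [first_frame]
    while len(frames) < n_frames:
        ctx = np.stack(frames[-context:])  # causal context window
        new = generate_chunk(ctx, chunk)   # (chunk, H, W, C) new frames
        frames.extend(list(new))
    return np.stack(frames[:n_frames])
```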

Figure: ArtiFixer method overview.

MipNeRF 360 Comparisons


We render novel orbit trajectories and compare ArtiFixer3D+ to its base 3DGUT rendering, GenFusion, and GSFixer on all scenes in MipNeRF 360's most challenging 3-view split. To our knowledge, our quality exceeds that of all previously published work.

DL3DV Comparisons


We compare our method to a variety of generative baselines on sparse reconstructions from the DL3DV-10K dataset.

We compare ArtiFixer3D+ on DL3DV to 3DGUT and two baselines that build upon bidirectional video diffusion models. GenFusion's base model generates 16 frames at a time, requiring an iterative distillation process that leads to blurry results, especially in empty areas. Gen3C's renderings are sharper but often do not respect the source content. Our method reconstructs plausible and consistent geometry even when the initial rendering is highly degraded.

Nerfbusters Comparisons


As in the other datasets, our method is the only one that can generate plausible visuals in unobserved areas while respecting source fidelity.

Conditioning


We drop the initial rendering condition, forcing the model to reconstruct the scene from the reference views. Although fidelity drops somewhat, the high-level structure of the scene remains intact along with the correct camera motion.


ArtiFixer retains a strong generative ability thanks to opacity mixing and training dropout, and is able to generate videos from text prompts alone, similar to its base model.

Prompt: A bronze statue of two children standing back to back on a stone pedestal inside a modern exhibition hall. The taller child, wearing a dress and carrying a backpack, has one hand resting on the smaller child's shoulder…

ArtiFixer Variants


We evaluate three variants: ArtiFixer, which directly renders novel views from the auto-regressive generator; ArtiFixer3D, which distills its outputs back into the underlying 3D representation; and ArtiFixer3D+, which re-applies the auto-regressive model as post-processing on top of ArtiFixer3D (as in Difix3D+). All variants produce similar renderings: ArtiFixer's are slightly sharper, ArtiFixer3D's are more consistent with source images at the cost of some blurriness, and ArtiFixer3D+ restores sharpness while remaining highly consistent.
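Very schematically, the relationship between the three variants can be expressed as a small dispatch function. All names here (`run_variant`, `generate`, `distill`) are hypothetical placeholders for illustration, not the actual implementation:

```python
def run_variant(variant, scene, generate, distill, trajectory):
    """Schematic view of the three ArtiFixer variants.

    generate: the auto-regressive generator, applied to renderings
    distill:  optimizes the 3D scene against generated pseudo-views
    """
    views = generate(scene.render(trajectory))       # ArtiFixer: direct generation
    if variant == "ArtiFixer":
        return views
    scene3d = distill(scene, views)                  # ArtiFixer3D: bake into 3D
    if variant == "ArtiFixer3D":
        return scene3d.render(trajectory)
    return generate(scene3d.render(trajectory))      # ArtiFixer3D+: post-process
```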

Denoising Steps


As our method starts from renderings instead of pure noise, it can generate plausible visuals in fewer than four denoising steps in most cases, though sharpness and temporal consistency suffer somewhat in empty areas. We compare different denoising step counts on a slightly shifted trajectory distilled into the representation (ArtiFixer3D). Renderings are generally stable across step counts, apart from minor changes near the previously unexplored periphery.
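Starting from a rendering rather than pure noise amounts to SDEdit-style sampling: noise the input to an intermediate time and integrate only the remainder of the trajectory. The sketch below assumes a rectified-flow parameterization where `denoiser(x, t)` predicts a velocity field; the actual model and schedule may differ:

```python
import numpy as np

def few_step_denoise(denoiser, rendering, n_steps, t_start=0.4, rng=None):
    """Noise the degraded rendering to an intermediate time t_start,
    then take only a few Euler steps toward t = 0, instead of
    integrating the full trajectory from pure noise (t = 1)."""
    if rng is None:
        rng = np.random.default_rng()
    # Partially noised starting point: a mix of rendering and noise.
    x = (1 - t_start) * rendering + t_start * rng.standard_normal(rendering.shape)
    ts = np.linspace(t_start, 0.0, n_steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        x = x + (t1 - t0) * denoiser(x, t0)  # Euler step on the flow ODE
    return x
```

Because `t_start` is well below 1, far fewer steps are needed to reach a clean sample than when integrating from pure noise.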

Citation


@inproceedings{delutio2026artifixer,
    title={ArtiFixer: Enhancing and Extending 3D Reconstruction with Auto-Regressive Diffusion Models},
    author={de Lutio, Riccardo and Fischer, Tobias and Chang, Yen-Yu and Zhang, Yuxuan and
            Wu, Jay Zhangjie and Ren, Xuanchi and Shen, Tianchang and Tothova, Katarina and
            Gojcic, Zan and Turki, Haithem},
    booktitle={SIGGRAPH},
    year={2026}
}