Spatial Intelligence Lab, NVIDIA Research

DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer

1NVIDIA
2University of Toronto
3Cornell University
4Technion
*Equal Contribution
CVPR 2026

Abstract


Simulation is essential to the development and evaluation of autonomous robots such as self-driving vehicles. Neural reconstruction is emerging as a promising solution, as it enables simulating a wide variety of scenarios from real-world data alone in an automated and scalable way. However, while methods such as NeRF and 3D Gaussian Splatting can produce visually compelling results, they often exhibit artifacts, particularly when rendering novel views, and fail to realistically integrate inserted dynamic objects, especially when those objects were captured in different scenes. To overcome these limitations, we introduce DiffusionHarmonizer, an online generative enhancement framework that transforms renderings of such imperfect scenes into temporally consistent outputs while improving their realism. At its core is a single-step, temporally conditioned enhancer, converted from a pretrained multi-step image diffusion model, that is capable of running in online simulators on a single GPU. The key to training it effectively is a custom data curation pipeline that constructs synthetic–real pairs emphasizing appearance harmonization, artifact correction, and lighting realism. The result is a scalable system that significantly elevates simulation fidelity in both research and production environments.

Method


We convert a pretrained multi-step diffusion model into a single-step, temporally conditioned enhancer suitable for online use, supported by custom data-generation pipelines and training strategies that enforce stability. The diffusion backbone (the CosmosPredict2 0.6B text-to-image model) is fine-tuned on real-world and simulated training pairs generated by scalable data curation pipelines targeting appearance harmonization, artifact correction, and lighting realism.

Temporal consistency of the outputs is achieved through:

  1. Temporally consistent data: augmenting the training dataset with images drawn from fully consistent video sequences
  2. Conditioning on previous frames during training and inference
  3. A temporal Total Variation loss
  4. A two-stage training recipe that maintains quality and prevents drift: non-temporal training followed by temporal training
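The exact form of the temporal Total Variation loss is not specified here; a common formulation penalizes the mean absolute difference between consecutive frames of a clip. A minimal sketch, with `temporal_tv_loss` as a hypothetical name:

```python
import numpy as np

def temporal_tv_loss(frames: np.ndarray) -> float:
    """Temporal Total Variation loss (assumed formulation).

    frames: array of shape (T, H, W, C) holding T consecutive
    enhanced frames. Returns the mean absolute difference between
    each pair of adjacent frames; identical frames give 0.
    """
    diffs = np.abs(frames[1:] - frames[:-1])  # (T-1, H, W, C)
    return float(diffs.mean())
```

Adding this term to the training objective discourages frame-to-frame flicker without constraining the per-frame appearance.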

The method is application agnostic and can be deployed across different domains without modification.
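The previous-frame conditioning above amounts to an autoregressive loop at inference time: each rendered frame is enhanced conditioned on the enhancer's own recent outputs. A minimal sketch, where `enhance_clip`, `enhancer`, and `num_context` are hypothetical names rather than the actual implementation:

```python
def enhance_clip(frames, enhancer, num_context=1):
    """Autoregressively enhance a clip frame by frame.

    `enhancer` stands in for the single-step enhancer: it maps
    (current rendered frame, list of previously enhanced frames)
    to an enhanced frame. Conditioning each call on prior outputs
    is what discourages frame-to-frame flicker.
    """
    outputs = []
    context = []  # previously enhanced frames used as conditioning
    for frame in frames:
        enhanced = enhancer(frame, context)
        outputs.append(enhanced)
        context = (context + [enhanced])[-num_context:]
    return outputs
```

Because the enhancer is single-step, each frame costs one forward pass, which is what makes this loop viable inside an online simulator.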


Comparisons


We compare our method against state-of-the-art image and video editing baselines (SDEdit, Wan-video V2V) and harmonization baselines (VHTT, Ke et al.).


Acknowledgments


The authors would like to thank their NVIDIA colleagues Martin Antolini, Carlos Casanova, and Apurv Naman for their invaluable contributions to the data curation process.

Citation


@article{zhang2026diffusionharmonizer,
  title   = {DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer},
  author  = {Yuxuan Zhang and Katar\'{i}na T\'{o}thov\'{a} and Zian Wang and Kangxue Yin and Haithem Turki and Riccardo de Lutio and Yen-Yu Chang and Or Litany and Sanja Fidler and Zan Gojcic},
  journal = {arXiv preprint arXiv:2602.24096},
  year    = {2026},
  url     = {https://arxiv.org/abs/2602.24096},
}