Spatial Intelligence Lab, NVIDIA Research

DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer

1NVIDIA
2University of Toronto
3Cornell University
4Technion
*Equal Contribution
CVPR 2026

Abstract


Simulation is essential to the development and evaluation of autonomous robots such as self-driving vehicles. Neural reconstruction is emerging as a promising solution, as it enables simulating a wide variety of scenarios from real-world data alone in an automated and scalable way. However, while methods such as NeRF and 3D Gaussian Splatting can produce visually compelling results, they often exhibit artifacts, particularly when rendering novel views, and fail to realistically integrate inserted dynamic objects, especially when those objects were captured in different scenes. To overcome these limitations, we introduce DiffusionHarmonizer, an online generative enhancement framework that transforms renderings of such imperfect scenes into temporally consistent outputs while improving their realism. At its core is a single-step, temporally conditioned enhancer, converted from a pretrained multi-step image diffusion model, that is capable of running in online simulators on a single GPU. The key to training it effectively is a custom data curation pipeline that constructs synthetic–real pairs emphasizing appearance harmonization, artifact correction, and lighting realism. The result is a scalable system that significantly elevates simulation fidelity in both research and production environments.

Method


We convert a pretrained multi-step diffusion model into a single-step, temporally conditioned enhancer suitable for online use, supported by custom data-generation pipelines and training strategies that enforce stability. The diffusion backbone (the CosmosPredict2 0.6B text-to-image model) is fine-tuned on real-world and simulated training pairs generated by scalable data curation pipelines targeting appearance harmonization, artifact correction, and lighting realism.

Temporal consistency of the outputs is achieved through:

  1. Temporally consistent data: augmenting the training dataset with images drawn from fully consistent video sequences
  2. Conditioning on previous frames during training and inference
  3. A temporal Total Variation loss
  4. A two-stage training recipe that maintains quality and prevents drift: non-temporal training followed by temporal training
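The exact form of the temporal Total Variation loss is not specified here; a common formulation penalizes the mean absolute difference between consecutive frames of a clip. A minimal sketch, with `temporal_tv_loss` as a hypothetical name:

```python
import numpy as np

def temporal_tv_loss(frames: np.ndarray) -> float:
    """Temporal Total Variation loss (assumed formulation).

    frames: array of shape (T, H, W, C) holding T consecutive
    enhanced frames. Returns the mean absolute difference between
    each pair of adjacent frames; identical frames give 0.
    """
    diffs = np.abs(frames[1:] - frames[:-1])  # (T-1, H, W, C)
    return float(diffs.mean())
```

Adding this term to the training objective discourages frame-to-frame flicker without constraining the per-frame appearance.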

The method is application agnostic and can be deployed across different domains without modification.
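The previous-frame conditioning above amounts to an autoregressive loop at inference time: each rendered frame is enhanced conditioned on the enhancer's own recent outputs. A minimal sketch, where `enhance_clip`, `enhancer`, and `num_context` are hypothetical names rather than the actual implementation:

```python
def enhance_clip(frames, enhancer, num_context=1):
    """Autoregressively enhance a clip frame by frame.

    `enhancer` stands in for the single-step enhancer: it maps
    (current rendered frame, list of previously enhanced frames)
    to an enhanced frame. Conditioning each call on prior outputs
    is what discourages frame-to-frame flicker.
    """
    outputs = []
    context = []  # previously enhanced frames used as conditioning
    for frame in frames:
        enhanced = enhancer(frame, context)
        outputs.append(enhanced)
        context = (context + [enhanced])[-num_context:]
    return outputs
```

Because the enhancer is single-step, each frame costs one forward pass, which is what makes this loop viable inside an online simulator.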


Comparisons


We compare our method against state-of-the-art image and video editing baselines (SDEdit, Wan-video V2V) and harmonization baselines (VHTT, Ke et al.).


Acknowledgments


The authors would like to thank their NVIDIA colleagues Martin Antolini, Carlos Casanova, and Apurv Naman for their invaluable contributions to the data curation process.

Citation


@article{zhang2026diffusionharmonizer,
  title   = {DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer},
  author  = {Yuxuan Zhang and Katar\'{i}na T\'{o}thov\'{a} and Zian Wang and Kangxue Yin and Haithem Turki and Riccardo de Lutio and Yen-Yu Chang and Or Litany and Sanja Fidler and Zan Gojcic},
  journal = {arXiv preprint arXiv:2602.24096},
  year    = {2026},
  url     = {https://arxiv.org/abs/2602.24096},
}