Understanding and modeling lighting effects are fundamental tasks in computer vision and graphics. Classic physically-based rendering (PBR) accurately simulates light transport, but relies on precise scene representations—explicit 3D geometry, high-quality material properties, and lighting conditions—that are often impractical to obtain in real-world scenarios. Therefore, we introduce DiffusionRenderer, a neural approach that addresses the dual problem of inverse and forward rendering within a holistic framework. Leveraging powerful video diffusion model priors, the inverse rendering model accurately estimates G-buffers from real-world videos, providing an interface for image editing tasks and training data for the rendering model. Conversely, our rendering model generates photorealistic images from G-buffers without explicit light transport simulation. Experiments demonstrate that DiffusionRenderer effectively approximates inverse and forward rendering, consistently outperforming the state of the art. Our model enables practical applications from a single video input—including relighting, material editing, and realistic object insertion.
Method overview. Given an input video, the neural inverse renderer estimates geometry and material properties per pixel. It generates one scene attribute at a time, with a domain embedding indicating the target attribute to generate. Conversely, the neural forward renderer produces photorealistic images given lighting information, geometry, and material buffers. The lighting condition is injected into the base video diffusion model through cross-attention layers. During joint training with both synthetic and real data, we use an optimizable LoRA for the real data sources.
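To make this pipeline concrete, the sketch below chains the two models in Python: the inverse renderer is invoked once per scene attribute under a domain embedding, and the forward renderer consumes the resulting G-buffers with environment-map lighting supplied as a cross-attention condition. This is only an illustrative sketch; the class names, method names, and buffer list are assumptions, not the released API.

import torch

# Hypothetical per-pixel attributes produced by the inverse renderer.
GBUFFER_DOMAINS = ["normal", "depth", "albedo", "roughness", "metallic"]

def estimate_gbuffers(inverse_renderer, video: torch.Tensor) -> dict:
    """Run the neural inverse renderer once per scene attribute.
    A domain embedding tells the video diffusion backbone which buffer to generate.
    """
    gbuffers = {}
    for domain in GBUFFER_DOMAINS:
        domain_emb = inverse_renderer.embed_domain(domain)             # assumed helper
        gbuffers[domain] = inverse_renderer.sample(video, domain_emb)  # one attribute per pass
    return gbuffers

def render(forward_renderer, gbuffers: dict, env_map: torch.Tensor) -> torch.Tensor:
    """Neural forward rendering: photorealistic video from G-buffers plus lighting,
    with no explicit light transport simulation.
    """
    light_tokens = forward_renderer.encode_lighting(env_map)           # assumed helper
    return forward_renderer.sample(gbuffers, cross_attn_cond=light_tokens)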
DiffusionRenderer jointly considers the inverse and forward rendering problems, overcoming limitations of classic physically-based rendering (PBR) methods.
Classic PBR relies on explicit 3D geometry such as meshes. When such geometry is not available, screen-space ray tracing (SSRT) struggles to accurately reproduce shadows and reflections. Our forward renderer synthesizes photorealistic lighting effects without explicit path tracing or 3D geometry.
PBR is also sensitive to errors in G-buffers. SSRT with G-buffers estimated by state-of-the-art inverse rendering models often fails to deliver high-quality results. Our forward renderer is trained to tolerate noisy G-buffers.
Video generation from G-buffers. The forward renderer generates accurate shadows and reflections that are consistent across viewpoints. Notably, these lighting effects are synthesized entirely from an environment map, despite the input G-buffers containing no explicit shadow or reflection information.
Qualitative comparison of forward rendering. Our method generates high-quality inter-reflections and shadows, producing more accurate results than the neural baselines.
We demonstrate the effectiveness of our combined inverse and forward rendering model in the relighting task.
We use the G-buffers estimated by the inverse renderer to relight the scene under different lighting conditions, as sketched below.
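Under the same assumptions as the earlier sketch, relighting reduces to chaining the two models: estimate G-buffers from the input video, then re-render them under a new environment map. The relight helper below is hypothetical and reuses the estimate_gbuffers and render functions defined above.

def relight(inverse_renderer, forward_renderer, video, new_env_map):
    """Relight an input video by chaining inverse and forward rendering."""
    gbuffers = estimate_gbuffers(inverse_renderer, video)    # per-pixel geometry and materials
    return render(forward_renderer, gbuffers, new_env_map)   # resynthesize under the new lighting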
DiffusionRenderer: Neural Inverse and Forward Rendering with Video Diffusion Models
Ruofan Liang*, Zan Gojcic, Huan Ling, Jacob Munkberg, Jon Hasselgren, Zhi-Hao Lin, Jun Gao, Alexander Keller, Nandita Vijaykumar, Sanja Fidler, Zian Wang*
@article{DiffusionRenderer,
author = {Ruofan Liang and Zan Gojcic and Huan Ling and Jacob Munkberg and
Jon Hasselgren and Zhi-Hao Lin and Jun Gao and Alexander Keller and
Nandita Vijaykumar and Sanja Fidler and Zian Wang},
title = {DiffusionRenderer: Neural Inverse and Forward Rendering with Video Diffusion Models},
journal = {arXiv preprint arXiv:2501.18590},
year = {2025}
}
The authors thank Shiqiu Liu, Yichen Sheng, and Michael Kass for their insightful discussions that contributed to this project. We also appreciate the discussions with Xuanchi Ren, Tianchang Shen and Zheng Zeng during the model development process.