The ability to generate virtual environments is crucial for applications ranging from gaming to physical AI domains such as robotics, autonomous driving, and industrial AI. Current learning-based 3D reconstruction methods rely on captured real-world multi-view data, which is not always readily available. Recent advances in video diffusion models have shown remarkable imagination capabilities, yet their 2D nature limits their use in simulation, where a robot needs to navigate and interact with the environment. In this paper, we propose a self-distillation framework that distills the implicit 3D knowledge in video diffusion models into an explicit 3D Gaussian Splatting (3DGS) representation, eliminating the need for multi-view training data. Specifically, we augment the typical RGB decoder with a 3DGS decoder, which is supervised by the output of the RGB decoder. With this approach, the 3DGS decoder can be trained purely on synthetic data generated by video diffusion models. At inference time, our model can synthesize 3D scenes from either a text prompt or a single image for real-time rendering. Our framework further extends to dynamic 3D scene generation from a monocular input video. Experimental results show that our framework achieves state-of-the-art performance in static and dynamic 3D scene generation.
Our pipeline builds upon a camera-controlled video diffusion model (GEN3C) pre-trained on large-scale data. We train a 3D Gaussian Splatting (3DGS) decoder by aligning 2D renderings of the generated 3DGS scenes with the RGB-decoded outputs of the pre-trained video model. Only the 3DGS decoder is trained; the pre-trained autoencoder and diffusion model remain frozen. At inference time, we do not rely on the RGB decoder and use the 3DGS decoder directly. This allows us to distill a pre-trained multi-view diffusion model into a feed-forward 3DGS generator without constructing any ground-truth 3DGS data or using real-world multi-view data, as sketched below.
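The following PyTorch sketch illustrates one such self-distillation step under stated assumptions: diffusion.sample, rgb_decoder, gs_decoder, and renderer are hypothetical handles for the frozen camera-controlled diffusion model, the frozen RGB decoder, the trainable 3DGS decoder, and a differentiable Gaussian rasterizer; they are placeholders rather than the released API, and the single L1 term stands in for whatever reconstruction losses are used in practice.

import torch
import torch.nn.functional as F

@torch.no_grad()
def make_pseudo_targets(diffusion, rgb_decoder, prompt, cameras):
    # The frozen camera-controlled video diffusion model produces latents for
    # the requested camera trajectory; the frozen RGB decoder turns them into
    # frames that act as pseudo ground truth.
    latents = diffusion.sample(prompt, cameras)      # (T, C, h, w) video latents
    target_frames = rgb_decoder(latents)             # (T, 3, H, W) RGB frames
    return latents, target_frames

def distillation_step(gs_decoder, renderer, diffusion, rgb_decoder,
                      prompt, cameras, optimizer):
    # One self-distillation step: gradients flow only into gs_decoder.
    latents, target_frames = make_pseudo_targets(diffusion, rgb_decoder,
                                                 prompt, cameras)
    # Decode the same latents into explicit 3D Gaussian parameters
    # (positions, scales, rotations, opacities, colors).
    gaussians = gs_decoder(latents)
    # Rasterize the Gaussians along the same camera trajectory and align the
    # renderings with the RGB-decoded frames.
    rendered = renderer(gaussians, cameras)          # (T, 3, H, W)
    loss = F.l1_loss(rendered, target_frames)        # stand-in reconstruction loss
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()

Because the pseudo targets are produced under no_grad, the pre-trained components stay frozen and only the 3DGS decoder's parameters are updated by the optimizer.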
We first perform text-to-image generation and then lift the resulting image to 3D Gaussians.
We use a single image from the Waymo dataset for image-to-3D Gaussian generation.
We first perform text-to-video generation and then lift the video to a 4D representation of dynamic 3D Gaussians.
We generate 3D Gaussians from text and then export them into NVIDIA Isaac Sim to simulate humanoid robots in the generated environments. This demo was featured in the NuRec demo at SIGGRAPH 2025.
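For illustration only, the sketch below serializes generated Gaussians to a .ply file in the commonly used 3DGS attribute layout; export_gaussians_ply and its array arguments are hypothetical names, higher-order spherical-harmonic coefficients are omitted for brevity, and the actual import path into NVIDIA Isaac Sim / NuRec used for this demo may differ.

import numpy as np
from plyfile import PlyData, PlyElement

def export_gaussians_ply(path, xyz, colors_dc, opacities, scales, rotations):
    # xyz: (N, 3) positions; colors_dc: (N, 3) DC spherical-harmonic coefficients;
    # opacities: (N, 1); scales: (N, 3) log-scales; rotations: (N, 4) quaternions.
    fields = [("x", "f4"), ("y", "f4"), ("z", "f4"),
              ("nx", "f4"), ("ny", "f4"), ("nz", "f4"),
              ("f_dc_0", "f4"), ("f_dc_1", "f4"), ("f_dc_2", "f4"),
              ("opacity", "f4"),
              ("scale_0", "f4"), ("scale_1", "f4"), ("scale_2", "f4"),
              ("rot_0", "f4"), ("rot_1", "f4"), ("rot_2", "f4"), ("rot_3", "f4")]
    data = np.zeros(xyz.shape[0], dtype=fields)
    # Normals are unused by 3DGS viewers but kept for layout compatibility.
    attrs = np.concatenate(
        [xyz, np.zeros_like(xyz), colors_dc, opacities, scales, rotations], axis=1)
    for (name, _), col in zip(fields, attrs.T):
        data[name] = col.astype(np.float32)
    PlyData([PlyElement.describe(data, "vertex")]).write(path)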
@article{bahmani2025lyra,
title={Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation},
author={Bahmani, Sherwin and Shen, Tianchang and Ren, Jiawei and Huang, Jiahui and Jiang, Yifeng and
Turki, Haithem and Tagliasacchi, Andrea and Lindell, David B. and Gojcic, Zan and
Fidler, Sanja and Ling, Huan and Gao, Jun and Ren, Xuanchi},
journal={arXiv preprint arXiv:2509.19296},
year={2025}
}