Lyra 2.0: Explorable Generative 3D Worlds
NVIDIA Spatial Intelligence Lab

TL;DR: We generate camera-controlled walkthrough videos and lift them to 3D via feed-forward reconstruction. To enable long-horizon, 3D-consistent generation, we address spatial forgetting with per-frame geometry used for information routing, and temporal drifting with self-augmented training that teaches the model to correct its own drift.

Abstract


Recent advances in video generation enable a new paradigm for 3D scene creation: generating camera-controlled videos that simulate scene walkthroughs, then lifting them to 3D via feed-forward reconstruction techniques. This generative reconstruction approach combines the visual fidelity and creative capacity of video models with 3D outputs ready for real-time rendering and simulation.

Scaling to large, complex environments requires 3D-consistent video generation over long camera trajectories with large viewpoint changes and location revisits, a setting where current video models degrade quickly. Existing methods for long-horizon generation are fundamentally limited by two forms of degradation: spatial forgetting and temporal drifting. As exploration proceeds, previously observed regions fall outside the model's temporal context, forcing the model to hallucinate structures when revisited. Meanwhile, autoregressive generation accumulates small synthesis errors over time, gradually distorting scene appearance and geometry.

We present Lyra 2.0, a framework for generating persistent, explorable 3D worlds at scale. To address spatial forgetting, we maintain per-frame 3D geometry and use it solely for information routing—retrieving relevant past frames and establishing dense correspondences with the target viewpoints—while relying on the generative prior for appearance synthesis. To address temporal drifting, we train with self-augmented histories that expose the model to its own degraded outputs, teaching it to correct drift rather than propagate it. Together, these enable substantially longer and 3D-consistent video trajectories, which we leverage to fine-tune feed-forward reconstruction models that reliably recover high-quality 3D scenes.
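The self-augmented-history idea above can be illustrated with a toy sketch. This is a hypothetical illustration, not the paper's implementation: `toy_model`, the frame dimensions, and the replacement probability `p_self` are all invented for the example. The point it demonstrates is only the training recipe itself: the conditioning history is polluted with the model's own (drifted) outputs, while the supervision target stays clean, so the model learns to correct drift rather than propagate it.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_model(history, ctrl=None):
    """Stand-in 'generator': predicts a frame as a noisy mean of its
    history, so repeated application accumulates drift."""
    return history.mean(axis=0) + 0.05 * rng.standard_normal(history.shape[1:])

def self_augmented_batch(clean_frames, model, p_self=0.5, ctrl=None):
    """Build a training history in which some frames are replaced by the
    model's own re-generated versions. The training target elsewhere
    remains the CLEAN frames, teaching the model to undo its own drift."""
    history = clean_frames.copy()
    for t in range(1, len(history)):
        if rng.random() < p_self:
            # Replace ground truth with the model's own output, so the
            # model sees degraded context during training.
            history[t] = model(history[:t], ctrl)
    return history

clean = rng.standard_normal((8, 4))  # 8 frames of a 4-dim toy "video"
aug = self_augmented_batch(clean, toy_model)
```

The first frame is never replaced (it anchors the rollout); every later frame is stochastically swapped for a self-generated one.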

Method



Method overview. (Left) Given an input image, Lyra 2.0 iteratively generates video segments guided by a user-defined camera trajectory from an interactive 3D explorer and an optional text prompt, lifting each segment into 3D point clouds fed back for continued navigation. Generated video frames are finally reconstructed and exported as 3D Gaussians or meshes. (Right) At each step, history frames with maximal visibility of the target views are retrieved from the spatial memory. Their canonical coordinates are warped to establish dense 3D correspondences and injected into DiT via attention, together with compressed temporal history.
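The visibility-driven retrieval step can be sketched as follows. This is a minimal, hypothetical stand-in rather than the released code: it uses a toy pinhole frustum test with no occlusion reasoning, and the function names (`visible_count`, `retrieve_history`) are invented for the example. It captures the routing idea: score each history frame's stored point cloud by how many points project inside the target view, then keep the top-k frames as retrieval candidates.

```python
import numpy as np

def visible_count(points_w, cam_pose, fov_deg=60.0):
    """Count how many of a frame's world-space 3D points fall inside the
    target camera's view frustum (toy pinhole test, no occlusion)."""
    R, t = cam_pose  # world->camera rotation (3x3) and translation (3,)
    pc = points_w @ R.T + t
    in_front = pc[:, 2] > 1e-6
    # Perspective divide; a point is inside if |x/z| and |y/z| < tan(fov/2).
    lim = np.tan(np.radians(fov_deg) / 2)
    z = np.where(in_front, pc[:, 2], 1.0)  # dummy depth for points behind
    inside = in_front & (np.abs(pc[:, 0] / z) < lim) & (np.abs(pc[:, 1] / z) < lim)
    return int(inside.sum())

def retrieve_history(frame_points, target_pose, k=2):
    """Return indices of the k history frames whose stored geometry is
    most visible from the target viewpoint."""
    scores = [visible_count(p, target_pose) for p in frame_points]
    return sorted(np.argsort(scores)[::-1][:k].tolist())
```

For instance, with an identity camera at the origin, a history frame whose points all lie at z = 5 in front of the camera scores higher than one whose points lie behind it, and retrieval ranks frames accordingly.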


Acknowledgments

The authors would like to thank Product Managers Aditya Mahajan and Matt Cragun for their valuable guidance and support. We sincerely acknowledge Merlin Nimier-David, Thomas Mueller-Hoehne, and Alex Keller for their foundational interactive GUI, which our system builds upon. We also thank Oliver Hahn, David Pankratz, Christian Laforte, Gene Liu, and Rafal Karp for insightful discussions and feedback, and Yifeng Jiang, Nicolas Moenne-Loccoz, Tanki Zhang, Aditya Gupta, and Gavriel State for their prompt and invaluable help in creating the Isaac Sim demo.

Citation

@article{shen2026lyra2,
    title={Lyra 2.0: Explorable Generative 3D Worlds},
    author={Shen, Tianchang and Bahmani, Sherwin and He, Kai and Srinivasan, Sangeetha Grama and Cao, Tianshi and Ren, Jiawei and Li, Ruilong and Wang, Zian and Sharp, Nicholas and Gojcic, Zan and Fidler, Sanja and Huang, Jiahui and Ling, Huan and Gao, Jun and Ren, Xuanchi},
    journal={arXiv preprint arXiv:2604.13036},
    year={2026}
}