We present GEN3C, a generative video model with precise Camera Control and temporal 3D Consistency. We achieve this with a 3D cache: point clouds obtained by predicting the pixel-wise depth of seed images or previously generated frames. When generating the next frames, GEN3C is conditioned on the 2D renderings of the 3D cache with the new camera trajectory provided by the user. Our results demonstrate more precise camera control than prior work, as well as state-of-the-art results in sparse-view novel view synthesis, even in challenging settings such as driving scenes and monocular dynamic video.
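To make the 3D cache concrete, below is a minimal, illustrative sketch of the unprojection step: lifting a predicted pixel-wise depth map into a world-space point cloud under a pinhole camera model. The names (`depth`, `K`, `cam_to_world`) are placeholders for illustration, not identifiers from the GEN3C code.

```python
# Minimal sketch (not the GEN3C implementation) of building a point-cloud
# cache from a predicted depth map, assuming a pinhole camera model.
import numpy as np

def unproject_depth(depth: np.ndarray, K: np.ndarray, cam_to_world: np.ndarray) -> np.ndarray:
    """Lift an HxW depth map into an (H*W, 3) point cloud in world space."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))             # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T                            # camera-space rays (z = 1)
    pts_cam = rays * depth.reshape(-1, 1)                      # scale rays by predicted depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]                     # transform to world space
```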
GEN3C can be easily applied to video/scene creation from a single image ... (Results are generated with Cosmos, see SVD results here).
... or sparse-view images (we use 5 images here) ... (Results are generated with Cosmos, see SVD results here).
... and dynamic videos (even with challenging cinematic effects, such as Dolly Zoom, which simultaneously changes camera poses and intrinsics). Compare the background pixels in the left and right videos: the dolly zoom changes the background while the perspective of the foreground object remains unchanged.
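As a side note on the dolly zoom, the pose/intrinsics coupling it requires can be written in one line: keeping the subject's on-screen size constant while the camera pulls back means scaling the focal length in proportion to the subject distance. The sketch below is an assumed illustration, not code from the paper.

```python
# Illustrative sketch of the dolly-zoom coupling between camera pose and
# intrinsics: the ratio f / d (focal length over subject distance) stays
# constant, so the foreground keeps its size while the background changes.
def dolly_zoom_focal(f0: float, subject_dist: float, offset: float) -> float:
    """Focal length that keeps the subject size fixed after moving the
    camera back by `offset` (all names here are hypothetical)."""
    return f0 * (subject_dist + offset) / subject_dist

# Example: a 35 mm lens, subject 2 m away, camera pulled back by 1 m
# -> the focal length must increase to 52.5 mm to compensate.
print(dolly_zoom_focal(35.0, 2.0, 1.0))  # 52.5
```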
Overview of GEN3C. Given user input, which can be a single image, multi-view images, or dynamic video(s), we first build a spatiotemporal 3D cache by predicting the depth of each image and unprojecting it into 3D. Given the camera poses provided by the user, we then render the cache into video(s), which are fed into the video diffusion model to generate a photorealistic video that aligns with the desired camera poses.
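The rendering step can be pictured with a simple point-projection sketch: each cache point is projected into every frame along the user's camera path, and the resulting (incomplete) renderings condition the diffusion model. This is a simplified, hypothetical stand-in for the actual renderer; `points`, `colors`, and `camera_path` are illustrative names.

```python
# Illustrative point-cloud rendering of the 3D cache along a camera path.
# Not the GEN3C renderer; a bare painter's-algorithm projection for clarity.
import numpy as np

def render_cache(points, colors, K, world_to_cam, h, w):
    """Project world-space cache points into a single HxW frame."""
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (pts_h @ world_to_cam.T)[:, :3]               # world -> camera space
    front = cam[:, 2] > 1e-6                            # keep points in front of the camera
    cam, cols = cam[front], colors[front]
    pix = cam @ K.T
    u = np.round(pix[:, 0] / pix[:, 2]).astype(int)
    v = np.round(pix[:, 1] / pix[:, 2]).astype(int)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, cols = u[ok], v[ok], cam[ok, 2], cols[ok]
    image = np.zeros((h, w, 3), dtype=colors.dtype)
    for i in np.argsort(-z):                            # paint far-to-near so near points win
        image[v[i], u[i]] = cols[i]
    return image

# One rendered frame per user camera pose; the stacked frames form the
# conditioning video for the diffusion model (hypothetical usage):
# cond_video = np.stack([render_cache(points, colors, K, w2c, H, W) for w2c in camera_path])
```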
GEN3C generates realistic novel camera trajectories from a given monocular dynamic video. Our model generalizes to out-of-domain videos: the videos on the left are the inputs (generated by Sora or MovieGen), and the videos on the right are our outputs with camera control.
GEN3C is able to generate a realistic viewpoint change of the original video for driving simulation. (Results are generated with Cosmos, see SVD results here).
@inproceedings{ren2025gen3c,
title={GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control},
author={Ren, Xuanchi and Shen, Tianchang and Huang, Jiahui and Ling, Huan and
Lu, Yifan and Nimier-David, Merlin and Müller, Thomas and Keller, Alexander and
Fidler, Sanja and Gao, Jun},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2025}
}
SVD single view gallery results. (View Cosmos results)
SVD drone gallery results. (View Cosmos results)
Qualitative results on two-view novel view synthesis. GEN3C generates realistic videos from just two input images, capturing photorealistic view-dependent effects such as dynamic lighting and reflections (e.g., the reflections on the piano).
Ours vs. MVSplat
Ours vs. PixelSplat
Compared to the baselines, our model generates far more plausible and realistic novel views, with a smooth transition between the two input views, even in cases with minimal overlap and varying lighting conditions.
Qualitative results for single-view to video generation. Compared to baselines, GEN3C generates photorealistic novel view images that precisely align with the camera poses.
SVD driving simulation results. (View Cosmos results)
Ours vs. Nerfacto
Ours vs. 3D-GS
Compared to reconstruction-based methods, the novel views generated by our method contain far fewer artifacts and are more plausible.
Original
Left 3m
GEN3C can also generate novel viewpoints of dynamic driving videos.
Removal
Editing
With an explicit 3D cache, we further support 3D editing, such as removing objects or editing object trajectories for driving simulation.
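Because the cache is an explicit point cloud, such edits reduce to simple point-set operations applied before re-rendering. The sketch below is an assumed illustration of this idea, not the GEN3C editing API.

```python
# Illustrative cache edits (hypothetical interface, not the GEN3C API):
# object removal drops a region's points; trajectory editing shifts them.
import numpy as np

def remove_in_box(points, colors, box_min, box_max):
    """Drop cache points inside an axis-aligned box (object removal)."""
    inside = np.all((points >= box_min) & (points <= box_max), axis=1)
    return points[~inside], colors[~inside]

def translate_in_box(points, box_min, box_max, offset):
    """Shift the points of one object to edit its trajectory."""
    inside = np.all((points >= box_min) & (points <= box_max), axis=1)
    edited = points.copy()
    edited[inside] += offset
    return edited
```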