
GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control

1 NVIDIA
2 University of Toronto
3 Vector Institute

* Equal Contribution

CVPR 2025

Abstract


We present GEN3C, a generative video model with precise Camera Control and temporal 3D Consistency. We achieve this with a 3D cache: point clouds obtained by predicting the pixel-wise depth of seed images or previously generated frames. When generating the next frames, GEN3C is conditioned on the 2D renderings of the 3D cache with the new camera trajectory provided by the user. Our results demonstrate more precise camera control than prior work, as well as state-of-the-art results in sparse-view novel view synthesis, even in challenging settings such as driving scenes and monocular dynamic video.
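To make the mechanism concrete, below is a minimal sketch (not the released code; all function and variable names are illustrative assumptions) of the two operations the abstract describes: lifting a seed image into a point-cloud cache using predicted per-pixel depth, and splatting that cache into a user-specified camera to produce the 2D rendering that conditions the video model.

    import numpy as np

    def unproject_to_cache(image, depth, K):
        """Lift an HxWx3 image into a point cloud via per-pixel depth.
        The seed camera's frame is treated as the world frame."""
        H, W = depth.shape
        u, v = np.meshgrid(np.arange(W), np.arange(H))          # pixel grids, shape (H, W)
        pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3)
        rays = pix @ np.linalg.inv(K).T                          # back-project through the intrinsics
        points = rays * depth.reshape(-1, 1)                     # scale each ray by its depth
        return points, image.reshape(-1, 3)

    def render_cache(points, colors, K, w2c, H, W):
        """Splat cached points into a new camera; nearest point wins per pixel."""
        cam = points @ w2c[:3, :3].T + w2c[:3, 3]                # world -> new camera frame
        keep = cam[:, 2] > 1e-6                                  # only points in front of the camera
        cam, cols = cam[keep], colors[keep]
        uvw = cam @ K.T
        uv = (uvw[:, :2] / uvw[:, 2:3]).astype(int)              # perspective divide
        ok = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
        uv, cols, z = uv[ok], cols[ok], cam[ok, 2]
        order = np.argsort(-z)                                   # painter's algorithm: far to near
        out = np.zeros((H, W, 3), dtype=cols.dtype)
        out[uv[order, 1], uv[order, 0]] = cols[order]
        return out                                               # holes where the cache lacks coverage

In this sketch, the generative model would be conditioned on one such rendering per target frame along the user's trajectory, completing the regions the cache cannot cover; that per-frame geometric conditioning is what yields the precise camera control.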


Monocular Dynamic Novel View Synthesis


GEN3C generates realistic novel camera trajectories from a given monocular dynamic video, and the model generalizes to out-of-domain videos. The videos on the left are the inputs (generated by Sora or MovieGen); the ones on the right are our generated outputs with camera control.


Driving Simulation


GEN3C is able to generate realistic viewpoint changes of the original video for driving simulation. (Results here are generated with Cosmos; SVD results are shown below.)

Citation



    @inproceedings{ren2025gen3c,
        title={GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control},
        author={Ren, Xuanchi and Shen, Tianchang and Huang, Jiahui and Ling, Huan and 
            Lu, Yifan and Nimier-David, Merlin and Müller, Thomas and Keller, Alexander and 
            Fidler, Sanja and Gao, Jun},
        booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
        year={2025}
    }
                
            
SVD Results with Comparisons

SVD single-view gallery results.

SVD drone gallery results.


Two-View Novel View Synthesis


Qualitative results on two-view novel view synthesis. GEN3C generates realistic videos from just two input images, capturing photorealistic view-dependent effects such as changing lighting and reflections (e.g., on the piano).
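As a rough illustration of how the single-image cache sketched earlier extends to two inputs (again a sketch under our own assumptions, reusing the hypothetical helpers from that snippet), each view can be unprojected with its own pose and merged into one shared cloud before rendering cameras between the inputs.

    def fuse_two_views(img_a, depth_a, c2w_a, img_b, depth_b, c2w_b, K):
        """Merge two per-view caches into one world-space point cloud."""
        pts_a, col_a = unproject_to_cache(img_a, depth_a, K)
        pts_b, col_b = unproject_to_cache(img_b, depth_b, K)
        pts_a = pts_a @ c2w_a[:3, :3].T + c2w_a[:3, 3]   # bring each cache into the shared world frame
        pts_b = pts_b @ c2w_b[:3, :3].T + c2w_b[:3, 3]
        points = np.concatenate([pts_a, pts_b])
        colors = np.concatenate([col_a, col_b])
        return points, colors

In this sketch, rendering the fused cache from cameras interpolated between the two input poses would produce the conditioning frames for the smooth transitions shown below.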


Ours vs. MVSplat

Ours vs. PixelSplat

Compared to the baselines, our model generates much more plausible and realistic novel views, with smooth transitions between the two input views, even under minimal overlap and varying lighting conditions.


Single-view to Video Generation


Qualitative results for single-view to video generation. Compared to baselines, GEN3C generates photorealistic novel view images that precisely align with the camera poses.

SVD Driving Simulation

SVD driving simulation results.



Ours vs. Nerfacto

Ours vs. 3D-GS

Compared to reconstruction-based methods, the novel views generated by our method contain far fewer artifacts and are more plausible.


Original

Left 3m

GEN3C can also generate novel viewpoints of dynamic driving videos.
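The "Left 3m" viewpoint above amounts to editing only the camera trajectory. A minimal sketch of such a shift, assuming the common convention that the camera's x-axis points right (so its left is the negative x direction):

    def shift_left(w2c, meters=3.0):
        """World-to-camera matrix for the same camera moved `meters` to its left."""
        c2w = np.linalg.inv(w2c)              # column 0 of c2w is the camera's x-axis in world space
        c2w[:3, 3] -= meters * c2w[:3, 0]     # move the camera center along -x, i.e. to the left
        return np.linalg.inv(c2w)

Applying this to every frame's pose yields the novel trajectory along which the cache is re-rendered.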



Removal

Editing

With an explicit 3D cache, we further support 3D editing, such as removing objects or editing object trajectories for driving simulation.
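Because the cache is an explicit point cloud, such edits reduce to point-set operations applied before re-rendering. A hedged sketch (the function names and the axis-aligned box selection are our illustrative assumptions, not the paper's interface):

    def remove_object(points, colors, box_min, box_max):
        """Object removal: drop cached points inside an axis-aligned 3D box."""
        inside = np.all((points >= box_min) & (points <= box_max), axis=1)
        return points[~inside], colors[~inside]

    def move_object(points, box_min, box_max, offset):
        """Trajectory edit: translate one object's points by `offset` (applied per frame)."""
        inside = np.all((points >= box_min) & (points <= box_max), axis=1)
        moved = points.copy()
        moved[inside] += offset
        return moved

In this sketch, the edited cache is then re-rendered along the camera trajectory, and the video model generates frames consistent with the edit, filling any disocclusions it exposes.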