We present GEN3C, a generative video model with precise Camera Control and temporal 3D Consistency. We achieve this with a 3D cache: point clouds obtained by predicting the pixel-wise depth of seed images or previously generated frames. When generating the next frames, GEN3C is conditioned on the 2D renderings of the 3D cache with the new camera trajectory provided by the user. Our results demonstrate more precise camera control than prior work, as well as state-of-the-art results in sparse-view novel view synthesis, even in challenging settings such as driving scenes and monocular dynamic video.
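To make the 3D cache concrete, below is a minimal, illustrative sketch of the unprojection step: lifting a predicted pixel-wise depth map into a world-space point cloud under a pinhole camera model. The names (`depth`, `K`, `cam_to_world`) are placeholders for illustration, not identifiers from the GEN3C code.

```python
# Minimal sketch (not the GEN3C implementation) of building a point-cloud
# cache from a predicted depth map, assuming a pinhole camera model.
import numpy as np

def unproject_depth(depth: np.ndarray, K: np.ndarray, cam_to_world: np.ndarray) -> np.ndarray:
    """Lift an HxW depth map into an (H*W, 3) point cloud in world space."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))             # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T                            # camera-space rays (z = 1)
    pts_cam = rays * depth.reshape(-1, 1)                      # scale rays by predicted depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]                     # transform to world space
```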
GEN3C can be easily applied to video/scene creation from a single image ... (Results are generated with Cosmos, see SVD results here).
... or sparse-view images (we use 5 images here) ... (Results are generated with Cosmos, see SVD results here).
... and dynamic videos (even with challenging cinematic effects, such as Dolly Zoom, which simultaneously changes camera poses and intrinsics). Compare the background pixels in the left and right videos: the dolly zoom changes the background while the perspective of the foreground object remains unchanged.
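As a side note on the dolly zoom, the pose/intrinsics coupling it requires can be written in one line: keeping the subject's on-screen size constant while the camera pulls back means scaling the focal length in proportion to the subject distance. The sketch below is an assumed illustration, not code from the paper.

```python
# Illustrative sketch of the dolly-zoom coupling between camera pose and
# intrinsics: the ratio f / d (focal length over subject distance) stays
# constant, so the foreground keeps its size while the background changes.
def dolly_zoom_focal(f0: float, subject_dist: float, offset: float) -> float:
    """Focal length that keeps the subject size fixed after moving the
    camera back by `offset` (all names here are hypothetical)."""
    return f0 * (subject_dist + offset) / subject_dist

# Example: a 35 mm lens, subject 2 m away, camera pulled back by 1 m
# -> the focal length must increase to 52.5 mm to compensate.
print(dolly_zoom_focal(35.0, 2.0, 1.0))  # 52.5
```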
Overview of GEN3C. Given user input, which can be a single image, multi-view images, or dynamic video(s), we first build a spatiotemporal 3D cache by predicting the depth of each image and unprojecting it into 3D. Given the camera poses provided by the user, we then render the cache into video(s), which are fed into the video diffusion model to generate a photorealistic video that aligns with the desired camera poses.
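The rendering step can be pictured with a simple point-projection sketch: each cache point is projected into every frame along the user's camera path, and the resulting (incomplete) renderings condition the diffusion model. This is a simplified, hypothetical stand-in for the actual renderer; `points`, `colors`, and `camera_path` are illustrative names.

```python
# Illustrative point-cloud rendering of the 3D cache along a camera path.
# Not the GEN3C renderer; a bare painter's-algorithm projection for clarity.
import numpy as np

def render_cache(points, colors, K, world_to_cam, h, w):
    """Project world-space cache points into a single HxW frame."""
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (pts_h @ world_to_cam.T)[:, :3]               # world -> camera space
    front = cam[:, 2] > 1e-6                            # keep points in front of the camera
    cam, cols = cam[front], colors[front]
    pix = cam @ K.T
    u = np.round(pix[:, 0] / pix[:, 2]).astype(int)
    v = np.round(pix[:, 1] / pix[:, 2]).astype(int)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, cols = u[ok], v[ok], cam[ok, 2], cols[ok]
    image = np.zeros((h, w, 3), dtype=colors.dtype)
    for i in np.argsort(-z):                            # paint far-to-near so near points win
        image[v[i], u[i]] = cols[i]
    return image

# One rendered frame per user camera pose; the stacked frames form the
# conditioning video for the diffusion model (hypothetical usage):
# cond_video = np.stack([render_cache(points, colors, K, w2c, H, W) for w2c in camera_path])
```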
GEN3C generates realistic novel camera trajectories from a given monocular dynamic video. Our model generalizes to out-of-domain videos: the videos on the left are the inputs (generated by Sora or MovieGen), and the videos on the right are our outputs with camera control.
GEN3C is able to generate a realistic viewpoint change of the original video for driving simulation. (Results are generated with Cosmos, see SVD results here).
@inproceedings{ren2025gen3c,
title={GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control},
author={Ren, Xuanchi and Shen, Tianchang and Huang, Jiahui and Ling, Huan and
Lu, Yifan and Nimier-David, Merlin and Müller, Thomas and Keller, Alexander and
Fidler, Sanja and Gao, Jun},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2025}
}
SVD single view gallery results. (View Cosmos results)
SVD drone gallery results. (View Cosmos results)
Qualitative results on two-view novel view synthesis. GEN3C generates realistic videos from just two input images, capturing photorealistic view-dependent effects such as dynamic lighting and reflections (e.g., the reflections on the piano).
Ours vs. MVSplat
Ours vs. PixelSplat
Compared to the baselines, our model generates far more plausible and realistic novel views, with a smooth transition between the two input views, even in cases with minimal overlap and varying lighting conditions.
Qualitative results for single-view to video generation. Compared to baselines, GEN3C generates photorealistic novel view images that precisely align with the camera poses.
SVD driving simulation results. (View Cosmos results)
Ours vs. Nerfacto
Ours vs. 3D-GS
Compared to reconstruction-based methods, the novel views generated by our method contain far fewer artifacts and are more plausible.
Original
Left 3m
GEN3C can also generate novel viewpoints of dynamic driving videos.
Removal
Editing
With an explicit 3D cache, we further support 3D editing, such as removing objects or editing object trajectories for driving simulation.
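Because the cache is an explicit point cloud, such edits reduce to simple point-set operations applied before re-rendering. The sketch below is an assumed illustration of this idea, not the GEN3C editing API.

```python
# Illustrative cache edits (hypothetical interface, not the GEN3C API):
# object removal drops a region's points; trajectory editing shifts them.
import numpy as np

def remove_in_box(points, colors, box_min, box_max):
    """Drop cache points inside an axis-aligned box (object removal)."""
    inside = np.all((points >= box_min) & (points <= box_max), axis=1)
    return points[~inside], colors[~inside]

def translate_in_box(points, box_min, box_max, offset):
    """Shift the points of one object to edit its trajectory."""
    inside = np.all((points >= box_min) & (points <= box_max), axis=1)
    edited = points.copy()
    edited[inside] += offset
    return edited
```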