TL;DR: PlenopticDreamer re-renders input video under novel camera trajectories while preserving long-term spatio-temporal memory in an autoregressive manner.

Abstract

We present PlenopticDreamer, a method for plenoptic video generation that re-renders input video under novel camera trajectories while preserving long-term spatio-temporal memory in an autoregressive manner. Its core is an autoregressive multi-camera video generator that retrieves video–camera pairs from the memory bank using a 3D FOV-based video retrieval strategy. Conditioned on the retrieved pairs and the target camera, the model performs noisy scheduling and learnable reconstruction to generate the target video. To enable long video generation, a portion of the preceding frames is preserved as clean inputs during training.

Results & Capabilities

Memory-Preserving Multi-Shot Generation

PlenopticDreamer enables multi-shot video re-rendering while preserving spatio-temporal memory from previously generated shots.

Long Video Generation

Our method supports consistently long-range video re-rendering across various frames.

Focal Length Effect

Our method simulates varying depth-of-field effects corresponding to different focal lengths (18mm→100mm). Shorter focal length leads to larger change in field-of-view (FOV) in generated video.

Method

PlenopticDreamer pipeline

Its core is an autoregressive multi-camera video generator that retrieves $k$ video–camera pairs $\{(\mathbf{P}^n, \mathbf{V}^n)\}_{n=1}^k$ from the memory bank using a 3D FOV–based video retrieval strategy. Conditioned on these retrieved pairs and the target camera $\mathbf{P}^{k+1}$, the model performs noisy scheduling and learnable reconstruction to generate the target video $\mathbf{V}^{k+1}$. To enable long video generation, a portion of the preceding frames in $\mathbf{V}^{k+1}$ is preserved as clean inputs at a certain ratio during training. Within each DiT block, temporal concatenation is applied to form video tokens $\mathbf{x}$ as in-context condition.

Evaluation

Comparison with SOTAs

Main comparison — basic
Main comparison — AgiBot

Ablation Study

(1) w/o Self-Cond. Training: Removing the self-conditioned training strategy described in Sec. 3.3 leads to noticeably degraded visual quality.
(2) w/ Random Context Videos: Replacing the video retrieval strategy with random retrieval results in degraded view synchronization.
(3) w/o Progressive Training: Removing the progressive training strategy described in Sec. 3.3 leads to poor camera accuracy.

Citation

@inproceedings{fu2025plenoptic,
  title={Plenoptic Video Generation},
  author={Fu, Xiao and Tang, Shitao and Shi, Min and Liu, Xian and Gu, Jinwei and Liu, Ming-Yu and Lin, Dahua and Lin, Chen-Hsuan},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}