Feed-Forward Bullet-Time Reconstruction of Dynamic Scenes from Monocular Videos

Hanxue Liang1,2* Jiawei Ren1,3* Ashkan Mirzaei1,4* Antonio Torralba1,5 Ziwei Liu3 Igor Gilitschenski4 Sanja Fidler1,4,6 Cengiz Oztireli2 Huan Ling1,4,6 Zan Gojcic1† Jiahui Huang1† 1 NVIDIA 2 University of Cambridge 3 Nanyang Technological University 4 University of Toronto 5 MIT 6 Vector Institute */†: Equal contribution/advising

What do we do?

Framework figure

BTimer: Given a monocular video as input, our method reconstructs a 3D Gaussian Splatting (3DGS) representation at any desired timestamp in a feed-forward fashion.

Technical Contributions

We present the first feed-forward reconstruction model for dynamic scenes using a bullet-time formulation.

BTimer reconstructs a bullet-time scene within 150 ms while achieving state-of-the-art performance on both static and dynamic scene benchmarks, even compared with optimization-based approaches.

Method Framework

Framework figure

The model takes as input a sequence of context frames and their Plücker embeddings, along with the context timestamp and target ("bullet") timestamp embeddings. It then directly predicts the 3DGS representation at the bullet timestamp.
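The per-frame Plücker embedding mentioned above is a standard 6-channel encoding of camera rays, concatenating each ray's direction with its moment (origin × direction). Below is a minimal numpy sketch of how such an embedding is typically computed from a pinhole intrinsic matrix and a camera-to-world pose; the function name and tensor layout are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def plucker_embedding(K, c2w, H, W):
    """Compute a (H, W, 6) Plücker ray embedding from pinhole intrinsics K
    (3x3) and a camera-to-world pose c2w (4x4). Illustrative sketch only."""
    # Pixel grid sampled at pixel centers.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    # Unproject pixels to camera-space ray directions.
    dirs_cam = np.stack([(u - K[0, 2]) / K[0, 0],
                         (v - K[1, 2]) / K[1, 1],
                         np.ones_like(u)], axis=-1)           # (H, W, 3)
    # Rotate into world space and normalize to unit length.
    dirs = dirs_cam @ c2w[:3, :3].T
    dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)
    # All rays share the camera center as their origin.
    origin = np.broadcast_to(c2w[:3, 3], dirs.shape)          # (H, W, 3)
    # Plücker coordinates: (direction, moment), moment = origin x direction.
    moment = np.cross(origin, dirs)
    return np.concatenate([dirs, moment], axis=-1)            # (H, W, 6)
```

The resulting 6-channel map is concatenated with the RGB frame channel-wise, giving the network a per-pixel description of camera geometry that is invariant to the point chosen along each ray.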

Results

Qualitative Results

Qualitative Results on SORA scenes. We pause the video at a frame and move the camera to make a bullet-time effect.
Qualitative Results on DAVIS dataset. Left is the input video, right is the video rendered from a novel camera trajectory.

Qualitative Results on DyCheck iPhone dataset. Left is the input video, right two are the videos rendered from novel camera trajectories.

Dynamic Novel View Synthesis Benchmark Results

Quantitative Comparison
Baseline results on NVIDIA Dynamic Scene Dataset.
Baseline comparisons on DyCheck iPhone Scenes. We mask areas that are not co-visible.

Results on Static Scenes

Results on the Tanks & Temples static scenes benchmark.

Effect of the NTE Module

Citation


@article{liang2024btimer,
  title={Feed-Forward Bullet-Time Reconstruction of Dynamic Scenes from Monocular Videos},
  author={Liang, Hanxue and Ren, Jiawei and Mirzaei, Ashkan and Torralba, Antonio and Liu, Ziwei and Gilitschenski, Igor and Fidler, Sanja and Oztireli, Cengiz and Ling, Huan and Gojcic, Zan and Huang, Jiahui},
  journal={arXiv preprint arXiv:2412.03526},
  year={2024}
}