Toronto AI Lab

L4GM:

Large 4D Gaussian Reconstruction Model

1 NVIDIA, 2 University of Toronto, 3 University of Cambridge, 4 MIT, 5 S-Lab, Nanyang Technological University



Video-to-4D Synthesis
L4GM generates 4D objects from videos within seconds.


Reconstructing Long, High-FPS, In-the-wild Videos
L4GM can reconstruct 10-second-long, 30-fps videos.


4D Interpolation
We train a 4D interpolation model that increases the framerate by 3x.
Left: before interpolation. Right: after interpolation.




Abstract


We present L4GM, the first 4D Large Reconstruction Model that produces animated objects from a single-view video input, in a single feed-forward pass that takes only a second. Key to our success is a novel dataset of multiview videos containing curated, rendered animated objects from Objaverse. This dataset depicts 44K diverse objects with 110K animations, rendered from 48 viewpoints, resulting in 12M videos with a total of 300M frames. We keep L4GM simple for scalability and build directly on top of LGM, a pretrained 3D Large Reconstruction Model that outputs 3D Gaussian ellipsoids from multiview image input. L4GM outputs a per-frame 3D Gaussian Splatting representation from video frames sampled at a low fps and then upsamples the representation to a higher fps for temporal smoothness. We add temporal self-attention layers to the base LGM to help it learn consistency across time, and use a per-timestep multiview rendering loss to train the model. The representation is upsampled to a higher framerate by training an interpolation model that produces intermediate 3D Gaussian representations. We show that L4GM, trained only on synthetic data, generalizes extremely well to in-the-wild videos, producing high-quality animated 3D assets.



Large 4D Gaussian Reconstruction Model


Our model takes a single-view video and single-timestep multiview images as input, and outputs a set of 4D Gaussians. It adopts a U-Net architecture and uses cross-view self-attention for multiview consistency and cross-time (temporal) self-attention for temporal consistency.
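As a rough illustration (not the released implementation), the two attention types can be interleaved inside a U-Net block by reshaping a batch of feature tokens over views and timesteps; the shapes, module names, and token layout below are assumptions for the sketch.

```python
import torch
import torch.nn as nn

class ViewTimeAttentionBlock(nn.Module):
    """Sketch of interleaved cross-view and temporal self-attention."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.view_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, T: int, V: int) -> torch.Tensor:
        # x: (B*T*V, N, C) tokens from the U-Net feature map,
        # batch dimension ordered as (batch, time, view).
        BTV, N, C = x.shape
        B = BTV // (T * V)

        # Cross-view self-attention: attend over tokens of all V views at one timestep.
        h = x.reshape(B * T, V * N, C)
        q = self.norm1(h)
        h = h + self.view_attn(q, q, q)[0]

        # Temporal self-attention: attend across the T timesteps for each view token.
        h = h.reshape(B, T, V * N, C).transpose(1, 2).reshape(B * V * N, T, C)
        q = self.norm2(h)
        h = h + self.time_attn(q, q, q)[0]

        # Restore the original (B*T*V, N, C) layout.
        h = h.reshape(B, V * N, T, C).transpose(1, 2).reshape(B * T * V, N, C)
        return h
```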

L4GM supports autoregressive reconstruction: the multiview rendering of the last Gaussians of one chunk serves as the multiview input for the next reconstruction, with a one-frame overlap between consecutive reconstructions. Additionally, we train a 4D interpolation model, which takes the interpolated multiview videos rendered from the reconstruction results as input and outputs interpolated Gaussians.
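The loop below sketches this pipeline under assumed interfaces; `l4gm`, `interp_model`, and `render_multiview` are hypothetical callables standing in for the reconstruction model, the 4D interpolation model, and a Gaussian splatting renderer.

```python
def reconstruct_long_video(l4gm, interp_model, render_multiview, frames, chunk=16):
    """frames: single-view video frames sampled at a low fps."""
    gaussians = []
    mv_images = None  # multiview images of the first timestep
    start = 0
    while True:
        clip = frames[start:start + chunk]
        per_frame_gaussians = l4gm(clip, mv_images)  # one 3D Gaussian set per frame
        # Skip the overlapping first frame on all chunks after the first.
        gaussians.extend(per_frame_gaussians if start == 0 else per_frame_gaussians[1:])
        if start + chunk >= len(frames):
            break
        # Render the last Gaussians to multiview images for the next chunk,
        # keeping a one-frame overlap between consecutive reconstructions.
        mv_images = render_multiview(per_frame_gaussians[-1])
        start += chunk - 1

    # 4D interpolation: feed multiview videos rendered from the low-fps result
    # and predict intermediate Gaussians to upsample the frame rate (e.g. 3x).
    return interp_model([render_multiview(g) for g in gaussians])
```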

Paper


L4GM:
Large 4D Gaussian Reconstruction Model

Jiawei Ren, Kevin Xie, Ashkan Mirzaei, Hanxue Liang, Xiaohui Zeng, Karsten Kreis, Ziwei Liu, Antonio Torralba, Sanja Fidler, Seung Wook Kim, Huan Ling

Paper
BibTeX

Citation


@misc{ren2024l4gm,
      title={L4GM: Large 4D Gaussian Reconstruction Model}, 
      author={Jiawei Ren and Kevin Xie and Ashkan Mirzaei and Hanxue Liang and Xiaohui Zeng and Karsten Kreis and Ziwei Liu and Antonio Torralba and Sanja Fidler and Seung Wook Kim and Huan Ling},
      year={2024},
      eprint={2406.10324},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}