Coherent 3D Portrait Video Reconstruction via Triplane Fusion

We synthesize 3D portrait videos that achieve consistent identity under varying head poses, while faithfully capturing dynamic conditions like lighting, expressions, and shoulder poses.

Abstract

Recent breakthroughs in single-image 3D portrait reconstruction have enabled telepresence systems to stream 3D portrait videos from a single camera in real time, potentially democratizing telepresence. However, per-frame 3D reconstruction exhibits temporal inconsistency and forgets the user's appearance. On the other hand, self-reenactment methods can render coherent 3D portraits for telepresence applications by driving a personalized 3D prior, but they fail to faithfully reconstruct the user's per-frame appearance (e.g., facial expressions and lighting). In this work, we recognize the need to maintain both a personalized, stable appearance and dynamic video conditions to enable the best possible user experience. To this end, we propose a new fusion-based 3D portrait reconstruction method that captures the authentic dynamic appearance of the user while fusing it with a personalized 3D subject prior, producing temporally stable 3D videos with consistent personalized appearance and structure. Trained only on synthetic data produced by an expression-conditioned 3D GAN, our encoder-based method achieves both state-of-the-art 3D reconstruction accuracy and temporal consistency on in-studio and in-the-wild datasets.

In-the-Wild Results

Comparison to Per-Frame Reconstruction (LP3D)

LP3D shows artifacts and identity distortion as the person rotates their head. Our method maintains coherent reconstruction.

Comparison to Reenactment (GPAvatar)

GPAvatar is temporally coherent despite head rotations, but exhibits dampened expressions and fails to capture dynamic conditions like lighting, subtle expressions, and shoulder poses. Our method achieves both coherency and faithful reconstruction of expressions and dynamic conditions.

The size jitter comes from the off-the-shelf face detectors, whereas GPAvatar only drives a fixed image and thus shows no size jitter.

Methodology

Coherent and Faithful Reconstruction via Triplane Fusion

Inference

Inference: Given a (near-)frontal reference image and an input frame, we reconstruct a triplane prior and a raw triplane, respectively, using an improved and frozen LP3D. Next, we combine the two triplanes through a triplane fusion module that ensures temporal consistency while capturing dynamic conditions, including lighting, expression, and pose. The fusion module consists of (1) the Triplane Undistorter, which uses the triplane prior as a reference to remove distortions (typically caused by challenging head poses) from the raw LP3D triplane; and (2) the Triplane Fuser, which leverages the triplane prior to recover occluded areas and further preserve identity.
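For concreteness, the PyTorch-style sketch below outlines this inference flow. It is a minimal illustration under stated assumptions, not the released implementation: the module interfaces, the residual single-convolution design, and the 96-channel triplane shape are all placeholders chosen for readability.

# Minimal PyTorch-style sketch of the inference pipeline described above.
# The module interfaces, residual design, and 96-channel triplane shape are
# illustrative assumptions, not the released implementation.
import torch
import torch.nn as nn

class TriplaneUndistorter(nn.Module):
    # Uses the triplane prior as a reference to remove pose-induced
    # distortion from the raw per-frame triplane.
    def __init__(self, channels=96):
        super().__init__()
        self.net = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, raw_triplane, prior_triplane):
        return raw_triplane + self.net(torch.cat([raw_triplane, prior_triplane], dim=1))

class TriplaneFuser(nn.Module):
    # Leverages the triplane prior to recover occluded areas and reinforce identity.
    def __init__(self, channels=96):
        super().__init__()
        self.net = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, undistorted_triplane, prior_triplane):
        return undistorted_triplane + self.net(
            torch.cat([undistorted_triplane, prior_triplane], dim=1))

@torch.no_grad()
def reconstruct_frame(lp3d, undistorter, fuser, reference_image, input_frame):
    # Lift the reference image and the current frame into triplanes with the
    # frozen LP3D encoder, then undistort and fuse them.
    prior_triplane = lp3d(reference_image)  # personalized 3D prior
    raw_triplane = lp3d(input_frame)        # per-frame raw reconstruction
    undistorted = undistorter(raw_triplane, prior_triplane)
    fused = fuser(undistorted, prior_triplane)
    return fused  # rendered from novel views by a downstream volume renderer

In the actual method, the Undistorter and Fuser are learned networks; the single convolutions above merely stand in for them.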

Training

Training: Our model is trained using only synthetic video data generated by a 3D-aware GAN (Next3D), with carefully designed augmentation methods to account for shoulder motion and lighting changes. At each training iteration, we randomly sample a pair of FLAME coefficients and landmarks from the pre-processed FFHQ dataset and a single style code corresponding to a random identity, which we input to Next3D to generate the near-frontal Reference Image and the Input Frame. To supervise the undistortion and fusion processes, we use the frozen LP3D to generate frontal triplanes, which serve as pseudo-ground truth for the Undistorted Triplane and the Fused Triplane.
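As a rough sketch of this training loop, the code below assumes hypothetical wrappers next3d.synthesize(...) for the pretrained Next3D generator and lp3d(...) for the frozen LP3D encoder, and uses a simple L1 triplane loss; the actual losses and augmentations (shoulder motion, lighting changes) are more involved.

# Illustrative training step for the Undistorter and Fuser, assuming hypothetical
# next3d / lp3d wrappers; only the fusion modules receive gradients, since
# Next3D and LP3D stay frozen.
import torch
import torch.nn.functional as F

def training_step(next3d, lp3d, undistorter, fuser, optimizer, flame_pair, style_code):
    flame_ref, flame_input = flame_pair  # FLAME coefficients + landmarks sampled from FFHQ

    # Render a near-frontal reference image and an input frame of the same
    # synthetic identity (shoulder/lighting augmentations omitted here).
    reference_image = next3d.synthesize(style_code, flame_ref, frontal=True)
    input_frame = next3d.synthesize(style_code, flame_input, frontal=False)

    with torch.no_grad():
        prior_triplane = lp3d(reference_image)  # personalized prior
        raw_triplane = lp3d(input_frame)        # per-frame raw triplane
        # Pseudo-ground truth: frozen LP3D applied to a frontal rendering of the
        # input expression.
        frontal_frame = next3d.synthesize(style_code, flame_input, frontal=True)
        pseudo_gt = lp3d(frontal_frame)

    undistorted = undistorter(raw_triplane, prior_triplane)
    fused = fuser(undistorted, prior_triplane)

    # Supervise both the undistortion and the fusion outputs.
    loss = F.l1_loss(undistorted, pseudo_gt) + F.l1_loss(fused, pseudo_gt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()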

Citation

@misc{wang2024coherent,
      title={Coherent 3D Portrait Video Reconstruction via Triplane Fusion}, 
      author={Shengze Wang and Xueting Li and Chao Liu and Matthew Chan and Michael Stengel and Josef Spjut and Henry Fuchs and Shalini De Mello and Koki Nagano},
      year={2024},
      eprint={2405.00794},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

This project is built on top of LP3D:

@inproceedings{Trevithick2023,
      author = {Alex Trevithick and Matthew Chan and Towaki Takikawa and Umar Iqbal and Shalini De Mello and Manmohan Chandraker and Ravi Ramamoorthi and Koki Nagano},
      title = {Rendering Every Pixel for High-Fidelity Geometry in 3D GANs},
      booktitle = {arXiv},
      year = {2023}
}
LP3D was also developed into a complete 3D video conferencing system presented at SIGGRAPH Emerging Technologies 2023:

@inproceedings{stengel2023,
      author = {Michael Stengel and Koki Nagano and Chao Liu and Matthew Chan and Alex Trevithick and Shalini De Mello and Jonghyun Kim and David Luebke},
      title = {AI-Mediated 3D Video Conferencing},
      booktitle = {ACM SIGGRAPH Emerging Technologies},
      year = {2023}
}

Acknowledgments

We thank Marc Chmielewski and Xinjie Yao for helping with data capture. This website is based on the EG3D website template and the WYSIWYG website template.