We synthesize 3D portrait videos that achieve consistent identity under varying head poses, while faithfully capturing dynamic conditions like lighting, expressions, and shoulder poses.
Recent breakthroughs in single-image 3D portrait reconstruction have enabled telepresence systems to stream 3D portrait videos from a single camera in real time, democratizing telepresence. However, per-frame 3D reconstruction exhibits temporal inconsistency and forgets the user’s appearance. On the other hand, self-reenactment methods can render coherent 3D portraits by driving a 3D avatar built from a single reference image, but they fail to faithfully preserve the user’s per-frame appearance (e.g., instantaneous facial expression and lighting). As a result, neither framework is an ideal solution for democratized 3D telepresence. In this work, we address this dilemma and propose a novel solution that maintains both coherent identity and dynamic per-frame appearance to enable the best possible realism. To this end, we propose a fusion-based method that combines the best of both worlds by fusing a canonical 3D prior from a reference view with dynamic appearance from per-frame input views, producing temporally stable 3D videos that faithfully reconstruct the user’s per-frame appearance. Trained using only synthetic data produced by an expression-conditioned 3D GAN, our encoder-based method achieves both state-of-the-art 3D reconstruction and temporal consistency on in-studio and in-the-wild datasets.
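At a high level, the canonical triplane is computed once from the reference image and then fused with each incoming frame’s triplane before rendering. The following is a minimal sketch of that streaming loop, under our own assumptions: `lp3d`, `fuser`, and `renderer` are placeholder callables standing in for the frozen LP3D encoder, the Triplane Fuser, and the neural renderer, and the tensor shapes noted in comments follow EG3D-style triplane conventions rather than anything stated on this page.

```python
# Hypothetical streaming loop; model wrappers and shapes are assumptions.
import torch

def stream_portrait(reference_img, frames, cameras, lp3d, fuser, renderer):
    """reference_img: [1, 3, H, W]; frames/cameras: per-frame inputs and target camera poses."""
    with torch.no_grad():
        canonical = lp3d(reference_img)     # canonical 3D prior, computed once from the reference
    for frame, cam in zip(frames, cameras):
        with torch.no_grad():
            raw = lp3d(frame)               # per-frame triplane carrying dynamic appearance
            fused = fuser(raw, canonical)   # undistort, then fuse with the canonical prior
            out = renderer(fused, cam)      # render the fused triplane at the target pose
        yield out
```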
LP3D shows artifacts and identity distortion as the person rotates their head. Our method maintains coherent reconstruction.
GPAvatar is temporally coherent despite head rotations, but it exhibits dampened expressions and fails to capture dynamic conditions like lighting, subtle expressions, and shoulder poses. Our method achieves both coherence and faithful reconstruction of expressions and dynamic conditions.
The size jitter comes from the off-the-shelf face detector; GPAvatar only drives a fixed reference image and therefore does not show this jitter.
Inference: Given a (near) frontal reference image and an input frame, we reconstruct a canonical triplane and a raw triplane, respectively, using an improved and frozen LP3D. Next, we combine the two triplanes through a Triplane Fuser module that ensures temporal consistency while capturing real-time dynamic conditions, including lighting, expression, and pose. The Triplane Fuser consists of (1) the Triplane Undistorter, which uses the canonical triplane as a reference to remove distortion (typically caused by challenging head poses) from the raw LP3D triplane; and (2) the Triplane Fuser, which leverages the canonical triplane to recover occluded areas and further preserve identity.
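The page does not detail the internals of these two stages, so the following PyTorch sketch is only one plausible instantiation: each stage is a small residual conv block over channel-concatenated triplanes, triplanes are assumed to be [B, 3, C, H, W] tensors, and the class and attribute names (`ResidualMerge`, `undistorter`, `merger`) are ours.

```python
# Illustrative two-stage fuser; architecture and shapes are assumptions.
import torch
import torch.nn as nn


class ResidualMerge(nn.Module):
    """Predicts a residual update for `x`, conditioned on a reference triplane."""

    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x, ref):
        b, p, c, h, w = x.shape                                    # fold the 3 planes into the batch
        inp = torch.cat([x, ref], dim=2).view(b * p, 2 * c, h, w)  # concatenate along channels
        return x + self.net(inp).view(b, p, c, h, w)


class TriplaneFuser(nn.Module):
    """Undistorts the raw per-frame triplane, then fuses it with the canonical one."""

    def __init__(self, channels: int = 32):
        super().__init__()
        self.undistorter = ResidualMerge(channels)  # removes pose-induced distortion
        self.merger = ResidualMerge(channels)       # recovers occlusions, preserves identity

    def forward(self, raw, canonical):
        undistorted = self.undistorter(raw, canonical)
        return self.merger(undistorted, canonical)
```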
Training: Our model is trained using only synthetic video data generated by a 3D-aware GAN (Next3D), with carefully designed augmentation methods to account for shoulder motion and lighting changes. At each training iteration, we randomly sample a pair of FLAME coefficients and landmarks from the pre-processed FFHQ dataset and a single style code corresponding to a random identity, which we input to Next3D to generate the near-frontal Reference Image and the Input Frame. To supervise the undistortion and fusion processes, we use the frozen LP3D to generate the frontal triplanes, which serve as pseudo-ground-truth for the Undistorted Triplane and the Fused Triplane.
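A hedged sketch of one training iteration follows, reusing the TriplaneFuser sketch above. Here `next3d_render`, `lp3d_frozen`, and `sampler` are placeholder wrappers for Next3D, the frozen LP3D, and the FFHQ FLAME/landmark sampler; the pose handling, augmentation placement, and the plain L1 objective on the pseudo-ground-truth triplanes are our assumptions, not necessarily the full training objective.

```python
# Illustrative training step; component wrappers and the loss are assumptions.
import torch
import torch.nn.functional as F


def training_step(next3d_render, lp3d_frozen, fuser, sampler, optimizer):
    # One random identity (style code) with a pair of FLAME coefficients /
    # landmarks sampled from the pre-processed FFHQ dataset.
    style = sampler.sample_style_code()
    flame_ref, flame_input = sampler.sample_flame_pair()

    with torch.no_grad():
        # Synthetic supervision rendered by Next3D (shoulder/lighting
        # augmentations are assumed to happen inside the wrapper).
        reference_img = next3d_render(style, flame_ref, frontal=True)
        input_frame   = next3d_render(style, flame_input, frontal=False)  # pose handling is assumed
        frontal_frame = next3d_render(style, flame_input, frontal=True)

        canonical = lp3d_frozen(reference_img)  # canonical triplane from the reference
        raw       = lp3d_frozen(input_frame)    # raw per-frame triplane (possibly distorted)
        pseudo_gt = lp3d_frozen(frontal_frame)  # frontal triplane used as pseudo-ground-truth

    # Supervise both the undistorted and the fused triplanes against the pseudo-ground-truth.
    undistorted = fuser.undistorter(raw, canonical)
    fused       = fuser.merger(undistorted, canonical)
    loss = F.l1_loss(undistorted, pseudo_gt) + F.l1_loss(fused, pseudo_gt)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```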
This project is built on top of LP3D.
@inproceedings{Trevithick2023,
  author    = {Alex Trevithick and Matthew Chan and Towaki Takikawa and Umar Iqbal and Shalini De Mello and Manmohan Chandraker and Ravi Ramamoorthi and Koki Nagano},
  title     = {Rendering Every Pixel for High-Fidelity Geometry in 3D GANs},
  booktitle = {arXiv},
  year      = {2023}
}

@inproceedings{stengel2023,
  author    = {Michael Stengel and Koki Nagano and Chao Liu and Matthew Chan and Alex Trevithick and Shalini De Mello and Jonghyun Kim and David Luebke},
  title     = {AI-Mediated 3D Video Conferencing},
  booktitle = {ACM SIGGRAPH Emerging Technologies},
  year      = {2023}
}