We synthesize 3D portrait videos that achieve consistent identity under varying head poses, while faithfully capturing dynamic conditions like lighting, expressions, and shoulder poses.
Recent breakthroughs in single-image 3D portrait reconstruction have enabled telepresence systems to stream 3D portrait videos from a single camera in real time, democratizing telepresence. However, per-frame 3D reconstruction exhibits temporal inconsistency and forgets the user’s appearance. On the other hand, self-reenactment methods can render coherent 3D portraits by driving a 3D avatar built from a single reference image, but they fail to faithfully preserve the user’s per-frame appearance (e.g., instantaneous facial expression and lighting). As a result, neither framework is an ideal solution for democratized 3D telepresence. In this work, we address this dilemma and propose a novel solution that maintains both coherent identity and dynamic per-frame appearance to enable the best possible realism. To this end, we propose a fusion-based method that combines the best of both worlds by fusing a canonical 3D prior from a reference view with dynamic appearance from per-frame input views, producing temporally stable 3D videos that faithfully reconstruct the user’s per-frame appearance. Trained using only synthetic data produced by an expression-conditioned 3D GAN, our encoder-based method achieves both state-of-the-art 3D reconstruction and temporal consistency on in-studio and in-the-wild datasets.
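At a high level, the canonical triplane is computed once from the reference image and then fused with each incoming frame’s triplane before rendering. The following is a minimal sketch of that streaming loop, under our own assumptions: `lp3d`, `fuser`, and `renderer` are placeholder callables standing in for the frozen LP3D encoder, the Triplane Fuser, and the neural renderer, and the tensor shapes noted in comments follow EG3D-style triplane conventions rather than anything stated on this page.

```python
# Hypothetical streaming loop; model wrappers and shapes are assumptions.
import torch

def stream_portrait(reference_img, frames, cameras, lp3d, fuser, renderer):
    """reference_img: [1, 3, H, W]; frames/cameras: per-frame inputs and target camera poses."""
    with torch.no_grad():
        canonical = lp3d(reference_img)     # canonical 3D prior, computed once from the reference
    for frame, cam in zip(frames, cameras):
        with torch.no_grad():
            raw = lp3d(frame)               # per-frame triplane carrying dynamic appearance
            fused = fuser(raw, canonical)   # undistort, then fuse with the canonical prior
            out = renderer(fused, cam)      # render the fused triplane at the target pose
        yield out
```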
LP3D shows artifacts and identity distortion as the person rotates their head. Our method maintains coherent reconstruction.
GPAvatar is temporally coherent despite head rotations, but it exhibits dampened expressions and fails to capture dynamic conditions like lighting, subtle expressions, and shoulder poses. Our method achieves both coherence and faithful reconstruction of expressions and dynamic conditions.
The size jitter comes from the off-the-shelf face detector; GPAvatar only drives a fixed reference image and therefore does not show this jitter.
Inference: Given a (near) frontal reference image and an input frame, we reconstruct a canonical triplane and a raw triplane, respectively, using an improved and frozen LP3D. Next, we combine the two triplanes through a Triplane Fuser module that ensures temporal consistency while capturing real-time dynamic conditions, including lighting, expression, and pose. The Triplane Fuser consists of (1) the Triplane Undistorter, which uses the canonical triplane as a reference to remove distortion (typically caused by challenging head poses) from the raw LP3D triplane; and (2) the Triplane Fuser, which leverages the canonical triplane to recover occluded areas and further preserve identity.
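The page does not detail the internals of these two stages, so the following PyTorch sketch is only one plausible instantiation: each stage is a small residual conv block over channel-concatenated triplanes, triplanes are assumed to be [B, 3, C, H, W] tensors, and the class and attribute names (`ResidualMerge`, `undistorter`, `merger`) are ours.

```python
# Illustrative two-stage fuser; architecture and shapes are assumptions.
import torch
import torch.nn as nn


class ResidualMerge(nn.Module):
    """Predicts a residual update for `x`, conditioned on a reference triplane."""

    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x, ref):
        b, p, c, h, w = x.shape                                    # fold the 3 planes into the batch
        inp = torch.cat([x, ref], dim=2).view(b * p, 2 * c, h, w)  # concatenate along channels
        return x + self.net(inp).view(b, p, c, h, w)


class TriplaneFuser(nn.Module):
    """Undistorts the raw per-frame triplane, then fuses it with the canonical one."""

    def __init__(self, channels: int = 32):
        super().__init__()
        self.undistorter = ResidualMerge(channels)  # removes pose-induced distortion
        self.merger = ResidualMerge(channels)       # recovers occlusions, preserves identity

    def forward(self, raw, canonical):
        undistorted = self.undistorter(raw, canonical)
        return self.merger(undistorted, canonical)
```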
Training: Our model is trained using only synthetic video data generated by a 3D-aware GAN (Next3D), with carefully designed augmentation methods to account for shoulder motion and lighting changes. At each training iteration, we randomly sample a pair of FLAME coefficients and landmarks from the pre-processed FFHQ dataset and a single style code corresponding to a random identity, which we input to Next3D to generate the near-frontal Reference Image and the Input Frame. To supervise the undistortion and fusion processes, we use the frozen LP3D to generate the frontal triplanes, which serve as pseudo-ground-truth for the Undistorted Triplane and the Fused Triplane.
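A hedged sketch of one training iteration follows, reusing the TriplaneFuser sketch above. Here `next3d_render`, `lp3d_frozen`, and `sampler` are placeholder wrappers for Next3D, the frozen LP3D, and the FFHQ FLAME/landmark sampler; the pose handling, augmentation placement, and the plain L1 objective on the pseudo-ground-truth triplanes are our assumptions, not necessarily the full training objective.

```python
# Illustrative training step; component wrappers and the loss are assumptions.
import torch
import torch.nn.functional as F


def training_step(next3d_render, lp3d_frozen, fuser, sampler, optimizer):
    # One random identity (style code) with a pair of FLAME coefficients /
    # landmarks sampled from the pre-processed FFHQ dataset.
    style = sampler.sample_style_code()
    flame_ref, flame_input = sampler.sample_flame_pair()

    with torch.no_grad():
        # Synthetic supervision rendered by Next3D (shoulder/lighting
        # augmentations are assumed to happen inside the wrapper).
        reference_img = next3d_render(style, flame_ref, frontal=True)
        input_frame   = next3d_render(style, flame_input, frontal=False)  # pose handling is assumed
        frontal_frame = next3d_render(style, flame_input, frontal=True)

        canonical = lp3d_frozen(reference_img)  # canonical triplane from the reference
        raw       = lp3d_frozen(input_frame)    # raw per-frame triplane (possibly distorted)
        pseudo_gt = lp3d_frozen(frontal_frame)  # frontal triplane used as pseudo-ground-truth

    # Supervise both the undistorted and the fused triplanes against the pseudo-ground-truth.
    undistorted = fuser.undistorter(raw, canonical)
    fused       = fuser.merger(undistorted, canonical)
    loss = F.l1_loss(undistorted, pseudo_gt) + F.l1_loss(fused, pseudo_gt)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```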
This project is built on top of LP3D.
@inproceedings{Trevithick2023,
  author    = {Alex Trevithick and Matthew Chan and Towaki Takikawa and Umar Iqbal and Shalini De Mello and Manmohan Chandraker and Ravi Ramamoorthi and Koki Nagano},
  title     = {Rendering Every Pixel for High-Fidelity Geometry in 3D GANs},
  booktitle = {arXiv},
  year      = {2023}
}

@inproceedings{stengel2023,
  author    = {Michael Stengel and Koki Nagano and Chao Liu and Matthew Chan and Alex Trevithick and Shalini De Mello and Jonghyun Kim and David Luebke},
  title     = {AI-Mediated 3D Video Conferencing},
  booktitle = {ACM SIGGRAPH Emerging Technologies},
  year      = {2023}
}