Instant Expressive Gaussian Head Avatar via 3D-Aware Expression Distillation

TL;DR: We learn an ⚡ instant encoder (20ms) that lifts images into expressive and real-time (>100fps) animatable 3D avatars via distillation from a 2D diffusion model.

Abstract

Portrait animation has witnessed tremendous quality improvements thanks to recent advances in video diffusion models. However, these 2D methods often compromise 3D consistency and speed, limiting their applicability in real-world scenarios such as digital twins or telepresence. In contrast, 3D-aware feed-forward facial animation methods, which are built upon explicit 3D representations such as neural radiance fields or Gaussian splatting, ensure 3D consistency and achieve faster inference, but capture less expression detail. In this paper, we aim to combine their strengths by distilling knowledge from a 2D diffusion-based method into a feed-forward encoder, which instantly converts a single in-the-wild image into a 3D-consistent, fast, yet expressive animatable representation. Our animation representation is decoupled from the face's 3D representation and learns motion implicitly from data, eliminating the dependency on pre-defined parametric models that often constrain animation capabilities. Unlike previous computationally intensive global fusion mechanisms (e.g., multiple attention layers) for fusing 3D structural and animation information, our design employs an efficient, lightweight local fusion strategy to achieve high animation expressivity. As a result, our method runs at 107.31 FPS for animation and pose control while achieving animation quality comparable to the state of the art, surpassing alternative designs that trade speed for quality or vice versa.

Results

Visualization of Quantitative Comparison with State-of-the-Art 3D-aware Methods and 2D Methods

We provide a visualization of our quantitative comparisons against state-of-the-art methods in terms of 3D inconsistency (MEt3R ↓), expression transfer inaccuracy (AED ↓), and animation speed (FPS ↑, visualized as circle size). We compare against 7 state-of-the-art baselines using cross-reenactment: X-NeMo [1], HunyuanPortrait [2], Live-Portrait [3], GAGAvatar [4], Portrait4D-v2 [5], VOODOO-XP [6], and InvertAvatar [7]. 2D methods tend to appear upper-left (better expression transfer, worse 3D consistency), while 3D methods tend to appear lower-right (worse expression transfer, better 3D consistency). Our method is 3–4 orders of magnitude faster than diffusion-based models (e.g., X-NeMo, HunyuanPortrait) while simultaneously achieving better 3D consistency and expression transfer accuracy.
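To make the speed figures easier to read, the following minimal sketch converts a frame rate into a per-frame time and expresses a speed ratio in orders of magnitude. The 107.31 FPS value is taken from the abstract; the baseline frame rate of 0.05 FPS is a hypothetical placeholder chosen only for illustration and is not a measured value for any method listed here.

```python
import math


def frame_time_ms(fps: float) -> float:
    """Return the per-frame time in milliseconds for a given frame rate."""
    return 1000.0 / fps


def orders_of_magnitude(fast_fps: float, slow_fps: float) -> float:
    """Return log10 of the speed ratio between two frame rates."""
    return math.log10(fast_fps / slow_fps)


OURS_FPS = 107.31             # reported in the abstract
HYPOTHETICAL_SLOW_FPS = 0.05  # assumption for illustration, not a measurement

print(f"{frame_time_ms(OURS_FPS):.2f} ms per frame")
print(f"{orders_of_magnitude(OURS_FPS, HYPOTHETICAL_SLOW_FPS):.2f} orders of magnitude")
```

Under this reading, a gap of 3–4 orders of magnitude corresponds to a speed ratio between roughly 1,000x and 10,000x.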

Comparison with other 3D-aware Methods

Note: our FPS measurement includes all necessary time costs, such as preprocessing and motion encoding, and therefore indicates potential performance in production environments.
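As a rough illustration of what an end-to-end measurement of this kind could look like, the sketch below times every stage of a per-frame pipeline together rather than timing rendering alone. It is not the authors' benchmarking code: the function `end_to_end_fps` and the stand-in stages `preprocess`, `encode_motion`, and `render` are hypothetical names, and the sleeps merely simulate work.

```python
import time
from typing import Any, Callable, Sequence


def end_to_end_fps(stages: Sequence[Callable[[Any], Any]],
                   frames: Sequence[Any],
                   warmup: int = 2) -> float:
    """Average frames per second when every stage runs for every frame."""
    def run(frame: Any) -> Any:
        out = frame
        for stage in stages:
            out = stage(out)
        return out

    # Warm-up iterations are excluded from the timed region.
    for frame in frames[:warmup]:
        run(frame)

    start = time.perf_counter()
    for frame in frames:
        run(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed


# Hypothetical stand-in stages; real ones would run the actual models.
def preprocess(frame):
    time.sleep(0.001)
    return frame


def encode_motion(frame):
    time.sleep(0.001)
    return frame


def render(frame):
    time.sleep(0.001)
    return frame


driving_frames = list(range(20))
print(f"render only: {end_to_end_fps([render], driving_frames):.1f} FPS")
print(f"end to end:  {end_to_end_fps([preprocess, encode_motion, render], driving_frames):.1f} FPS")
```

Timing only the last stage would overstate the achievable frame rate, which is why the end-to-end number is the more conservative one. On a GPU, the device would also need to be synchronized before each clock read so that asynchronous work is counted.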

Comparison with other 2D Methods

Note: our FPS measurement includes all necessary time costs, such as preprocessing and motion encoding, and therefore indicates potential performance in production environments.

Citation

@inproceedings{JiangInstant2025,
    author = {Kaiwen Jiang and Xueting Li and Seonwook Park and Ravi Ramamoorthi and Shalini De Mello and Koki Nagano},
    title = {Instant Expressive Gaussian Head Avatar via 3D-Aware Expression Distillation},
    booktitle = {arXiv},
    year = {2025}
}

Acknowledgments

We thank David Luebke, Michael Stengel, Yeongho Seol, Simon Yuen, Marcel Bühler, and Arash Vahdat for feedback on drafts and early discussions. We also thank Alex Trevithick and Tianye Li for proofreading. This research was funded in part by the Ronald L. Graham Chair and the UC San Diego Center for Visual Computing. This website is based on the WYSIWYG website template.

References

[1] Xiaochen Zhao, Hongyi Xu, Guoxian Song, You Xie, Chenxu Zhang, Xiu Li, Linjie Luo, Jinli Suo, and Yebin Liu. X-NeMo: Expressive neural motion reenactment via disentangled latent attention. ICLR, 2025.

[2] Zunnan Xu, Zhentao Yu, Zixiang Zhou, Jun Zhou, Xiaoyu Jin, Fa-Ting Hong, Xiaozhong Ji, Junwei Zhu, Chengfei Cai, Shiyu Tang, et al. HunyuanPortrait: Implicit condition control for enhanced portrait animation. CVPR, 2025.

[3] Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. Live-Portrait: Efficient portrait animation with stitching and retargeting control. arXiv, 2024.

[4] Xuangeng Chu and Tatsuya Harada. Generalizable and animatable Gaussian head avatar. NeurIPS, 2024.

[5] Yu Deng, Duomin Wang, and Baoyuan Wang. Portrait4D-v2: Pseudo multi-view data creates better 4D head synthesizer. ECCV, 2024.

[6] Phong Tran, Egor Zakharov, Long-Nhat Ho, Liwen Hu, Adilbek Karmanov, Aviral Agarwal, McLean Goldwhite, Ariana Bermudez Venegas, Anh Tuan Tran, and Hao Li. VOODOO XP: Expressive one-shot head reenactment for VR telepresence. SIGGRAPH Asia, 2024.

[7] Xiaochen Zhao, Jingxiang Sun, Lizhen Wang, Jinli Suo, and Yebin Liu. InvertAvatar: Incremental GAN inversion for generalized head avatars. SIGGRAPH, 2024.

[8] Alex Trevithick, Matthew Chan, Michael Stengel, Eric R. Chan, Chao Liu, Zhiding Yu, Sameh Khamis, Manmohan Chandraker, Ravi Ramamoorthi, and Koki Nagano. Real-time radiance fields for single-image portrait view synthesis. ACM Transactions on Graphics (SIGGRAPH), 2023.