Instant Expressive Gaussian Head Avatars at Over 100 FPS

ECCV 2026

TL;DR: We learn an ⚡ instant encoder (20ms) that lifts images into expressive and real-time (>100fps) animatable 3D avatars via distillation from a 2D diffusion model.

Live Demo

Abstract

Portrait animation has witnessed tremendous quality improvements thanks to recent advances in video diffusion models. However, these 2D methods often compromise 3D consistency and speed, limiting their applicability in real-world scenarios, such as digital twins or telepresence. In contrast, 3D-aware feed-forward facial animation methods -- built upon 3D representations, such as neural radiance fields or Gaussian splatting -- ensure 3D consistency and achieve faster inference speed, but come with inferior expression details. In this paper, we address this portrait animation trilemma (speed, 3D consistency, and expressiveness) and propose a pipeline that instantly converts an in-the-wild single image into a 3D-consistent, fast, yet expressive animatable representation via a feed-forward encoder. Unlike previous computationally intensive global fusion mechanisms (e.g., multiple attention layers) for fusing 3D structural and animation information, our design employs an efficient, lightweight local fusion strategy to achieve high animation expressivity. Furthermore, our animation representation is decoupled from the face's 3D representation and learns motion implicitly from data, eliminating the dependency on pre-defined parametric models that often constrain animation capabilities. Our method runs at 107.31 FPS for animation and pose control, a speedup of 3–4 orders of magnitude over the state of the art, while achieving comparable animation quality, thus surpassing alternative designs that trade speed for quality or vice versa.
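To make the speed figures above concrete, the short sketch below converts the reported 107.31 FPS into a per-frame latency and works out what a 3 to 4 order-of-magnitude gap would imply. This is illustrative arithmetic on the stated numbers only; the implied baseline values are not measurements reported on this page, and the helper names are hypothetical.

```python
def frame_time_ms(fps: float) -> float:
    """Per-frame latency in milliseconds for a given frame rate."""
    return 1000.0 / fps


def slowed_fps(fps: float, orders_of_magnitude: int) -> float:
    """Frame rate of a method that is 10**orders_of_magnitude times slower."""
    return fps / (10 ** orders_of_magnitude)


OURS_FPS = 107.31  # animation and pose-control speed stated in the abstract

# Latency budget per frame at the reported frame rate.
ours_ms = frame_time_ms(OURS_FPS)

# Implied (not measured) frame rates for a method 4 and 3 orders slower.
implied_fps_low = slowed_fps(OURS_FPS, 4)
implied_fps_high = slowed_fps(OURS_FPS, 3)

# The same gap expressed as seconds per frame.
implied_s_fast = frame_time_ms(implied_fps_high) / 1000.0
implied_s_slow = frame_time_ms(implied_fps_low) / 1000.0

print(f"ours: {ours_ms:.2f} ms per frame")
print(f"implied slower range: {implied_fps_low:.4f} to {implied_fps_high:.4f} FPS")
print(f"i.e. roughly {implied_s_fast:.1f} to {implied_s_slow:.1f} s per frame")
```

In other words, about 9.3 ms per frame on one side versus a range of seconds to tens of seconds per frame on the other, under the assumption that the stated gap holds exactly.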

Results

Visualization of Quantitative Comparison with state-of-the-art 3D-aware Methods and 2D Methods

We visualize the quantitative comparison with other baselines [1-7] under cross-reenactment in terms of 3D inconsistency (MEt3R ↓), expression transfer inaccuracy (AED ↓), and animation speed (FPS ↑, visualized as circle size). 2D methods tend to appear upper-left (better expression transfer, worse 3D consistency), while 3D methods tend to appear lower-right (worse expression transfer, better 3D consistency). Our method is 3–4 orders of magnitude faster than diffusion-based models [1,2] while simultaneously achieving better 3D consistency and expression transfer accuracy.

Comparison with other 3D-aware Methods

Our FPS measurement reflects the real speed in a production environment: it accounts for all necessary time costs, including preprocessing and motion encoding.
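As a rough illustration of what an end-to-end measurement of this kind involves, the sketch below times a whole per-frame pipeline rather than the renderer alone. It is a minimal sketch and not the authors' benchmarking code: `measure_end_to_end_fps` and the three placeholder stages are hypothetical names, and the stage bodies are stand-ins.

```python
import time
from typing import Any, Callable, Iterable, Sequence

Stage = Callable[[Any], Any]


def measure_end_to_end_fps(
    stages: Sequence[Stage], frames: Iterable[Any], warmup: int = 2
) -> float:
    """Time every stage for every frame and return frames per second.

    The first `warmup` frames are run but excluded from timing so that
    one-off costs (caches, lazy initialization) do not skew the result.
    """
    frame_list = list(frames)

    def run_pipeline(frame: Any) -> Any:
        out = frame
        for stage in stages:
            out = stage(out)
        return out

    for frame in frame_list[:warmup]:
        run_pipeline(frame)

    timed = frame_list[warmup:]
    if not timed:
        raise ValueError("need more frames than warmup iterations")

    start = time.perf_counter()
    for frame in timed:
        run_pipeline(frame)
    elapsed = time.perf_counter() - start
    return len(timed) / elapsed


# Hypothetical placeholder stages; a real pipeline would do actual work here.
def preprocess(frame: Any) -> Any:
    return frame


def encode_motion(frame: Any) -> Any:
    time.sleep(0.001)  # stand-in for real computation
    return frame


def render(frame: Any) -> Any:
    return frame


fps = measure_end_to_end_fps([preprocess, encode_motion, render], range(10))
print(f"end-to-end: {fps:.1f} FPS")
```

On a GPU, work is typically queued asynchronously, so a faithful measurement would also need to wait for the device to finish before reading the clock.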

Comparison with other 2D Methods

Our FPS measurement reflects the real speed in a production environment: it accounts for all necessary time costs, including preprocessing and motion encoding.

Citation

@inproceedings{JiangInstant2025,
    author = {Kaiwen Jiang and Seonwook Park and Xueting Li and Ravi Ramamoorthi and Shalini De Mello and Koki Nagano},
    title = {Instant Expressive Gaussian Head Avatars at Over 100 FPS},
    booktitle = {arXiv},
    year = {2025}
}

Acknowledgments

We thank David Luebke, Michael Stengel, Yeongho Seol, Simon Yuen, Marcel Bühler, and Arash Vahdat for feedback on drafts and early discussions. We also thank Alex Trevithick and Tianye Li for proofreading. This research was also funded in part by the Ronald L. Graham Chair and the UC San Diego Center for Visual Computing. This website is based on the WYSIWYG website template.

References

[1] Xiaochen Zhao, Hongyi Xu, Guoxian Song, You Xie, Chenxu Zhang, Xiu Li, Linjie Luo, Jinli Suo, and Yebin Liu. X-NeMo: Expressive neural motion reenactment via disentangled latent attention. ICLR, 2025.

[2] Zunnan Xu, Zhentao Yu, Zixiang Zhou, Jun Zhou, Xiaoyu Jin, Fa-Ting Hong, Xiaozhong Ji, Junwei Zhu, Chengfei Cai, Shiyu Tang, et al. HunyuanPortrait: Implicit condition control for enhanced portrait animation. CVPR, 2025.

[3] Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. LivePortrait: Efficient portrait animation with stitching and retargeting control. arXiv, 2024.

[4] Xuangeng Chu and Tatsuya Harada. Generalizable and animatable Gaussian head avatar. NeurIPS, 2024.

[5] Yu Deng, Duomin Wang, and Baoyuan Wang. Portrait4D-v2: Pseudo multi-view data creates better 4D head synthesizer. ECCV, 2024.

[6] Phong Tran, Egor Zakharov, Long-Nhat Ho, Liwen Hu, Adilbek Karmanov, Aviral Agarwal, McLean Goldwhite, Ariana Bermudez Venegas, Anh Tuan Tran, and Hao Li. VOODOO XP: Expressive one-shot head reenactment for VR telepresence. SIGGRAPH Asia, 2024.

[7] Xiaochen Zhao, Jingxiang Sun, Lizhen Wang, Jinli Suo, and Yebin Liu. InvertAvatar: Incremental GAN inversion for generalized head avatars. SIGGRAPH, 2024.

[8] Alex Trevithick, Matthew Chan, Michael Stengel, Eric R. Chan, Chao Liu, Zhiding Yu, Sameh Khamis, Manmohan Chandraker, Ravi Ramamoorthi, and Koki Nagano. Real-time radiance fields for single-image portrait view synthesis. ACM Transactions on Graphics (SIGGRAPH), 2023.