Dream, Lift, Animate: From Single Images to Animatable Gaussian Avatars

1ETH Zurich 2NVIDIA 3Chinese University of Hong Kong

(*Work done while at NVIDIA.)

Dream, Lift, Animate (DLA) is a novel framework that reconstructs animatable 3D human avatars from a single image. We first dream novel views, reconstruct pose-space Gaussians, and lift them into the UV space of the SMPL-X body model. A Gaussian Parameter Decoder then enables pose- and view-aware animation. The framework achieves state-of-the-art perceptual quality and photometric accuracy, and supports real-time rendering and intuitive editing without post-processing.

Abstract

We introduce Dream, Lift, Animate (DLA), a novel framework that reconstructs animatable 3D human avatars from a single image. This is achieved by leveraging multi-view generation, 3D Gaussian lifting, and pose-aware UV-space mapping of 3D Gaussians. Given an image, we first dream plausible multi-views using a video diffusion model, capturing rich geometric and appearance details. These views are then lifted into unstructured 3D Gaussians. To enable animation, we propose a transformer-based encoder that models global spatial relationships and projects these Gaussians into a structured latent representation aligned with the UV space of a parametric body model. This latent code is decoded into UV-space Gaussians that can be animated via body-driven deformation and rendered conditioned on pose and viewpoint. By anchoring Gaussians to the UV manifold, our method ensures consistency during animation while preserving fine visual details. DLA enables real-time rendering and intuitive editing without requiring post-processing. Our method outperforms state-of-the-art approaches on the ActorsHQ and 4D-Dress datasets in both perceptual quality and photometric accuracy. By combining the generative strengths of video diffusion models with a pose-aware UV-space Gaussian mapping, DLA bridges the gap between unstructured 3D representations and high-fidelity, animation-ready avatars.
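The sketch below illustrates the encode-decode step described in the abstract: a transformer maps unstructured Gaussians to a latent aligned with the body model's UV space, and a decoder produces pose- and view-conditioned UV-space Gaussian parameters. It is a minimal sketch only; all module names, tensor shapes, and sizes are placeholder assumptions, not the released implementation.

# A minimal, illustrative sketch of the lift-and-decode stage described above.
# Module names, tensor shapes, and sizes are assumptions for illustration only;
# they do not reflect the released implementation.
import torch
import torch.nn as nn

N_GAUSS = 512        # number of unstructured Gaussians lifted from the dreamed views (toy value)
UV_RES = 32          # resolution of the body-model UV map (toy value)
LATENT_DIM = 64      # per-texel latent channels (assumed)
GAUSS_DIM = 14       # position (3) + rotation (4) + scale (3) + opacity (1) + color (3)

class UVEncoder(nn.Module):
    """Transformer encoder: unstructured Gaussians -> UV-aligned latent."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(GAUSS_DIM, LATENT_DIM)
        layer = nn.TransformerEncoderLayer(LATENT_DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # one learned query per UV texel, cross-attended against the Gaussian tokens
        self.uv_queries = nn.Parameter(torch.randn(UV_RES * UV_RES, LATENT_DIM))
        self.cross = nn.MultiheadAttention(LATENT_DIM, num_heads=4, batch_first=True)

    def forward(self, gaussians):                     # (B, N_GAUSS, GAUSS_DIM)
        tokens = self.encoder(self.embed(gaussians))  # global spatial relationships
        queries = self.uv_queries.expand(gaussians.shape[0], -1, -1)
        latent, _ = self.cross(queries, tokens, tokens)
        return latent.reshape(-1, UV_RES, UV_RES, LATENT_DIM)

class GaussianParamDecoder(nn.Module):
    """Decodes UV-space Gaussian parameters, conditioned on pose and viewpoint."""
    def __init__(self, cond_dim=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(LATENT_DIM + cond_dim, 128), nn.ReLU(),
            nn.Linear(128, GAUSS_DIM),
        )

    def forward(self, uv_latent, cond):               # cond: per-texel pose/view features
        return self.mlp(torch.cat([uv_latent, cond], dim=-1))

# Toy forward pass with random stand-ins for the lifted Gaussians and conditioning.
unstructured = torch.randn(1, N_GAUSS, GAUSS_DIM)
pose_view_cond = torch.randn(1, UV_RES, UV_RES, 16)
uv_latent = UVEncoder()(unstructured)
uv_gaussians = GaussianParamDecoder()(uv_latent, pose_view_cond)
print(uv_gaussians.shape)  # torch.Size([1, 32, 32, 14])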

Method




Gaussian Avatar from a Single Image




In-the-wild Results

We reconstruct and animate inputs from the SHHQ dataset.




Pose-dependent Effects

Our Gaussian Parameter Decoder models pose-dependent effects because it is conditioned on surface normals and relative vertex positions. The top row visualizes the pose-dependent effects for a frozen pose. Note the changes on the reflective surface and the subtle geometric correctives.
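A minimal sketch of how such per-texel conditioning could be assembled is shown below. The exact inputs are assumptions: here the "relative" vertex positions are taken with respect to the canonical pose, and both signals are assumed to be rasterized into the UV map before being concatenated for the decoder.

# Illustrative sketch of the pose conditioning described above. How the relative
# vertex positions are defined and rasterized to UV is an assumption; here they
# are taken relative to the canonical pose.
import torch

def pose_condition(posed_normals_uv, posed_verts_uv, canonical_verts_uv):
    """posed_normals_uv:    (H, W, 3) body normals in the driving pose, rasterized to UV
       posed_verts_uv:      (H, W, 3) posed vertex positions, rasterized to UV
       canonical_verts_uv:  (H, W, 3) canonical-pose vertex positions, rasterized to UV"""
    relative = posed_verts_uv - canonical_verts_uv          # pose-dependent displacement
    return torch.cat([posed_normals_uv, relative], dim=-1)  # (H, W, 6) decoder conditioning

cond = pose_condition(torch.randn(32, 32, 3), torch.randn(32, 32, 3), torch.randn(32, 32, 3))
print(cond.shape)  # torch.Size([32, 32, 6])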




View-dependent Effects

Conditioning on the camera parameters via Plücker rays enables view-dependent effects. The visuals rotate the camera around the avatar. Note how the reflections and details change with the viewpoint. The model learns to allocate more resources to the visible parts of the avatar.
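The sketch below shows the standard Plücker-ray parameterization of a camera (per-pixel ray direction and moment). The resolution, intrinsics, and normalization are assumptions for illustration, not the exact values used by DLA.

# A small sketch of Plücker-ray conditioning (standard formulation; the exact
# resolution and normalization used by DLA are not specified on this page).
import torch

def plucker_rays(K, R, t, H, W):
    """Per-pixel Plücker coordinates (d, o x d) for a pinhole camera.
    K: (3, 3) intrinsics, R: (3, 3) world-to-camera rotation, t: (3,) translation."""
    o = -R.T @ t                                                # camera center in world space
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32) + 0.5,
                            torch.arange(W, dtype=torch.float32) + 0.5, indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)    # homogeneous pixel coords
    dirs = pix @ torch.linalg.inv(K).T @ R                      # ray directions in world space
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    moment = torch.linalg.cross(o.expand_as(dirs), dirs)        # o x d
    return torch.cat([dirs, moment], dim=-1)                    # (H, W, 6) per-pixel embedding

K = torch.tensor([[500.0, 0.0, 32.0], [0.0, 500.0, 32.0], [0.0, 0.0, 1.0]])
rays = plucker_rays(K, torch.eye(3), torch.tensor([0.0, 0.0, 2.0]), H=64, W=64)
print(rays.shape)  # torch.Size([64, 64, 6])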

Applications

Although we do not aim to train a generative model, we observe some emergent capabilities. The avatar latent code can be edited and interpolated.

Interpolation

The video interpolates between training subjects.

Editing

We replace the face and shoes of the subject on the left with items from the identity on the right.
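A minimal sketch of both latent-space operations is shown below, assuming a UV-aligned latent of shape (H, W, C). The region mask and its texel coordinates are hypothetical; in practice they would follow the body model's UV layout.

# Illustrative sketch of latent interpolation and region editing, assuming a
# UV-aligned latent of shape (H, W, C). The mask and texel coordinates are
# hypothetical placeholders.
import torch

def interpolate(latent_a, latent_b, alpha):
    """Linearly blend the UV-space latent codes of two subjects."""
    return (1.0 - alpha) * latent_a + alpha * latent_b

def edit(latent_src, latent_ref, region_mask):
    """Replace a UV region (e.g. face or shoes) of the source with the reference."""
    return torch.where(region_mask[..., None].bool(), latent_ref, latent_src)

H = W = 32; C = 64
a, b = torch.randn(H, W, C), torch.randn(H, W, C)
face_mask = torch.zeros(H, W)
face_mask[:8, 12:20] = 1.0          # hypothetical face texels in the UV layout
blended = interpolate(a, b, alpha=0.5)
edited = edit(a, b, face_mask)
print(blended.shape, edited.shape)  # torch.Size([32, 32, 64]) torch.Size([32, 32, 64])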




Comparisons

Compared with IDOL, the current state of the art, DLA produces higher-quality, more photorealistic avatars.

We also compare with DreamGaussian, SiTH, and SIFU in the paper.



ActorsHQ Dataset

4D-Dress Dataset




Acknowledgements

We thank Shalini De Mello and Jenny Schmalfuss for their valuable inputs, and Jan Kautz for hosting Marcel's internship.




BibTeX

@misc{buehler2025dla,
      title={Dream, Lift, Animate: From Single Images to Animatable Gaussian Avatars},
      author={Marcel C. Buehler and Ye Yuan and Xueting Li and Yangyi Huang and Koki Nagano and Umar Iqbal},
      year={2025},
      eprint={XXXX.XXXXX},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}