Dream, Lift, Animate: From Single Images to Animatable Gaussian Avatars

1ETH Zurich 2NVIDIA 3Chinese University of Hong Kong

(*Work done while at NVIDIA.)

Dream, Lift, Animate (DLA) is a novel framework that reconstructs animatable 3D human avatars from a single image. We first dream novel views, reconstruct pose-space Gaussians, and lift them into the UV space of the SMPL-X body model. A Gaussian Parameter Decoder then enables pose- and view-aware animation. The framework achieves state-of-the-art perceptual quality and photometric accuracy, and supports real-time rendering and intuitive editing without post-processing.

Abstract

We introduce Dream, Lift, Animate (DLA), a novel framework that reconstructs animatable 3D human avatars from a single image. This is achieved by leveraging multi-view generation, 3D Gaussian lifting, and pose-aware UV-space mapping of 3D Gaussians. Given an image, we first dream plausible multi-views using a video diffusion model, capturing rich geometric and appearance details. These views are then lifted into unstructured 3D Gaussians. To enable animation, we propose a transformer-based encoder that models global spatial relationships and projects these Gaussians into a structured latent representation aligned with the UV space of a parametric body model. This latent code is decoded into UV-space Gaussians that can be animated via body-driven deformation and rendered conditioned on pose and viewpoint. By anchoring Gaussians to the UV manifold, our method ensures consistency during animation while preserving fine visual details. DLA enables real-time rendering and intuitive editing without requiring post-processing. Our method outperforms state-of-the-art approaches on the ActorsHQ and 4D-Dress datasets in both perceptual quality and photometric accuracy. By combining the generative strengths of video diffusion models with a pose-aware UV-space Gaussian mapping, DLA bridges the gap between unstructured 3D representations and high-fidelity, animation-ready avatars.
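The sketch below illustrates the encode-decode step described in the abstract: a transformer maps unstructured Gaussians to a latent aligned with the body model's UV space, and a decoder produces pose- and view-conditioned UV-space Gaussian parameters. It is a minimal sketch only; all module names, tensor shapes, and sizes are placeholder assumptions, not the released implementation.

# A minimal, illustrative sketch of the lift-and-decode stage described above.
# Module names, tensor shapes, and sizes are assumptions for illustration only;
# they do not reflect the released implementation.
import torch
import torch.nn as nn

N_GAUSS = 512        # number of unstructured Gaussians lifted from the dreamed views (toy value)
UV_RES = 32          # resolution of the body-model UV map (toy value)
LATENT_DIM = 64      # per-texel latent channels (assumed)
GAUSS_DIM = 14       # position (3) + rotation (4) + scale (3) + opacity (1) + color (3)

class UVEncoder(nn.Module):
    """Transformer encoder: unstructured Gaussians -> UV-aligned latent."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(GAUSS_DIM, LATENT_DIM)
        layer = nn.TransformerEncoderLayer(LATENT_DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # one learned query per UV texel, cross-attended against the Gaussian tokens
        self.uv_queries = nn.Parameter(torch.randn(UV_RES * UV_RES, LATENT_DIM))
        self.cross = nn.MultiheadAttention(LATENT_DIM, num_heads=4, batch_first=True)

    def forward(self, gaussians):                     # (B, N_GAUSS, GAUSS_DIM)
        tokens = self.encoder(self.embed(gaussians))  # global spatial relationships
        queries = self.uv_queries.expand(gaussians.shape[0], -1, -1)
        latent, _ = self.cross(queries, tokens, tokens)
        return latent.reshape(-1, UV_RES, UV_RES, LATENT_DIM)

class GaussianParamDecoder(nn.Module):
    """Decodes UV-space Gaussian parameters, conditioned on pose and viewpoint."""
    def __init__(self, cond_dim=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(LATENT_DIM + cond_dim, 128), nn.ReLU(),
            nn.Linear(128, GAUSS_DIM),
        )

    def forward(self, uv_latent, cond):               # cond: per-texel pose/view features
        return self.mlp(torch.cat([uv_latent, cond], dim=-1))

# Toy forward pass with random stand-ins for the lifted Gaussians and conditioning.
unstructured = torch.randn(1, N_GAUSS, GAUSS_DIM)
pose_view_cond = torch.randn(1, UV_RES, UV_RES, 16)
uv_latent = UVEncoder()(unstructured)
uv_gaussians = GaussianParamDecoder()(uv_latent, pose_view_cond)
print(uv_gaussians.shape)  # torch.Size([1, 32, 32, 14])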

Method




Gaussian Avatar from a Single Image




In-the-wild Results

We reconstruct and animate inputs from the SHHQ dataset.




Pose-dependent Effects

Our Gaussian Parameter Decoder models pose-dependent effects because it is conditioned on surface normals and relative vertex positions. The top row visualizes the pose-dependent effects for a frozen pose. Note the changes on the reflective surface and the subtle geometric correctives.
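A minimal sketch of how such per-texel conditioning could be assembled is shown below. The exact inputs are assumptions: here the "relative" vertex positions are taken with respect to the canonical pose, and both signals are assumed to be rasterized into the UV map before being concatenated for the decoder.

# Illustrative sketch of the pose conditioning described above. How the relative
# vertex positions are defined and rasterized to UV is an assumption; here they
# are taken relative to the canonical pose.
import torch

def pose_condition(posed_normals_uv, posed_verts_uv, canonical_verts_uv):
    """posed_normals_uv:    (H, W, 3) body normals in the driving pose, rasterized to UV
       posed_verts_uv:      (H, W, 3) posed vertex positions, rasterized to UV
       canonical_verts_uv:  (H, W, 3) canonical-pose vertex positions, rasterized to UV"""
    relative = posed_verts_uv - canonical_verts_uv          # pose-dependent displacement
    return torch.cat([posed_normals_uv, relative], dim=-1)  # (H, W, 6) decoder conditioning

cond = pose_condition(torch.randn(32, 32, 3), torch.randn(32, 32, 3), torch.randn(32, 32, 3))
print(cond.shape)  # torch.Size([32, 32, 6])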




View-dependent Effects

Conditioning on the camera parameters via Plücker rays enables view-dependent effects. The visuals rotate the camera around the avatar. Note how the reflections and details change with the viewpoint. The model learns to allocate more resources to the visible parts of the avatar.
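The sketch below shows the standard Plücker-ray parameterization of a camera (per-pixel ray direction and moment). The resolution, intrinsics, and normalization are assumptions for illustration, not the exact values used by DLA.

# A small sketch of Plücker-ray conditioning (standard formulation; the exact
# resolution and normalization used by DLA are not specified on this page).
import torch

def plucker_rays(K, R, t, H, W):
    """Per-pixel Plücker coordinates (d, o x d) for a pinhole camera.
    K: (3, 3) intrinsics, R: (3, 3) world-to-camera rotation, t: (3,) translation."""
    o = -R.T @ t                                                # camera center in world space
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32) + 0.5,
                            torch.arange(W, dtype=torch.float32) + 0.5, indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)    # homogeneous pixel coords
    dirs = pix @ torch.linalg.inv(K).T @ R                      # ray directions in world space
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    moment = torch.linalg.cross(o.expand_as(dirs), dirs)        # o x d
    return torch.cat([dirs, moment], dim=-1)                    # (H, W, 6) per-pixel embedding

K = torch.tensor([[500.0, 0.0, 32.0], [0.0, 500.0, 32.0], [0.0, 0.0, 1.0]])
rays = plucker_rays(K, torch.eye(3), torch.tensor([0.0, 0.0, 2.0]), H=64, W=64)
print(rays.shape)  # torch.Size([64, 64, 6])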

Applications

Although we do not aim to train a generative model, we observe some emergent capabilities. The avatar latent code can be edited and interpolated.

Interpolation

The video interpolates between training subjects.

Editing

We replace the face and shoes of the subject on the left with items from the identity on the right.
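A minimal sketch of both latent-space operations is shown below, assuming a UV-aligned latent of shape (H, W, C). The region mask and its texel coordinates are hypothetical; in practice they would follow the body model's UV layout.

# Illustrative sketch of latent interpolation and region editing, assuming a
# UV-aligned latent of shape (H, W, C). The mask and texel coordinates are
# hypothetical placeholders.
import torch

def interpolate(latent_a, latent_b, alpha):
    """Linearly blend the UV-space latent codes of two subjects."""
    return (1.0 - alpha) * latent_a + alpha * latent_b

def edit(latent_src, latent_ref, region_mask):
    """Replace a UV region (e.g. face or shoes) of the source with the reference."""
    return torch.where(region_mask[..., None].bool(), latent_ref, latent_src)

H = W = 32; C = 64
a, b = torch.randn(H, W, C), torch.randn(H, W, C)
face_mask = torch.zeros(H, W)
face_mask[:8, 12:20] = 1.0          # hypothetical face texels in the UV layout
blended = interpolate(a, b, alpha=0.5)
edited = edit(a, b, face_mask)
print(blended.shape, edited.shape)  # torch.Size([32, 32, 64]) torch.Size([32, 32, 64])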




Comparisons

Compared with IDOL, the current state of the art, DLA produces higher-quality, more photorealistic avatars.

We also compare with DreamGaussian, SiTH, and SIFU in the paper.



ActorsHQ Dataset

4D-Dress Dataset




Acknowledgements

We thank Shalini De Mello and Jenny Schmalfuss for their valuable inputs, and Jan Kautz for hosting Marcel's internship.




BibTeX

@misc{buehler2025dla,
      title={Dream, Lift, Animate: From Single Images to Animatable Gaussian Avatars},
      author={Marcel C. Buehler and Ye Yuan and Xueting Li and Yangyi Huang and Koki Nagano and Umar Iqbal},
      year={2025},
      eprint={XXXX.XXXXX},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}