Speech-driven animation of a portrait, with control over the output pose, emotions, and expression intensities

Teaser: portrait animations with generated or transferred head poses and controlled emotions (neutral, happy, surprise, sad, fear).

Abstract

We present SPACE, a method for generating high-resolution, expressive videos with realistic head pose, using just speech and a single image. It takes a multi-stage approach, combining the controllability of facial landmarks with the high-quality synthesis power of a pretrained face generator. SPACE also allows control over emotions and their intensities. Our method outperforms prior methods on objective metrics for image quality and facial motion, and is strongly preferred by users in pairwise comparisons.

Overview

Speech-driven portrait animation is the task of animating a still image of a face using an arbitrary input speech signal. SPACE lets you animate a photo from speech alone, with unprecedented control over the outputs: head pose, emotion label and intensity, blinking, and eye gaze.

Comparison with Prior Work

Side-by-side comparison videos: input image, PC-AVS, MakeItTalk, Wav2Lip, and SPACE (ours).

Method

SPACE decomposes speech-to-face animation into three stages: 1) Speech2Landmarks (S2L), 2) Landmarks2Latents (L2L), and 3) Video Synthesis. Predicting facial landmarks as an intermediate representation makes it easy to add modifications such as blinking, and we can apply any desired rotation, translation, and scaling to the 3D facial landmarks. Instead of learning to generate a high-quality output image from facial landmarks directly, we use a state-of-the-art pretrained face-vid2vid generator. A sketch of this decomposition is shown below.
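As a rough illustration, here is a minimal PyTorch sketch of the three-stage decomposition. The class names mirror the stage names above, but the architectures, the dimensions (80-d audio features, 68 landmarks, 20 latent keypoints), and the omission of emotion conditioning are illustrative assumptions, not the actual SPACE models.

import torch
import torch.nn as nn

class Speech2Landmarks(nn.Module):
    # S2L: per-frame speech features -> normalized 3D facial landmarks.
    def __init__(self, audio_dim=80, n_landmarks=68):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim, 256, batch_first=True)
        self.head = nn.Linear(256, n_landmarks * 3)

    def forward(self, audio_feats):                 # (B, T, audio_dim)
        h, _ = self.rnn(audio_feats)
        out = self.head(h)                          # (B, T, 68 * 3)
        return out.view(*out.shape[:2], -1, 3)      # (B, T, 68, 3)

class Landmarks2Latents(nn.Module):
    # L2L: posed 3D landmarks -> latent face-vid2vid keypoints.
    def __init__(self, n_landmarks=68, n_keypoints=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_landmarks * 3, 256), nn.ReLU(),
            nn.Linear(256, n_keypoints * 3))

    def forward(self, landmarks):                   # (B, T, 68, 3)
        return self.net(landmarks.flatten(2))       # (B, T, 20 * 3)

audio = torch.randn(1, 50, 80)            # 50 frames of audio features
landmarks = Speech2Landmarks()(audio)     # stage 1
# Pose control would transform `landmarks` here (rotation, translation, scaling).
latents = Landmarks2Latents()(landmarks)  # stage 2
# Stage 3: feed `latents` and the source image to the frozen, pretrained
# face-vid2vid generator to render the output frames.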

SPACE method overview

Intermediate Predictions

From a single input image, SPACE produces intermediate predictions (normalized facial landmarks, posed facial landmarks, and latent face-vid2vid keypoints) before the final animated output.

Emotion Control

We condition both the Speech2Landmarks (S2L) and Landmarks2Latents (L2L) models on the emotion using FiLM (feature-wise linear modulation) layers. At inference time, we can provide any desired combination of emotion labels and intensities as input, as sketched below.
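A minimal FiLM layer sketch in PyTorch, assuming a one-hot emotion encoding scaled by intensity and a 256-d feature stream; the label set and dimensions are illustrative assumptions, not the exact SPACE configuration.

import torch
import torch.nn as nn

class FiLM(nn.Module):
    # Feature-wise linear modulation: predict a per-channel scale (gamma)
    # and shift (beta) from the conditioning vector.
    def __init__(self, cond_dim, feat_dim):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * feat_dim)

    def forward(self, features, cond):
        # features: (B, T, feat_dim); cond: (B, cond_dim)
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * features + beta.unsqueeze(1)

# Illustrative emotion set; the intensity scales the one-hot label.
emotions = ["neutral", "happy", "sad", "angry", "surprise", "fear"]
cond = torch.zeros(1, len(emotions))
cond[0, emotions.index("happy")] = 0.5           # 0.5-intensity Happy

film = FiLM(cond_dim=len(emotions), feat_dim=256)
modulated = film(torch.randn(1, 50, 256), cond)  # (1, 50, 256)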

Emotion intensity examples: an input image animated as 0.5 Happy vs. 1.0 Happy, and 0.5 Angry vs. 1.0 Angry.

Eye Control — Blinking and Gaze

By manipulating the intermediate facial landmarks corresponding to the eyes, we can introduce blinking motion, as sketched below. Since face-vid2vid allows control over the eye gaze, we are also able to control the gaze of the output.
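As an illustration of this landmark editing, here is a small numpy sketch that closes the eyes by blending the upper-eyelid points toward the lower-eyelid points. The 68-point index convention and the simple linear blend are assumptions for illustration, not the exact procedure used by SPACE.

import numpy as np

# Upper-lid -> lower-lid point pairs in the common 68-point layout
# (left eye: points 36-41, right eye: 42-47); an assumed convention.
EYELID_PAIRS = [(37, 41), (38, 40),   # left eye
                (43, 47), (44, 46)]   # right eye

def apply_blink(landmarks, closedness):
    # landmarks:  (68, 3) array for one frame.
    # closedness: 0.0 = eyes open, 1.0 = fully closed.
    out = landmarks.copy()
    for upper, lower in EYELID_PAIRS:
        out[upper] = (1.0 - closedness) * landmarks[upper] \
                     + closedness * landmarks[lower]
    return out

# Close and reopen the eyes over a few frames (stand-in landmark data).
frame = np.random.randn(68, 3)
blink = [apply_blink(frame, c) for c in (0.0, 0.5, 1.0, 0.5, 0.0)]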

Examples: an input image animated with added blinking and with a gaze change.

Citation

@inproceedings{gururani2023SPACE,
  title={{SPACE: Speech-driven Portrait Animation with Controllable Expression}},
  author={Gururani, Siddharth and Mallya, Arun and Wang, Ting-Chun and Valle, Rafael and Liu, Ming-Yu},
  booktitle={IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2023}
}