Implicit Warping for Animation with Image Sets

NeurIPS 2022


We present a new implicit warping framework for image animation using sets of source images through the transfer of the motion of a driving video. A single cross-modal attention layer is used to find correspondences between the source images and the driving image, choose the most appropriate features from different source images, and warp the selected features. This is in contrast to the existing methods that use explicit flow-based warping, which is designed for animation using a single source and does not extend well to multiple sources. The pick-and-choose capability of our framework helps it achieve state-of-the-art results on multiple datasets for image animation using both single and multiple source images.

Driving video reconstruction with multiple source images

[Video comparison. Columns: Source images, Driving video, FOMM, fv2v, Ours]

Overview

What exactly is Implicit Warping trying to solve?

A single image often cannot fully describe the subject because of occlusions, limited pose information, and similar gaps. Multiple, diverse source images provide more complete appearance information, such as the color of the eyes or the texture of the background, and reduce how much the image generator has to hallucinate. This makes it possible to generate an output image that is more faithful to the source setting. Consider the two images shown below. Given just the first image, it is impossible to know what is behind the person. Given the second image, we know the real background.

Implicit Warping allows you to pick-and-choose features from multiple images to produce the output

The "Why don't you just use X?" Question

Single-source-based prior works such as FOMM, AA-PCA, and fv2v rely on explicit flow-based warping of the source image, conditioned on the pose of the driving image. Because of this architectural choice, they often have to be modified in ad-hoc ways to take advantage of multiple source images. One scheme is to train an additional pre-processing network that selects the most appropriate source image for the given driving image. This, however, does not allow features from multiple source images to be used at the same time. The other possibility is to warp each source image to the driving pose and then average the now-aligned warped features as the generator input. But as the videos below show, this leads to sub-optimal results because of misaligned warped features and inconsistent predictions across views.
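
To make the warp-and-average scheme concrete, here is a minimal NumPy sketch. It is not the FOMM or fv2v implementation: the function names `warp_nearest` and `warp_and_average`, the nearest-neighbor sampling, and the toy 1x4 feature maps are all assumptions chosen for illustration. It only shows the general idea of backward-warping each source feature map with its own flow field and then averaging.

```python
import numpy as np

def warp_nearest(feat, flow):
    """Backward-warp feat (H, W, C) with a flow field (H, W, 2).

    flow[y, x] = (dy, dx) means output location (y, x) samples the source
    at (y + dy, x + dx). Offsets are rounded to the nearest pixel and
    out-of-range samples are clamped to the border. (Hypothetical helper;
    real systems typically use bilinear sampling.)
    """
    h, w, _ = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(ys + np.rint(flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(xs + np.rint(flow[..., 1]).astype(int), 0, w - 1)
    return feat[src_y, src_x]

def warp_and_average(feats, flows):
    """Warp every source feature map to the driving pose, then average."""
    warped = [warp_nearest(f, fl) for f, fl in zip(feats, flows)]
    return np.mean(warped, axis=0)

# Toy 1x4 single-channel feature maps containing a sharp edge.
target = np.array([[0.0, 0.0, 1.0, 1.0]])[..., None]  # features in the driving pose
src_a = target.copy()                                  # source already aligned
src_b = np.array([[0.0, 0.0, 0.0, 1.0]])[..., None]    # same edge, one pixel to the right

no_motion = np.zeros((1, 4, 2))
exact_flow = no_motion.copy()
exact_flow[..., 1] = 1.0  # sample one pixel to the right

aligned = warp_and_average([src_a, src_b], [no_motion, exact_flow])
misaligned = warp_and_average([src_a, src_b], [no_motion, no_motion])
print(aligned[0, :, 0])     # both flows exact: the edge is preserved
print(misaligned[0, :, 0])  # one flow off by a pixel: the edge is smeared
```

In this toy setup, a one-pixel error in a single flow field is enough to blur the edge after averaging, which is the kind of misalignment the paragraph above describes.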

[Video comparison. Panels: Source images, Driving video, Ours, FOMM, Ours (single source)]

[Video comparison. Panels: Source images, Driving video, FOMM, fv2v, Ours]

Comparing the results from different methods, we can immediately notice a few issues:

Methods developed for a single source image, such as fv2v and FOMM, are unable to exactly align features warped from different source images.
Furthermore, simple averaging of warped features produces artifacts in the output, because averaging does not distinguish between ground truth features and hallucinated features.

In the last column, we present results from implicit warping, which solves both issues raised above. The cross-modal attention layer is able to select the appropriate features and produce an output free of artifacts.
Additional results and visualizations are available at this link.
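
The pick-and-choose behavior of attention over several source images can be illustrated with a small NumPy sketch. This is not the layer from the paper: it assumes standard scaled dot-product attention, and the names `cross_attention` and `attend_over_sources` are hypothetical. How queries, keys, and values are actually produced is not described on this page, so the vectors below are hand-picked toy values.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention (assumed form, for illustration).

    queries: (Nq, d)  one per output (driving) location
    keys:    (Nk, d)  one per source location, pooled over all source images
    values:  (Nk, c)  source features to be carried to the output
    Returns the (Nq, c) output features and the (Nq, Nk) attention weights.
    """
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values, weights

def attend_over_sources(queries, keys_per_source, values_per_source):
    """Concatenate keys/values from every source so one softmax spans them all."""
    keys = np.concatenate(keys_per_source, axis=0)
    values = np.concatenate(values_per_source, axis=0)
    return cross_attention(queries, keys, values)

# One output location and two sources with one location each (toy values).
query = np.array([[4.0, 0.0]])
keys_1, vals_1 = np.array([[4.0, 0.0]]), np.array([[1.0]])  # good match, observed feature
keys_2, vals_2 = np.array([[0.0, 4.0]]), np.array([[0.2]])  # poor match, guessed feature

out, weights = attend_over_sources(query, [keys_1, keys_2], [vals_1, vals_2])
naive = np.mean([vals_1, vals_2])  # plain averaging mixes both features equally

print(weights)  # nearly all weight goes to the well-matched source
print(out, naive)
```

Because the softmax runs over the locations of all source images at once, a well-matched location in one image can dominate the output, while plain averaging gives the guessed feature the same weight as the observed one.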

Citation

@inproceedings{mallya2022implicit,
    title={{Implicit Warping for Animation with Image Sets}},
    author={Arun Mallya and Ting-Chun Wang and Ming-Yu Liu},
    booktitle={NeurIPS},
    year={2022}
}