Text-to-video models have made significant strides in generating short video clips from textual descriptions. Yet a key challenge remains: generating several video shots of the same characters, preserving their identity without compromising video quality, dynamics, and responsiveness to text prompts. We present Video Storyboarding, a training-free method that enables pretrained text-to-video models to generate multiple shots with consistent characters by sharing features between them. Our key insight is that self-attention query features (Q) encode both motion and identity. When features are shared, this creates a hard-to-avoid trade-off between preserving character identity and keeping videos dynamic. To address this issue, we introduce a novel query injection strategy that balances identity preservation and natural motion retention. This approach improves upon naive consistency techniques applied to videos, which often struggle to maintain this delicate equilibrium. Our experiments demonstrate significant improvements in character consistency across scenes while maintaining high-quality motion and text alignment. These results offer insights into critical stages of video generation and the interplay of structure and motion in video diffusion models.
Architecture outline: Our consistent denoising process has two phases: Q Preservation and Q Flow. We first generate and cache video shots using "vanilla" VideoCrafter2. In Q Preservation (T → t_pres), we use Vanilla Q Injection to maintain motion structure by replacing our Q values with the vanilla ones. In Q Flow (t_pres → t_0), we use a flow map from vanilla key frames to guide Q feature injection. This phase maintains character identity by allowing the use of Q features from our consistent denoising process, while the flow map ensures that these identity-preserving features are applied in a way that is consistent with the original motion. Throughout, we employ two complementary techniques: framewise subject-driven self-attention for visual coherence, and refinement feature injection to reinforce character consistency across diverse prompts.
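For concreteness, below is a minimal sketch of how such a two-phase schedule could be wired into a diffusers-style denoising loop. The names `t_pres`, `vanilla_q_cache`, `flow_maps`, and `set_q_injection_mode` are illustrative placeholders for this sketch, not the authors' actual implementation or API.

```python
import torch

def set_q_injection_mode(unet, mode, q_cache=None, flow=None):
    # Hypothetical helper: record the injection mode on the UNet so that
    # patched self-attention layers (not shown) can read it during the forward
    # pass. "vanilla" replaces Q with cached vanilla queries; "flow" keeps the
    # identity-preserving Q but warps it along the vanilla flow map.
    unet._q_injection = {"mode": mode, "q_cache": q_cache, "flow": flow}

@torch.no_grad()
def consistent_denoise(latents, timesteps, unet, scheduler,
                       vanilla_q_cache, flow_maps, t_pres, cond):
    """Two-phase denoising: Q Preservation (T -> t_pres), then Q Flow (t_pres -> t_0)."""
    for t in timesteps:
        if t >= t_pres:
            # Phase 1, Q Preservation: inject the vanilla queries cached from
            # the plain VideoCrafter2 pass to lock in its motion structure.
            set_q_injection_mode(unet, "vanilla", q_cache=vanilla_q_cache[int(t)])
        else:
            # Phase 2, Q Flow: keep the queries of the consistent pass (they
            # carry identity), but apply them along the flow map estimated
            # from the vanilla key frames so that motion stays faithful.
            set_q_injection_mode(unet, "flow", flow=flow_maps)

        noise_pred = unet(latents, t, encoder_hidden_states=cond).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```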
Video Storyboarding (top) demonstrates improved character consistency while maintaining natural motion. VideoCrafter2 shows diverse motion but inconsistent characters, Tokenflow-Encoder struggles with consistency and has color artifacts, ConsiS Im2Vid lacks both consistency and motion fidelity, and VSTAR shows limited prompt adherence despite good identity preservation.
Different Q injection strategies show distinct trade-offs: Video Storyboarding (top row) balances character consistency and natural motion. VideoCrafter2 offers diverse motion but inconsistent characters; Full Q Preservation maintains the vanilla motion yet still loses character consistency; and No Q Intervention keeps characters consistent but compromises motion quality.
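To make this trade-off concrete, the sketch below shows where the different strategies act inside a self-attention layer. The names `strategy`, `q_vanilla`, and `warp_with_flow`, and the tensor shapes, are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F

def attention_with_q_injection(q_ours, k, v, strategy,
                               q_vanilla=None, warp_with_flow=None):
    """Self-attention where the query tensor is chosen per injection strategy.

    q_ours:     queries from the consistent (feature-sharing) pass, (B, H, N, D)
    q_vanilla:  queries cached from the vanilla pass, same shape
    strategy:   "full_q_preservation" | "no_q_intervention" | "q_flow"
    warp_with_flow: callable that warps q_ours along the vanilla flow map
    """
    if strategy == "full_q_preservation":
        # Keeps vanilla motion, but appearance drifts back to the vanilla
        # (inconsistent) character.
        q = q_vanilla
    elif strategy == "no_q_intervention":
        # Keeps the consistent character, but motion degrades because shared
        # features also alter the queries that encode motion.
        q = q_ours
    else:  # "q_flow"
        # Identity-preserving queries, re-aligned to the vanilla motion.
        q = warp_with_flow(q_ours)
    return F.scaled_dot_product_attention(q, k, v)

# Example shapes: batch of 2 frames, 8 heads, 1024 tokens, 64-dim heads.
q = torch.randn(2, 8, 1024, 64)
k, v = torch.randn_like(q), torch.randn_like(q)
out = attention_with_q_injection(q, k, v, strategy="no_q_intervention")
```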
Ablating ConsiStory components in video generation: Video Storyboarding best balances motion quality and identity consistency, exhibiting rich motion together with identity preservation. VideoCrafter2 delivers diverse motion but struggles with consistent characters. ConsiS suffers from poor identity preservation and motion artifacts. ConsiS +Uncond resolves the motion artifacts but weakens motion magnitude and identity. Q ConsiS couples each frame with a single frame in an anchor video, allowing some natural motion with improved identity (see the sketch below).
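The ablations above build on framewise subject-driven self-attention; a minimal sketch of that idea, in the spirit of ConsiStory-style extended attention, follows. It assumes each frame attends to its own tokens plus the subject tokens of the corresponding frame in the other shots; tensor names, shapes, and the masking scheme are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def framewise_subject_attention(q, k, v, subject_masks):
    """Framewise subject-driven self-attention across shots (sketch).

    q, k, v:       (S, F, H, N, D) - S shots, F frames, H heads, N tokens, D dims
    subject_masks: (S, F, N) boolean, True where the subject appears
    """
    S, Fr, H, N, D = q.shape
    out = torch.empty_like(q)
    for f in range(Fr):
        # Keys/values of frame f gathered from all shots.
        k_f = k[:, f].permute(1, 0, 2, 3).reshape(H, S * N, D)
        v_f = v[:, f].permute(1, 0, 2, 3).reshape(H, S * N, D)
        mask_f = subject_masks[:, f].reshape(S * N)
        for s in range(S):
            # Allow attention to the shot's own tokens, plus subject tokens
            # from the same frame index in every other shot.
            own = torch.zeros(S * N, dtype=torch.bool)
            own[s * N:(s + 1) * N] = True
            allowed = (own | mask_f).view(1, 1, S * N).expand(H, N, S * N)
            out[s, f] = F.scaled_dot_product_attention(
                q[s, f], k_f, v_f, attn_mask=allowed)
    return out
```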
Limitations such as reduced video quality and motion artifacts diminish as our approach scales to stronger models. This is demonstrated using T2V-TURBO-V2 (Li et al., NeurIPS 2024), a recent state-of-the-art video model.
The approach also applies to general (superclass) subject prompts, such as "woman" or "rabbit".
User Study: (left) We measure user preferences for set consistency and (right) how well the generated motion matches the text prompt. Our approach achieves the highest set consistency score while maintaining competitive text-motion alignment. Notably, 55% of our generated motions were judged to be of similar or better quality compared to the vanilla model. Error bars are S.E.M.
@article{atzmon2024multi,
  title   = {Multi-Shot Character Consistency for Text-to-Video Generation},
  author  = {Atzmon, Yuval and Gal, Rinon and Tewel, Yoad and Kasten, Yoni and Chechik, Gal},
  journal = {arXiv preprint arXiv:2412.07750},
  year    = {2024},
}