Text-to-video models have made significant strides in generating short video clips from textual descriptions. Yet a key challenge remains: generating several video shots of the same characters, preserving their identity without compromising video quality, dynamics, and responsiveness to text prompts. We present Video Storyboarding, a training-free method that enables pretrained text-to-video models to generate multiple shots with consistent characters by sharing features between them. Our key insight is that self-attention query features (Q) encode both motion and identity. When features are shared, this creates a hard-to-avoid trade-off between preserving character identity and keeping videos dynamic. To address this issue, we introduce a novel query injection strategy that balances identity preservation and natural motion retention. This approach improves upon naive consistency techniques applied to videos, which often struggle to maintain this delicate equilibrium. Our experiments demonstrate significant improvements in character consistency across scenes while maintaining high-quality motion and text alignment. These results offer insights into critical stages of video generation and the interplay of structure and motion in video diffusion models.
Architecture outline: Our consistent denoising process has two phases: Q Preservation and Q Flow. We first generate and cache video shots using "vanilla" VideoCrafter2. In Q Preservation (T → t_pres), we use Vanilla Q Injection to maintain motion structure by replacing our Q values with the vanilla ones. In Q Flow (t_pres → t_0), we use a flow map from vanilla key frames to guide Q feature injection. This phase maintains character identity by allowing the use of Q features from our consistent denoising process, while the flow map ensures that these identity-preserving features are applied in a way that is consistent with the original motion. Throughout, we employ two complementary techniques: framewise subject-driven self-attention for visual coherence, and refinement feature injection to reinforce character consistency across diverse prompts.
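For concreteness, below is a minimal sketch of how such a two-phase schedule could be wired into a diffusers-style denoising loop. The names `t_pres`, `vanilla_q_cache`, `flow_maps`, and `set_q_injection_mode` are illustrative placeholders for this sketch, not the authors' actual implementation or API.

```python
import torch

def set_q_injection_mode(unet, mode, q_cache=None, flow=None):
    # Hypothetical helper: record the injection mode on the UNet so that
    # patched self-attention layers (not shown) can read it during the forward
    # pass. "vanilla" replaces Q with cached vanilla queries; "flow" keeps the
    # identity-preserving Q but warps it along the vanilla flow map.
    unet._q_injection = {"mode": mode, "q_cache": q_cache, "flow": flow}

@torch.no_grad()
def consistent_denoise(latents, timesteps, unet, scheduler,
                       vanilla_q_cache, flow_maps, t_pres, cond):
    """Two-phase denoising: Q Preservation (T -> t_pres), then Q Flow (t_pres -> t_0)."""
    for t in timesteps:
        if t >= t_pres:
            # Phase 1, Q Preservation: inject the vanilla queries cached from
            # the plain VideoCrafter2 pass to lock in its motion structure.
            set_q_injection_mode(unet, "vanilla", q_cache=vanilla_q_cache[int(t)])
        else:
            # Phase 2, Q Flow: keep the queries of the consistent pass (they
            # carry identity), but apply them along the flow map estimated
            # from the vanilla key frames so that motion stays faithful.
            set_q_injection_mode(unet, "flow", flow=flow_maps)

        noise_pred = unet(latents, t, encoder_hidden_states=cond).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```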
Video Storyboarding (top) demonstrates improved character consistency while maintaining natural motion. VideoCrafter2 shows diverse motion but inconsistent characters, Tokenflow-Encoder struggles with consistency and has color artifacts, ConsiS Im2Vid lacks both consistency and motion fidelity, and VSTAR shows limited prompt adherence despite good identity preservation.
Different Q injection strategies show distinct trade-offs: Video Storyboarding (top row) balances character consistency and natural motion. VideoCrafter2 offers diverse motion but inconsistent characters; Full Q Preservation maintains the vanilla motion yet still loses character consistency; and No Q Intervention keeps characters consistent but compromises motion quality.
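To make this trade-off concrete, the sketch below shows where the different strategies act inside a self-attention layer. The names `strategy`, `q_vanilla`, and `warp_with_flow`, and the tensor shapes, are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F

def attention_with_q_injection(q_ours, k, v, strategy,
                               q_vanilla=None, warp_with_flow=None):
    """Self-attention where the query tensor is chosen per injection strategy.

    q_ours:     queries from the consistent (feature-sharing) pass, (B, H, N, D)
    q_vanilla:  queries cached from the vanilla pass, same shape
    strategy:   "full_q_preservation" | "no_q_intervention" | "q_flow"
    warp_with_flow: callable that warps q_ours along the vanilla flow map
    """
    if strategy == "full_q_preservation":
        # Keeps vanilla motion, but appearance drifts back to the vanilla
        # (inconsistent) character.
        q = q_vanilla
    elif strategy == "no_q_intervention":
        # Keeps the consistent character, but motion degrades because shared
        # features also alter the queries that encode motion.
        q = q_ours
    else:  # "q_flow"
        # Identity-preserving queries, re-aligned to the vanilla motion.
        q = warp_with_flow(q_ours)
    return F.scaled_dot_product_attention(q, k, v)

# Example shapes: batch of 2 frames, 8 heads, 1024 tokens, 64-dim heads.
q = torch.randn(2, 8, 1024, 64)
k, v = torch.randn_like(q), torch.randn_like(q)
out = attention_with_q_injection(q, k, v, strategy="no_q_intervention")
```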
Ablating ConsiStory components in video generation: Video Storyboarding best balances motion quality and identity consistency, exhibiting rich motion together with identity preservation. VideoCrafter2 delivers diverse motion but struggles with consistent characters. ConsiS suffers from poor identity preservation and motion artifacts. ConsiS +Uncond resolves the motion artifacts but weakens motion magnitude and identity. Q ConsiS couples each frame with a single frame in an anchor video, allowing some natural motion with improved identity (see the sketch below).
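The ablations above build on framewise subject-driven self-attention; a minimal sketch of that idea, in the spirit of ConsiStory-style extended attention, follows. It assumes each frame attends to its own tokens plus the subject tokens of the corresponding frame in the other shots; tensor names, shapes, and the masking scheme are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def framewise_subject_attention(q, k, v, subject_masks):
    """Framewise subject-driven self-attention across shots (sketch).

    q, k, v:       (S, F, H, N, D) - S shots, F frames, H heads, N tokens, D dims
    subject_masks: (S, F, N) boolean, True where the subject appears
    """
    S, Fr, H, N, D = q.shape
    out = torch.empty_like(q)
    for f in range(Fr):
        # Keys/values of frame f gathered from all shots.
        k_f = k[:, f].permute(1, 0, 2, 3).reshape(H, S * N, D)
        v_f = v[:, f].permute(1, 0, 2, 3).reshape(H, S * N, D)
        mask_f = subject_masks[:, f].reshape(S * N)
        for s in range(S):
            # Allow attention to the shot's own tokens, plus subject tokens
            # from the same frame index in every other shot.
            own = torch.zeros(S * N, dtype=torch.bool)
            own[s * N:(s + 1) * N] = True
            allowed = (own | mask_f).view(1, 1, S * N).expand(H, N, S * N)
            out[s, f] = F.scaled_dot_product_attention(
                q[s, f], k_f, v_f, attn_mask=allowed)
    return out
```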
Limitations such as reduced video quality and motion artifacts diminish as our approach scales to stronger models. This is demonstrated using T2V-TURBO-V2 (Li et al., NeurIPS 2024), a recent state-of-the-art video model.
The approach also applies to general (superclass) subject prompts, such as "woman" or "rabbit".
User Study: (left) We measure user preferences for set consistency and (right) how well the generated motion matches the text prompt. Our approach achieves the highest set consistency score while maintaining competitive text-motion alignment. Notably, 55% of our generated motions were judged to be of similar or better quality compared to the vanilla model. Error bars are S.E.M.
@article{atzmon2024multi,
  title   = {Multi-Shot Character Consistency for Text-to-Video Generation},
  author  = {Atzmon, Yuval and Gal, Rinon and Tewel, Yoad and Kasten, Yoni and Chechik, Gal},
  journal = {arXiv preprint arXiv:2412.07750},
  year    = {2024},
}