ConsiStory: Training-Free Consistent
Text-to-Image Generation

NVIDIA, Tel Aviv University, Independent
Accepted to SIGGRAPH 2024

TL;DR: We enable Stable Diffusion XL (SDXL) to generate consistent subjects across a series of images, without any additional training.


We present ConsiStory, a training-free approach that enables consistent subject generation in pretrained text-to-image models. It requires no finetuning or personalization, and as a result takes only ~10 seconds per generated image on an H100 GPU (×20 faster than previous state-of-the-art methods). We enhance the model by introducing a subject-driven shared attention block and correspondence-based feature injection to promote subject consistency between images. Additionally, we develop strategies to encourage layout diversity while maintaining subject consistency. ConsiStory naturally extends to multi-subject scenarios and even enables training-free personalization for common objects.


Consistent Set Generations

Abstract

Text-to-image models offer a new level of creative flexibility by allowing users to guide the image generation process through natural language. However, using these models to consistently portray the same subject across diverse prompts remains challenging. Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects or add image conditioning to the model. These methods require lengthy per-subject optimization or large-scale pre-training. Moreover, they struggle to align generated images with text prompts and face difficulties in portraying multiple subjects. Here, we present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model. We introduce a subject-driven shared attention block and correspondence-based feature injection to promote subject consistency between images. Additionally, we develop strategies to encourage layout diversity while maintaining subject consistency. We compare ConsiStory to a range of baselines, and demonstrate state-of-the-art performance on subject consistency and text alignment, without requiring a single optimization step. Finally, ConsiStory can naturally extend to multi-subject scenarios, and even enable training-free personalization for common objects.

How does it work?

Architecture outline (left): Given a set of prompts, at every generation step we localize the subject in each generated image I_i. We utilize the cross-attention maps accumulated up to the current generation step to create subject masks M_i. Then, we replace the standard self-attention layers in the U-Net decoder with Subject-Driven Self-Attention (SDSA) layers that share information between subject instances. We also add Feature Injection for additional refinement.
Subject-Driven Self-Attention: We extend the self-attention layer so that the Query from generated image I_i also has access to the Keys from all other images in the batch (I_j, where j ≠ i), restricted by their subject masks M_j. To enrich layout diversity we (1) weaken the SDSA via dropout, and (2) blend the Query features with vanilla Query features from a non-consistent sampling step, yielding Q*.
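As a concrete illustration, below is a minimal PyTorch sketch of the Subject-Driven Self-Attention step. It is not the paper's implementation: it assumes single-head scaled dot-product attention, per-image token features of shape (B, N, C), and boolean subject masks of shape (B, N) obtained by thresholding the cross-attention maps; the blend coefficient and dropout argument names are illustrative.

import torch

def sdsa(q, k, v, subject_masks, vanilla_q=None, blend_nu=0.0, mask_dropout=0.0):
    """q, k, v: (B, N, C) self-attention projections for a batch of B images.
    subject_masks: (B, N) booleans marking subject patches in each image.
    vanilla_q: optional (B, N, C) queries from a non-consistent sampling step,
               blended into q to restore layout diversity (yields Q*)."""
    B, N, C = q.shape
    if vanilla_q is not None:  # Q* = (1 - nu) * Q + nu * Q_vanilla
        q = (1.0 - blend_nu) * q + blend_nu * vanilla_q

    # Each image attends to its own tokens plus the *subject* tokens of the
    # other images: concatenate keys/values across the batch and build an
    # attention mask that hides non-subject tokens of the other images.
    k_all = k.reshape(1, B * N, C).expand(B, -1, -1)      # (B, B*N, C)
    v_all = v.reshape(1, B * N, C).expand(B, -1, -1)

    allowed = subject_masks.reshape(1, B * N).expand(B, -1).clone()  # (B, B*N)
    for i in range(B):  # image i always sees all of its own tokens
        allowed[i, i * N:(i + 1) * N] = True

    # Weaken SDSA by randomly dropping shared tokens (own tokens are kept).
    if mask_dropout > 0.0:
        drop = torch.rand(B, B * N, device=q.device) < mask_dropout
        own = torch.zeros_like(allowed)
        for i in range(B):
            own[i, i * N:(i + 1) * N] = True
        allowed = allowed & (own | ~drop)

    attn_bias = torch.zeros(B, N, B * N, device=q.device, dtype=q.dtype)
    attn_bias.masked_fill_(~allowed.unsqueeze(1), float("-inf"))

    attn = torch.softmax(q @ k_all.transpose(1, 2) / C ** 0.5 + attn_bias, dim=-1)
    return attn @ v_all                                    # (B, N, C)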
Feature Injection (right): To further refine the subject’s identity across images, we introduce a mechanism for blending features within the batch. We extract a patch correspondence map between each pair of images, and then inject features between images based on that map.
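The sketch below illustrates one way such correspondence-based injection could look, under simplifying assumptions: per-image patch features of shape (N, C), a nearest-neighbor correspondence from cosine similarity, and injection restricted to confident matches inside the subject masks. The blend weight and similarity threshold are illustrative hyperparameters, not the paper's exact values.

import torch
import torch.nn.functional as F

def feature_injection(feat_tgt, feat_src, mask_tgt, mask_src,
                      blend=0.8, sim_threshold=0.5):
    """feat_*: (N, C) patch features; mask_*: (N,) boolean subject masks."""
    f_t = F.normalize(feat_tgt, dim=-1)
    f_s = F.normalize(feat_src, dim=-1)

    sim = f_t @ f_s.T                                    # (N, N) cosine similarities
    sim = sim.masked_fill(~mask_src.unsqueeze(0), -1.0)  # only match into the source subject

    best_sim, best_idx = sim.max(dim=-1)                 # nearest source patch per target patch

    # Inject only for confident matches inside the target's subject region.
    inject = mask_tgt & (best_sim > sim_threshold)
    out = feat_tgt.clone()
    out[inject] = (1.0 - blend) * feat_tgt[inject] + blend * feat_src[best_idx[inject]]
    return out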

Comparison To Current Methods

We evaluated our method against IP-Adapter, Textual Inversion (TI), and DreamBooth-LoRA (DB-LoRA). Some methods failed to maintain consistency (TI) or to follow the prompt (IP-Adapter). Others alternated between maintaining consistency and following the text, but could not achieve both (DB-LoRA). Our method successfully followed the prompt while maintaining subject consistency.

Quantitative Evaluation

Automatic Evaluation (left): ConsiStory (green) achieves the best balance between Subject Consistency and Textual Similarity. Encoder-based methods such as ELITE and IP-Adapter often overfit to visual appearance, while optimization-based methods such as DB-LoRA and TI do not reach our method's level of subject consistency. d denotes different self-attention dropout values. Error bars are S.E.M.
User Study (right): Results indicate a notable preference among participants for our generated images, both in terms of Subject Consistency (Visual) and Textual Similarity (Textual).

Multiple Consistent Subjects

ConsiStory can generate image sets with multiple consistent subjects.

ControlNet Integration

Our method can be integrated with ControlNet to generate a consistent character with pose control.


Training-Free Personalization

We utilize edit-friendly inversion to invert two real images per subject. These inverted images then serve as anchors in our method, enabling training-free personalization.
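To make the anchor mechanism concrete, here is a minimal sketch on top of the sdsa() sketch above, assuming the two real images were already inverted (e.g. with an edit-friendly inversion) so that re-denoising them reproduces the originals, and that they occupy the first n_anchors slots of the batch. Function and variable names are illustrative, not the official implementation.

import torch

def anchor_subject_masks(subject_masks, n_anchors=2):
    """subject_masks: (B, N) booleans; the first n_anchors rows belong to the
    inverted real images. Zero out the masks of the non-anchor images so that,
    inside sdsa(), cross-image attention only flows from the anchors."""
    masks = subject_masks.clone()
    masks[n_anchors:] = False   # non-anchor images contribute no shared tokens
    return masks

With these masks, every generated image in this sketch reads only the anchors' subject patches, while the anchors themselves are re-denoised from their inverted noise and therefore keep the real subject's appearance.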




Seed Variation

Given different starting noise, ConsiStory generates a different consistent set of images.


Ethnic Diversity

The underlying SDXL model may exhibit biases towards certain ethnic groups, and our approach inherits them. Our method can generate consistent subjects belonging to diverse groups when these are provided in the prompt.

BibTeX

If you find our work useful, please cite our paper:

@misc{tewel2024consistory,
      title={Training-Free Consistent Text-to-Image Generation},
      author={Yoad Tewel and Omri Kaduri and Rinon Gal and Yoni Kasten and Lior Wolf and Gal Chechik and Yuval Atzmon},
      year={2024},
      eprint={2402.03286},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2402.03286}
}