Key-Locked Rank One Editing for
Text-to-Image Personalization

NVIDIA, Tel Aviv University
Accepted to SIGGRAPH 2023

We present Perfusion, a new text-to-image personalization method. With a model size of only 100KB per concept (excluding the pretrained model, which is a few GBs) and roughly 4 minutes of training, Perfusion can creatively portray personalized objects. It allows significant changes in their appearance while maintaining their identity, using a novel mechanism we call Key-Locking. Perfusion can also combine individually learned concepts into a single generated image. Finally, it enables controlling the trade-off between visual and textual alignment at inference time, covering the entire Pareto front with just a single trained model.

Teaser.

Perfusion makes it easy to create appealing images.
Typically, just 8 seeds are enough to generate several good image samples.

Abstract

Text-to-image (T2I) models offer a new level of flexibility by allowing users to guide the creative process through natural language. However, personalizing these models to align with user-provided visual concepts remains a challenging problem. The task of T2I personalization poses multiple hard challenges, such as maintaining high visual fidelity while allowing creative control, combining multiple personalized concepts in a single image, and keeping a small model size. We present Perfusion, a T2I personalization method that addresses these challenges using dynamic rank-1 updates to the underlying T2I model. Perfusion avoids overfitting by introducing a new mechanism that “locks” new concepts’ cross-attention Keys to their superordinate category. Additionally, we develop a gated rank-1 approach that enables us to control the influence of a learned concept during inference time and to combine multiple concepts. This allows runtime-efficient balancing of visual fidelity and textual alignment with a single 100KB trained model, which is five orders of magnitude smaller than the current state of the art. Moreover, it can span different operating points across the Pareto front without additional training. Finally, we show that Perfusion outperforms strong baselines in both qualitative and quantitative terms. Importantly, key-locking leads to novel results compared to traditional approaches, allowing personalized object interactions to be portrayed in unprecedented ways, even in one-shot settings.

How does it work?

Architecture outline (A): A prompt is transformed into a sequence of encodings. Each encoding is fed to a set of cross-attention modules (purple blocks) of a diffusion U-Net denoiser. The zoomed-in purple module shows how the Key and Value pathways are conditioned on the text encoding. The Key drives the attention map, which then modulates the Value pathway. Gated Rank-1 Edit (B): Top: The K pathway is locked so that any encoding of e_Hugsy that reaches W_k is mapped to the key of the super-category, K_teddy. Bottom: Any encoding of e_Hugsy that reaches W_v is mapped to V_Hugsy, which is learned. The gated nature of this update makes it possible to apply it selectively, only to the relevant encodings, and provides a means of regulating the strength of the learned concept, as expressed in the output images.
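To make the mechanism concrete, below is a minimal sketch, not the released implementation, of how a gated rank-1 edit of this form could be wrapped around a single cross-attention projection. The function name, the sigmoid gate with its bias and temperature parameters, and the use of a precomputed inverse covariance are illustrative assumptions; on the K pathway target_out would be the locked super-category key (K_teddy), while on the V pathway it would be the learned value (V_Hugsy).

import torch

def gated_rank1_forward(W, x, concept_in, target_out, cov_inv,
                        gate_bias=0.0, gate_temp=0.1):
    # W:          [d_out, d_in]  pretrained projection (W_k or W_v), kept frozen
    # x:          [batch, seq, d_in]  text-encoder outputs for the prompt tokens
    # concept_in: [d_in]   encoding of the concept token (e_Hugsy)
    # target_out: [d_out]  K_teddy on the K pathway, learned V_Hugsy on the V pathway
    # cov_inv:    [d_in, d_in]  inverse (uncentered) covariance of text encodings
    base = x @ W.T                                    # original projection output
    direction = cov_inv @ concept_in                  # [d_in]
    # Normalized similarity of every token encoding to the concept encoding.
    sim = (x @ direction) / (concept_in @ direction)  # [batch, seq]
    # Soft gate: close to 1 for the concept token, close to 0 for unrelated tokens.
    gate = torch.sigmoid((sim - gate_bias) / gate_temp)
    # Rank-1 correction that steers the concept token's output toward target_out,
    # leaving all other tokens (where the gate is closed) essentially untouched.
    delta = target_out - W @ concept_in               # [d_out]
    return base + gate.unsqueeze(-1) * delta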

Comparison To Current Methods

Perfusion enables more animated results, with better prompt matching and less susceptibility to background traits carried over from the original image. For each concept, we show exemplars from our training set, along with generated images, their conditioning texts, and comparisons to the Custom-Diffusion, DreamBooth, and Textual-Inversion baselines.

Compositions

Our method enables us to combine multiple learned concepts into a single generated image using a textual prompt. The concepts are learned individually and merged only at runtime to produce the final image.
This results in a visually appealing display of concept interactions, which we compare to Custom-Diffusion. Except for the teddy* prompt, all prompts are taken from the Custom-Diffusion paper and use the images provided there.
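A minimal sketch of how such runtime merging could look, continuing the illustrative sketch above: each individually trained concept contributes its own gated rank-1 correction, and the corrections are simply accumulated on top of the frozen projection. The per-concept dictionary fields are assumptions matching the earlier sketch, not the authors' storage format.

import torch

def multi_concept_forward(W, x, concepts, cov_inv, gate_temp=0.1):
    # `concepts` is an assumed list of per-concept dicts, each trained in
    # isolation, holding "concept_in", "target_out", and "gate_bias" tensors.
    out = x @ W.T
    for c in concepts:
        direction = cov_inv @ c["concept_in"]
        sim = (x @ direction) / (c["concept_in"] @ direction)
        gate = torch.sigmoid((sim - c["gate_bias"]) / gate_temp)
        delta = c["target_out"] - W @ c["concept_in"]
        out = out + gate.unsqueeze(-1) * delta        # corrections simply add up
    return out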


Efficiently Control the Visual-Textual Alignment

Our method enables controlling the trade-off between visual fidelity and textual alignment at inference time. A high bias value reduces the concept’s effect, while a low bias value makes it more influential. With just a single 100KB trained model and run-time parameter choices, Perfusion (blue and cyan) spans the Pareto front.
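In terms of the earlier sketch, sweeping this bias simply means re-running the same gated forward pass with different gate_bias values; no retraining is involved. The snippet below reuses gated_rank1_forward from the sketch above, with illustrative tensor sizes and bias values.

import torch

d_in, d_out = 768, 320
W = torch.randn(d_out, d_in)                 # frozen projection (illustrative size)
cov_inv = torch.eye(d_in)                    # stand-in for the real covariance
concept_in = torch.randn(d_in)               # e_Hugsy
target_out = torch.randn(d_out)              # learned V_Hugsy (or locked K_teddy)
x = torch.randn(1, 77, d_in)                 # prompt encodings

# Higher bias -> gate opens less -> weaker concept, better textual alignment.
# Lower bias  -> gate opens more -> stronger concept, better visual fidelity.
for gate_bias in (-0.2, 0.0, 0.2, 0.4):
    out = gated_rank1_forward(W, x, concept_in, target_out, cov_inv,
                              gate_bias=gate_bias)   # fed into attention as usual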




1-shot Personalization

When trained with a single image, our method can generate images with both high visual fidelity and textual alignment.




Comparing Lock Types

We present three variations of key-locking:
Global key-locking allows for more visual variability and can accurately portray the nuances of an object or activity, such as depicting the cat in a human-like posture, reading a book or wearing a chef's outfit. Local key-locking also has its successes, but it is not as effective as global key-locking. Finally, Trained-K stays closer to the training images, but it sacrifices alignment with the text.




Zero-shot Transfer To Fine-tuned Models

A Perfusion concept trained with a vanilla diffusion model can generalize to fine-tuned variants of that model.
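Intuitively, the stored concept tensors are decoupled from the base weights, so the same gated correction can be wrapped around a fine-tuned model's own projections. A small illustrative continuation of the sketches above, where the "fine-tuned" weights are only simulated by a perturbation:

# Stand-in for a fine-tuned variant's W_k / W_v (illustrative perturbation only).
W_finetuned = W + 0.01 * torch.randn_like(W)
# The same concept tensors (concept_in, target_out, cov_inv) are reused unchanged.
out_finetuned = gated_rank1_forward(W_finetuned, x, concept_in, target_out,
                                    cov_inv, gate_bias=0.0)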



BibTeX

If you find our work useful, please cite our paper:

@inproceedings{tewel2023keylocked,
      author = {Tewel, Yoad and Gal, Rinon and Chechik, Gal and Atzmon, Yuval},
      title = {Key-Locked Rank One Editing for Text-to-Image Personalization},
      year = {2023},
      booktitle = {ACM SIGGRAPH 2023 Conference Proceedings},
      location = {Los Angeles, CA, USA},
      series = {SIGGRAPH '23}
}