Key-Locked Rank One Editing for
Text-to-Image Personalization

NVIDIA, Tel Aviv University
Accepted to SIGGRAPH 2023

We present Perfusion, a new text-to-image personalization method. With a model size of only 100KB per concept (excluding the pretrained model, which is a few GBs) and roughly 4 minutes of training, Perfusion can creatively portray personalized objects. It allows significant changes in their appearance while maintaining their identity, using a novel mechanism we call Key-Locking. Perfusion can also combine individually learned concepts into a single generated image. Finally, it enables controlling the trade-off between visual and textual alignment at inference time, covering the entire Pareto front with just a single trained model.

Teaser.

Perfusion makes it easy to create appealing images.
Typically, just 8 seeds are enough to generate several good image samples.

Abstract

Text-to-image (T2I) models offer a new level of flexibility by allowing users to guide the creative process through natural language. However, personalizing these models to align with user-provided visual concepts remains a challenging problem. The task of T2I personalization poses multiple hard challenges, such as maintaining high visual fidelity while allowing creative control, combining multiple personalized concepts in a single image, and keeping a small model size. We present Perfusion, a T2I personalization method that addresses these challenges using dynamic rank-1 updates to the underlying T2I model. Perfusion avoids overfitting by introducing a new mechanism that “locks” new concepts’ cross-attention Keys to their superordinate category. Additionally, we develop a gated rank-1 approach that enables us to control the influence of a learned concept during inference time and to combine multiple concepts. This allows runtime-efficient balancing of visual fidelity and textual alignment with a single 100KB trained model, which is five orders of magnitude smaller than the current state of the art. Moreover, it can span different operating points across the Pareto front without additional training. Finally, we show that Perfusion outperforms strong baselines in both qualitative and quantitative terms. Importantly, key-locking leads to novel results compared to traditional approaches, allowing personalized object interactions to be portrayed in unprecedented ways, even in one-shot settings.

How does it work?

Architecture outline (A): A prompt is transformed into a sequence of encodings. Each encoding is fed to a set of cross-attention modules (purple blocks) of a diffusion U-Net denoiser. The zoomed-in purple module shows how the Key and Value pathways are conditioned on the text encoding. The Key drives the attention map, which then modulates the Value pathway. Gated Rank-1 Edit (B): Top: The K pathway is locked so that any encoding of e_Hugsy that reaches W_k is mapped to the key of the super-category, K_teddy. Bottom: Any encoding of e_Hugsy that reaches W_v is mapped to V_Hugsy, which is learned. The gated nature of this update makes it possible to apply it selectively, only to the relevant encodings, and provides a means of regulating the strength of the learned concept, as expressed in the output images.
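To make the mechanism concrete, below is a minimal sketch, not the released implementation, of how a gated rank-1 edit of this form could be wrapped around a single cross-attention projection. The function name, the sigmoid gate with its bias and temperature parameters, and the use of a precomputed inverse covariance are illustrative assumptions; on the K pathway target_out would be the locked super-category key (K_teddy), while on the V pathway it would be the learned value (V_Hugsy).

import torch

def gated_rank1_forward(W, x, concept_in, target_out, cov_inv,
                        gate_bias=0.0, gate_temp=0.1):
    # W:          [d_out, d_in]  pretrained projection (W_k or W_v), kept frozen
    # x:          [batch, seq, d_in]  text-encoder outputs for the prompt tokens
    # concept_in: [d_in]   encoding of the concept token (e_Hugsy)
    # target_out: [d_out]  K_teddy on the K pathway, learned V_Hugsy on the V pathway
    # cov_inv:    [d_in, d_in]  inverse (uncentered) covariance of text encodings
    base = x @ W.T                                    # original projection output
    direction = cov_inv @ concept_in                  # [d_in]
    # Normalized similarity of every token encoding to the concept encoding.
    sim = (x @ direction) / (concept_in @ direction)  # [batch, seq]
    # Soft gate: close to 1 for the concept token, close to 0 for unrelated tokens.
    gate = torch.sigmoid((sim - gate_bias) / gate_temp)
    # Rank-1 correction that steers the concept token's output toward target_out,
    # leaving all other tokens (where the gate is closed) essentially untouched.
    delta = target_out - W @ concept_in               # [d_out]
    return base + gate.unsqueeze(-1) * delta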

Comparison To Current Methods

Perfusion enables more animated results, with better prompt matching and less susceptibility to background traits carried over from the original image. For each concept, we show exemplars from our training set, along with generated images, their conditioning texts, and comparisons to the Custom-Diffusion, DreamBooth, and Textual-Inversion baselines.

Compositions

Our method enables us to combine multiple learned concepts into a single generated image using a textual prompt. The concepts are learned individually and merged only at runtime to produce the final image.
This results in a visually appealing display of concept interactions, which we compare to Custom-Diffusion. Except for the teddy* prompt, all prompts are taken from the Custom-Diffusion paper and use the images provided there.
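A minimal sketch of how such runtime merging could look, continuing the illustrative sketch above: each individually trained concept contributes its own gated rank-1 correction, and the corrections are simply accumulated on top of the frozen projection. The per-concept dictionary fields are assumptions matching the earlier sketch, not the authors' storage format.

import torch

def multi_concept_forward(W, x, concepts, cov_inv, gate_temp=0.1):
    # `concepts` is an assumed list of per-concept dicts, each trained in
    # isolation, holding "concept_in", "target_out", and "gate_bias" tensors.
    out = x @ W.T
    for c in concepts:
        direction = cov_inv @ c["concept_in"]
        sim = (x @ direction) / (c["concept_in"] @ direction)
        gate = torch.sigmoid((sim - c["gate_bias"]) / gate_temp)
        delta = c["target_out"] - W @ c["concept_in"]
        out = out + gate.unsqueeze(-1) * delta        # corrections simply add up
    return out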


Efficiently Control the Visual-Textual Alignment

Our method enables controlling the trade-off between visual fidelity and textual alignment at inference time. A high bias value reduces the concept’s effect, while a low bias value makes it more influential. With just a single 100KB trained model and run-time parameter choices, Perfusion (blue and cyan) spans the Pareto front.
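In terms of the earlier sketch, sweeping this bias simply means re-running the same gated forward pass with different gate_bias values; no retraining is involved. The snippet below reuses gated_rank1_forward from the sketch above, with illustrative tensor sizes and bias values.

import torch

d_in, d_out = 768, 320
W = torch.randn(d_out, d_in)                 # frozen projection (illustrative size)
cov_inv = torch.eye(d_in)                    # stand-in for the real covariance
concept_in = torch.randn(d_in)               # e_Hugsy
target_out = torch.randn(d_out)              # learned V_Hugsy (or locked K_teddy)
x = torch.randn(1, 77, d_in)                 # prompt encodings

# Higher bias -> gate opens less -> weaker concept, better textual alignment.
# Lower bias  -> gate opens more -> stronger concept, better visual fidelity.
for gate_bias in (-0.2, 0.0, 0.2, 0.4):
    out = gated_rank1_forward(W, x, concept_in, target_out, cov_inv,
                              gate_bias=gate_bias)   # fed into attention as usual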




1-shot Personalization

When trained with a single image, our method can generate images with both high visual fidelity and textual alignment.




Comparing Lock Types

We present three variations of key-locking:
Global key-locking allows for more visual variability and can accurately portray the nuances of an object or activity, such as depicting the cat in a human-like posture, reading a book or wearing a chef's outfit. Local key-locking also has its successes, but it is not as effective as global key-locking. Finally, Trained-K stays closer to the training images, but it sacrifices alignment with the text.




Zero-shot Transfer To Fine-tuned Models

A Perfusion concept trained with a vanilla diffusion model can generalize to fine-tuned variants of that model.
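Intuitively, the stored concept tensors are decoupled from the base weights, so the same gated correction can be wrapped around a fine-tuned model's own projections. A small illustrative continuation of the sketches above, where the "fine-tuned" weights are only simulated by a perturbation:

# Stand-in for a fine-tuned variant's W_k / W_v (illustrative perturbation only).
W_finetuned = W + 0.01 * torch.randn_like(W)
# The same concept tensors (concept_in, target_out, cov_inv) are reused unchanged.
out_finetuned = gated_rank1_forward(W_finetuned, x, concept_in, target_out,
                                    cov_inv, gate_bias=0.0)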



BibTeX

If you find our work useful, please cite our paper:

@inproceedings{tewel2023keylocked,
      author = {Tewel, Yoad and Gal, Rinon and Chechik, Gal and Atzmon, Yuval},
      title = {Key-Locked Rank One Editing for Text-to-Image Personalization},
      year = {2023},
      booktitle = {ACM SIGGRAPH 2023 Conference Proceedings},
      location = {Los Angeles, CA, USA},
      series = {SIGGRAPH '23}
}