JeDi: Joint-Image Diffusion Models
for Finetuning-free Personalized Text-to-image Generation

CVPR 2024
Yu Zeng¹   Vishal M. Patel¹   Haochen Wang³   Xun Huang²
Ting-Chun Wang²   Ming-Yu Liu²   Yogesh Balaji²
¹Johns Hopkins University   ²NVIDIA Corporation   ³Illinois Tech

We propose Joint-Image Diffusion, an effective technique for learning a finetuning-free personalization model. Our key idea is to learn the joint distribution of multiple related text-image pairs that share a common subject. To facilitate learning, we propose a scalable synthetic dataset generation technique. Once trained, our model enables fast and easy personalization at test time by simply using reference images as input during the sampling process. Our approach does not require any expensive optimization process or additional modules and can faithfully preserve the identity represented by any number of reference images. Experimental results show that our model achieves state-of-the-art generation quality, both quantitatively and qualitatively, significantly outperforming the prior finetuning-free personalization baselines.

For business inquiries, please visit our website and submit the form: NVIDIA Research Licensing.


We create a synthetic dataset of same-subject images using LLMs and pretrained text-to-image diffusion models. A joint-image diffusion model is trained on this dataset that learns to denoise multiple same-subject images together. At inference, personalized generation is performed in an inpainting fashion where the goal is to generate the missing images of a joint-image set.
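The inpainting-style sampling described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: `denoise_fn` stands in for one reverse step of a trained joint-image diffusion model, and the reference slots are re-noised to the current noise level at every step (standard replacement-based inpainting), so only the missing slots are truly generated.

```python
import numpy as np

def personalized_sample(refs, n_missing, denoise_fn, noise_level_fn,
                        n_steps=10, seed=0):
    """Inpainting-style sampling over a joint-image set (illustrative sketch).

    refs:           (n_ref, dim) reference images occupying the known slots.
    n_missing:      number of images to generate.
    denoise_fn:     stand-in for one reverse-diffusion step of a trained
                    joint-image model, (x, t) -> x.
    noise_level_fn: stand-in mapping a timestep to its noise scale.
    """
    rng = np.random.default_rng(seed)
    n_ref, dim = refs.shape
    # Start every slot of the joint-image set from pure noise.
    x = rng.standard_normal((n_ref + n_missing, dim))
    for t in reversed(range(n_steps)):
        sigma = noise_level_fn(t)
        # Overwrite the known slots with the references noised to the
        # current level, as in image inpainting.
        x[:n_ref] = refs + sigma * rng.standard_normal((n_ref, dim))
        x = denoise_fn(x, t)
    return x[n_ref:]  # only the generated (previously missing) images

# Toy run with a dummy denoiser that shrinks samples toward the references.
refs = np.ones((2, 4))
out = personalized_sample(refs, 1,
                          lambda x, t: 0.9 * x + 0.1 * refs.mean(0),
                          lambda t: t / 10)
print(out.shape)  # (1, 4)
```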

The dataset is created by instructing text-to-image diffusion models to generate photo collages, followed by filtering and background inpainting.
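The filtering step can be illustrated with a simple mutual-similarity check: a collage is kept only if all of its panels are close in an embedding space (e.g., CLIP image embeddings). The threshold and the choice of embedding model here are assumptions for illustration, not the paper's exact settings.

```python
import numpy as np

def filter_same_subject(panel_embeddings, threshold=0.85):
    """Keep a collage only if every pair of panel embeddings has cosine
    similarity above `threshold` (illustrative same-subject filter;
    the threshold value is an assumption)."""
    e = panel_embeddings / np.linalg.norm(panel_embeddings, axis=1,
                                          keepdims=True)
    sim = e @ e.T                      # pairwise cosine similarities
    n = len(e)
    off_diag = sim[~np.eye(n, dtype=bool)]
    return bool(off_diag.min() >= threshold)

# Nearly identical panel embeddings pass the filter...
same = np.array([[1.0, 0.0], [0.99, 0.05]])
print(filter_same_subject(same))      # True
# ...while dissimilar panels are rejected.
diff = np.array([[1.0, 0.0], [0.0, 1.0]])
print(filter_same_subject(diff))      # False
```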

In the joint-image diffusion model, images of the same subject are grouped into an image set. The self-attention layers are modified so that each image co-attends to every other image in the same set. Each individual image then attends to its respective text embedding in the cross-attention layers.
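The coupled self-attention can be sketched by flattening the image axis of a set so that queries from any image attend to the keys and values of every image in the set. This is a minimal single-head NumPy sketch of the mechanism described above, not the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def coupled_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a joint-image set (sketch).

    x: (n_images, n_tokens, dim) -- latent features of one same-subject set.
    Flattening the image axis lets each of the n_images * n_tokens query
    tokens attend to every token of every image in the set.
    """
    n, l, d = x.shape
    q = (x @ w_q).reshape(n * l, d)
    k = (x @ w_k).reshape(n * l, d)
    v = (x @ w_v).reshape(n * l, d)
    attn = softmax(q @ k.T / np.sqrt(d))      # (n*l, n*l): set-wide attention
    return (attn @ v).reshape(n, l, d)

rng = np.random.default_rng(0)
n, l, d = 3, 4, 8                              # 3 same-subject images, 4 tokens each
x = rng.standard_normal((n, l, d))
w = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]
y = coupled_self_attention(x, *w)
print(y.shape)  # (3, 4, 8)
```

Because the attention matrix spans the whole set, perturbing one image changes the output for every other image, which is what lets a shared subject identity propagate across the set.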

Retrieval-augmented synthesis for generating new concepts

Combined with image retrieval, JeDi can generate unseen new concepts at test time.

Examples (comparing SDXL against Retrieval + JeDi):
- Keyword: "Winton dog"; prompt: "A cartoon dog wearing a spacesuit in outer space."
- Keyword: "moncler narmada jacket"; prompt: "A pigeon wearing a jacket."
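The retrieval step amounts to a nearest-neighbor lookup: embed the keyword, score it against an image bank, and feed the top matches to JeDi as reference images. The cosine-similarity ranking below is a generic sketch of such a retriever; the embedding model and bank are assumptions, not the paper's specific pipeline.

```python
import numpy as np

def retrieve_references(query_emb, bank_embs, k=3):
    """Return the indices of the k bank images whose embeddings are closest
    (cosine similarity) to the query embedding. These retrieved images would
    occupy the reference slots of the joint-image set at sampling time."""
    q = query_emb / np.linalg.norm(query_emb)
    b = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    scores = b @ q                       # cosine similarity to the query
    return np.argsort(-scores)[:k]       # indices of the top-k matches

# Toy 2-D embedding bank: items 3 and 0 point nearly along the query.
bank = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.9, 0.1]])
idx = retrieve_references(np.array([1.0, 0.1]), bank, k=2)
print(idx)  # [3 0]
```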


Comparison to finetuning-based methods

Comparison to finetuning-free methods

Quantitative comparison on the DreamBooth test set

Type              Method            CLIP-T   CLIP-I   MCLIP-I  DINO     MDINO
Finetuning-based  DreamBooth        0.2812   0.8135   0.8683   0.6341   0.7115
                  Custom Diffusion  0.3015   0.7952   0.8640   0.6343   0.7109
Finetuning-free   BLIP Diffusion    0.2934   0.7899   0.8620   0.5855   0.6692
                  ELITE             0.2961   0.7924   0.8615   0.5922   0.6805
                  JeDi (1 input)    0.3040   0.7818   0.8764   0.6190   0.7510
                  JeDi (3 inputs)   0.2932   0.8139   0.9011   0.6791   0.8037

Higher is better for all metrics.

Quantitative comparison on single-input personalization

Method          CLIP-T   CLIP-I   MCLIP-I  DINO     MDINO
BLIP Diffusion  0.2851   0.8107   0.8234   0.6091   0.6018
ELITE           0.2193   0.6082   0.6430   0.1862   0.2156
JeDi            0.2856   0.8697   0.8838   0.7934   0.7926

Introducing retrieval-augmented generation based on JeDi


@inproceedings{zeng2024jedi,
                title={JeDi: Joint-image Diffusion Models for Finetuning-free Personalized Text-to-image Generation},
                author={Zeng, Yu and Patel, Vishal M and Wang, Haochen and Huang, Xun and Wang, Ting-Chun and Liu, Ming-Yu and Balaji, Yogesh},
                booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
                year={2024}
}