Introduction

Text-to-image generation and large language models (LLMs) have both advanced significantly, yet many text-to-image models still rely on the somewhat outdated T5 and CLIP as their text encoders. In this work, we investigate the effectiveness of using modern decoder-only LLMs as text encoders for text-to-image diffusion models.

Our experiments reveal that the de facto practice of conditioning on last-layer embeddings leads to inferior performance. Instead, we explore embeddings from various layers and find that layer-normalized averaging across all layers significantly improves alignment with complex prompts. With this conditioning, most LLMs outperform the baseline T5 model and show stronger advanced visio-linguistic reasoning skills.

[Figure: images generated for the prompt below, one per text encoder: CLIP (last layer), T5 (last layer), Mistral (norm avg), Bge-Gemma2 (norm avg)]

A small dog not in a tiny sweater, playing joyfully without any clothes. The fluffy white dog, with big brown eyes and floppy ears, bounds through a sun-drenched field of wildflowers. Its tongue lolls out in pure happiness as it chases a bright red butterfly, its tiny paws barely touching the ground.

Key Findings

Using text embeddings from the last layer of the LLM is suboptimal

To our knowledge, all current text-to-image diffusion models utilize the embeddings from the final layer of the text encoders as the conditional embedding. However, our results reveal that this approach does not translate effectively to LLMs, leading to inferior results compared to using T5.

VQA last layer bar chart

Aggregating features from multiple layers outperforms using a single layer

We find that using embeddings normalized and averaged across all layers yields far better performance than relying on any single layer alone. This is because each layer within an LLM captures different aspects of linguistic information, so using averaged embeddings can combine the strengths of every layer to create a richer and more comprehensive representation.
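A minimal sketch of the norm-average idea, under stated assumptions: "layer-normalized" is taken here to mean a per-token layer normalization without learned scale or shift, and every layer is weighted equally. The function names (`layer_norm`, `norm_avg`) and the random toy tensor are illustrative only and are not taken from the authors' code.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token embedding to zero mean and unit variance.

    Assumption: no learned scale/shift parameters are applied.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def norm_avg(hidden_states):
    """Average layer-normalized embeddings over the layer axis.

    hidden_states: array of shape (num_layers, seq_len, hidden_dim).
    Returns an array of shape (seq_len, hidden_dim).
    """
    return layer_norm(hidden_states).mean(axis=0)

# Toy stand-in for per-layer LLM outputs; each layer gets a very different
# magnitude to mimic layers whose activations live on different scales.
rng = np.random.default_rng(0)
num_layers, seq_len, hidden_dim = 4, 3, 8
scales = np.array([1.0, 10.0, 100.0, 1000.0]).reshape(-1, 1, 1)
hidden_states = rng.normal(size=(num_layers, seq_len, hidden_dim)) * scales

conditioning = norm_avg(hidden_states)  # norm-average conditioning
last_layer = hidden_states[-1]          # last-layer baseline, for contrast
```

Normalizing before averaging keeps a layer with large activations from dominating the mean, which is one plausible reason to normalize at all; the source does not spell out this detail.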

Mistral layers
Different embeddings

Heatmap visualizations of token cross-attention show how the norm-average model outperforms the last-layer model on prompts requiring advanced visio-linguistic reasoning skills, such as differentiation (left set) and comparison (right set):

[Figure: images generated with Mistral (last layer) and Mistral (norm avg), each with its token cross-attention heatmap, for the prompt: A larger gorilla hands a smaller, mechanical monkey a banana]

[Figure: images generated with Mistral (last layer) and Mistral (norm avg), each with its token cross-attention heatmap, for the prompt: A tomato vine with several tomatoes on it, all yellow except the largest which is red]
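For readers unfamiliar with token cross-attention heatmaps, the sketch below shows one common way to compute them: a softmax over text tokens of scaled dot products between image-location queries and text-token keys, with the column for one token reshaped to the spatial grid. The source does not describe how its heatmaps were produced, so the single-head formulation, the function name `cross_attention_map`, and the random inputs are all assumptions.

```python
import numpy as np

def cross_attention_map(image_queries, text_keys, token_index, grid_hw):
    """Return an (H, W) map of attention paid to one text token.

    image_queries: (H*W, d) query vectors, one per image location.
    text_keys:     (num_tokens, d) key vectors, one per text token.
    """
    d = image_queries.shape[-1]
    logits = image_queries @ text_keys.T / np.sqrt(d)     # (H*W, num_tokens)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over tokens
    h, w = grid_hw
    return weights[:, token_index].reshape(h, w)

# Toy example: a 4x4 latent grid attending over 5 text tokens.
rng = np.random.default_rng(0)
queries = rng.normal(size=(16, 32))
keys = rng.normal(size=(5, 32))
heatmaps = [cross_attention_map(queries, keys, t, (4, 4)) for t in range(5)]
```

Because the softmax runs over tokens, the per-token maps sum to one at every image location, so a bright region in one token's map means that token dominates the conditioning there.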

Scaling up the LLM is beneficial, but not across all aspects

Increasing the model size of LLMs consistently leads to improved performance. However, we observe that model size does not uniformly enhance all aspects of compositional text-to-image generation. These results suggest that simply scaling model size may not be the most efficient approach for improving performance across all skills, highlighting the potential of alternative strategies, such as hybrid models or skill-specific fine-tuning.

Qwen2 scaling radar
Gemma2 scaling radar

We evaluate differently sized Qwen2 and Gemma2 models. In some aspects, the smaller models (orange) do not differ substantially from the larger models (blue). Our results show that scaling does improve performance, but not uniformly across compositional skills.

Experimental Setup

Training Pipeline

Our training pipeline is designed to evaluate the performance of various text encoders within a standardized text-to-image diffusion model framework. Following the SD2 architecture, we retain and freeze all model components except for the UNet. In each experiment, we replace the text encoder with a different LLM or fine-tuned embedding model.
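The setup above can be sketched as a small configuration, assuming the pipeline consists of a text encoder, a VAE, and a UNet, as in typical latent diffusion models. The `Component` class, the `build_experiment` helper, and the encoder names passed to it are hypothetical placeholders, not the authors' code.

```python
from dataclasses import dataclass

@dataclass
class Component:
    name: str
    trainable: bool

def build_experiment(text_encoder_name):
    """Assemble an SD2-style component list in which only the UNet is trained."""
    return [
        Component(text_encoder_name, trainable=False),  # swapped per experiment, frozen
        Component("vae", trainable=False),              # retained and frozen
        Component("unet", trainable=True),              # the only trained component
    ]

# One experiment per candidate text encoder (names are placeholders).
experiments = {name: build_experiment(name) for name in ["t5", "mistral"]}
trainable = {
    name: [c.name for c in components if c.trainable]
    for name, components in experiments.items()
}
```

Holding everything but the text encoder fixed means that differences in output quality can be attributed to the conditioning signal rather than to other architectural changes.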

Training pipeline overview

Models of Interest

We mainly explore four types of text encoders:

Benchmarking and Metrics

We adopt GenAI-Bench as our primary benchmarking suite. GenAI-Bench includes 1,600 diverse and challenging prompts, each annotated with the specific aspects of compositional text-to-visual generation that it tests. We use VQAScore as our primary evaluation metric, as VQA-based automatic evaluation methods have demonstrated higher reliability and correlation with human judgements.
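Because each prompt carries skill annotations, per-skill results such as those in the radar charts can be obtained by averaging scores over the prompts tagged with each skill. The sketch below illustrates that aggregation only; the record layout, the helper `mean_score_per_skill`, and the scores are made-up placeholders rather than benchmark data, and the VQAScore computation itself is not shown.

```python
from collections import defaultdict

def mean_score_per_skill(records):
    """Average a per-prompt score over every skill tag attached to the prompt."""
    totals = defaultdict(lambda: [0.0, 0])  # skill -> [score sum, prompt count]
    for record in records:
        for skill in record["skills"]:
            totals[skill][0] += record["score"]
            totals[skill][1] += 1
    return {skill: total / count for skill, (total, count) in totals.items()}

# Placeholder records: the scores are illustrative values, not results.
records = [
    {"prompt": "prompt A", "skills": ["counting", "negation"], "score": 0.5},
    {"prompt": "prompt B", "skills": ["negation"], "score": 1.0},
    {"prompt": "prompt C", "skills": ["comparison"], "score": 0.25},
]
per_skill = mean_score_per_skill(records)
```

A prompt tagged with several skills contributes to each of them, so the per-skill averages are not independent of one another.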

Visual Results

Additional visual comparisons on GenAI-Bench

[Figure: images generated for each prompt below, one per text encoder: CLIP (last layer), T5 (last layer), Mistral (norm avg), Bge-Gemma2 (norm avg)]

A single, vibrant red rose blooms defiantly through a narrow crack in the weathered, grey concrete. Its velvety petals unfurl gracefully, reaching for the sunlight that filters weakly through the urban haze.

CLIP-balls
T5-balls
Mistral-balls
Bge-Gemma2-balls

A scene with two blue balls amidst many yellow ones. The blue balls are slightly larger than the yellow ones and have a smooth, glossy surface that reflects the light.

CLIP-box
T5-box
Mistral-box
Bge-Gemma2-box

A yellow felt box has no metallic blue spheres on the left side and has blue metallic spheres on the right side.

CLIP-fish
T5-fish
Mistral-fish
Bge-Gemma2-fish

There is a large fish aquarium in the center of the luxurious living room, but there are no fish in it. The aquarium is made of polished, rippling glass, reflecting the warm glow of the chandelier above.

CLIP-dogs
T5-dogs
Mistral-dogs
Bge-Gemma2-dogs

A woman with three dogs and no umbrella in the drizzle. Two golden retrievers bound ahead, their tails wagging despite the light rain, while a small terrier trots obediently by her side.

Additional visual comparisons on common text-to-image prompts

[Figure: images generated for each prompt below, one per text encoder: CLIP (last layer), T5 (last layer), Mistral (norm avg), Bge-Gemma2 (norm avg)]

A photo of a Shiba Inu dog with a backpack riding a bike. It is wearing sunglasses and a beach hat.

CLIP-panda
T5-panda
Mistral-panda
Bge-Gemma2-panda

A high contrast portrait of a very happy fuzzy panda dressed as a chef in a high end kitchen making dough. There is a painting of flowers on the wall behind him.

CLIP-ferret
T5-ferret
Mistral-ferret
Bge-Gemma2-ferret

A mischievous ferret with a playful grin squeezes itself into a large glass jar, surrounded by colorful candy. The jar sits on a wooden table in a cozy kitchen, and warm sunlight filters through a nearby window.

CLIP-ice
T5-ice
Mistral-ice
Bge-Gemma2-ice

An icy landscape under a starlit sky, where a magnificent frozen waterfall flows over a cliff. In the center of the scene, a fire burns bright, its flames seemingly frozen in place, casting a shimmering glow on the surrounding ice and snow.

Citation

@inproceedings{wang2025decoder,
  title={A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation},
  author={Wang, Andrew Z. and Ge, Songwei and Karras, Tero and Liu, Ming-Yu and Balaji, Yogesh},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}