APE: Agentic Prompt Enhancer for Image Generation and Editing

¹ NVIDIA ² University of Michigan

TL;DR: APE post-trains small language models (SLMs) as prompt-enhancement agents for image generation and editing. It supports a single-pass rewriter (SAPE) and a structured router–rewriter–composer multi-agent pipeline (MAPE), narrowing the gap to closed-source prompt enhancers without modifying the downstream visual model.

Prompt Matters

Text-guided image generators are highly sensitive to prompt formulation. Short, ambiguous prompts routinely produce wrong object counts, broken spatial relations, or out-of-distribution attributes. APE rewrites the user instruction into a richer, more visually realizable prompt. Each card shows the image from the original prompt (left) vs. the enhanced prompt (right).

Image model: Z-Image-turbo · Prompt enhancer: Qwen3-4B after GRPO.

Abstract

Natural language has become a powerful interface for image generation and editing, yet text-guided visual systems remain highly sensitive to prompt formulation. Semantically similar requests can produce different outputs depending on wording, specificity, and how explicitly visual constraints are stated, motivating prompt enhancement as a trainable component rather than a peripheral user choice. Existing strong enhancers often rely on large, proprietary LLMs such as ChatGPT or Gemini, adding cost, latency, and deployment dependence to the visual generation pipeline. We propose Agentic Prompt Enhancer (APE), a lightweight framework that post-trains small language models (SLMs) as prompt-enhancement agents. APE supports both single-agent rewriting and role-specialized multi-agent enhancement. Its single-agent instantiation, SAPE, rewrites the prompt in one pass, while its multi-agent instantiation, MAPE, decomposes enhancement into a router–rewriter–composer process for handling compositional constraints over objects, attributes, spatial relations, and edits. With task-aware rewards and post-training protocols, APE improves visual alignment and prompt following without modifying the downstream visual model. Experiments on challenging image generation and editing benchmarks demonstrate that post-trained small prompt enhancers reliably outperform their base counterparts, narrowing the gap to closed-source prompt enhancers; in addition, MAPE proves particularly strong on complex compositional tasks within these benchmarks.

Method

APE rewrites a user instruction (and an optional source image) into a more effective prompt. SAPE performs a single-pass rewrite; MAPE decomposes the rewrite into a router–rewriter–composer pipeline over semantic fields. The image model stays frozen; only APE is post-trained with GRPO / GDPO.

SAPE

A single small LLM rewrites the prompt in one pass. RL on this lightweight enhancer alone, with a frozen image model, already yields substantial gains in visual alignment.

MAPE

A router–rewriter–composer pipeline over predefined semantic fields (subject, appearance, background, composition, lighting, style, edit operation, locality …). Stronger inductive bias for compositional and constraint-heavy prompts.
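
The control flow of such a pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `MapePipeline` class, the prompt templates, the comma-separated router reply, and the pass-through when no field is selected are all hypothetical, and the three language-model calls are replaced by toy stand-in functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Field names copied from the list above; the real schema may differ.
FIELDS = ["subject", "appearance", "background", "composition",
          "lighting", "style", "edit operation", "locality"]

# Stand-in type for a language-model call: prompt text in, text out.
LLM = Callable[[str], str]


@dataclass
class MapePipeline:
    router: LLM
    rewriter: LLM
    composer: LLM

    def route(self, instruction: str) -> List[str]:
        """Ask the router which fields to enhance; keep only known fields."""
        reply = self.router(f"Select fields for: {instruction}")
        picked = [name.strip() for name in reply.split(",")]
        return [name for name in picked if name in FIELDS]

    def enhance(self, instruction: str) -> str:
        fields = self.route(instruction)
        if not fields:
            # Assumed behavior: nothing selected, so pass the input through.
            return instruction
        # One rewriter call per selected field.
        drafts: Dict[str, str] = {
            f: self.rewriter(f"Refine the {f} of: {instruction}") for f in fields
        }
        parts = "; ".join(f"{f}: {text}" for f, text in drafts.items())
        return self.composer(f"Compose one prompt for '{instruction}' from {parts}")


# Toy stand-ins for the three agents (hypothetical, for illustration only).
def toy_router(prompt: str) -> str:
    return "subject, lighting, mood"  # "mood" is not a known field and is dropped


def toy_rewriter(prompt: str) -> str:
    return f"<{prompt}>"


def toy_composer(prompt: str) -> str:
    return prompt


pipeline = MapePipeline(toy_router, toy_rewriter, toy_composer)
print(pipeline.enhance("a cat on a sofa"))
```

In a real system each stand-in would be a call to the post-trained SLM with a role-specific prompt; the sketch only shows how routing restricts which rewriters run before composition.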

GRPO + GDPO

Reward signals are computed from downstream images, not reference rewrites. GRPO handles scalar rewards; GDPO decouples normalization across reward dimensions before aggregation for stable multi-reward training.
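
To make the distinction concrete, the sketch below contrasts the two orderings of normalization and aggregation described above. It is a simplified illustration, not the authors' training code: the function names, the equal-weight sum over reward dimensions, the epsilon, and the toy reward values are assumptions.

```python
from statistics import mean, pstdev

EPS = 1e-6  # assumed stabilizer to avoid division by zero


def standardize(values):
    """Zero-mean, unit-variance normalization within one group of samples."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def grpo_advantages(rewards):
    """Sum the reward dimensions per sample, then normalize the scalar totals."""
    totals = [sum(sample) for sample in rewards]
    return standardize(totals)


def gdpo_advantages(rewards):
    """Normalize each reward dimension separately, then sum per sample."""
    per_dim = [standardize(list(dim)) for dim in zip(*rewards)]
    return [sum(sample) for sample in zip(*per_dim)]


# Toy group of 4 rewrites scored on two rewards with very different scales
# (illustrative numbers only, not measurements from the paper).
group = [
    (23.0, 0.20),
    (22.0, 0.30),
    (23.0, 0.30),
    (22.0, 0.20),
]

print(grpo_advantages(group))
print(gdpo_advantages(group))
```

In this toy group the first and second rewrites each win on exactly one reward. Summing before normalizing lets the reward with the larger numeric range dominate, so the first rewrite gets a clearly positive advantage and the second a clearly negative one. Normalizing per dimension first gives the two rewrites nearly equal advantages.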

Image Generation

SAPE

We post-train Qwen3-0.6B / 1.7B as SAPE on Pick-a-Pic with GRPO and evaluate on DrawBench. The downstream image model (Z-Image-turbo) is kept frozen throughout. Below we visualize image pairs generated before (left) vs. after (right) GRPO post-training of the SAPE prompt enhancer.

DrawBench

GRPO training on Pick-a-Pic uses PickScore + CLIPScore + HPSv2.1 as rewards. Higher is better on every metric. Each one-shot base model is listed directly above its SAPE counterpart.

Prompt Enhancer	PickScore	CLIPScore	HPSv2.1	Aesthetic	ImgRwd	UniRwd
None	23.05	0.2831	0.2998	5.367	1.062	3.416
Qwen3-0.6B (one-shot)	22.23	0.2524	0.2801	5.623	0.604	3.068
SAPE (Qwen3-0.6B)	23.18	0.2814	0.3128	5.481	1.192	3.472
Qwen3-1.7B (one-shot)	22.91	0.2718	0.3015	5.564	0.930	3.374
SAPE (Qwen3-1.7B)	23.05	0.2768	0.3051	5.495	1.043	3.480

MAPE

For harder, compositional prompts, single-pass rewriting is insufficient. MAPE decomposes enhancement into a router that selects relevant semantic fields, specialized rewriters that refine each selected field, and a composer that assembles the final natural-language prompt. We instantiate the language-side enhancer with Qwen3-1.7B and Qwen3-4B and evaluate on UniGenBench across multiple image generators (Qwen-Image-2512, Z-Image-turbo, FLUX.2-klein-4B/9B). Below, each card compares a one-shot Qwen3-4B prompt (left) against our trained MAPE (Qwen3-4B, right) on UniGenBench prompts.

UniGenBench

MAPE compared with the un-enhanced baseline, one-shot SLMs, and the strong closed-source baseline Gemini-3.1-Pro with our router–rewriter–composer prompting (MSP).

T2I Model	Prompt Enhancer	UniGen Short	UniGen Long
Qwen-Image-2512	None	0.7493	0.8869
	Gemini-3.1-Pro (MSP)	0.8669	0.8758
	MAPE (Qwen3-1.7B)	0.8334	0.8624
	MAPE (Qwen3-4B)	0.8539	0.8923
Z-Image-turbo	None	0.6931	0.8170
	Gemini-3.1-Pro (MSP)	0.7501	0.7674
	MAPE (Qwen3-1.7B)	0.7716	0.8405
	MAPE (Qwen3-4B)	0.8356	0.8512
FLUX.2-klein-4B	None	0.7489	0.8464
	Gemini-3.1-Pro (MSP)	0.8263	0.8392
	MAPE (Qwen3-1.7B)	0.7710	0.8275
	MAPE (Qwen3-4B)	0.8042	0.8539
FLUX.2-klein-9B	None	0.8058	0.8792
	Gemini-3.1-Pro (MSP)	0.8553	0.8704
	MAPE (Qwen3-1.7B)	0.8268	0.8578
	MAPE (Qwen3-4B)	0.8460	0.8641

Image Editing

MAPE

MAPE generalizes beyond generation. For editing, the router may choose not to rewrite short, local instructions, while triggering multi-agent enhancement for tasks such as Extract, Style, or complex composition that benefit from richer grounding and explicit preservation constraints. Each example below is a triplet: source image, the result from a one-shot Qwen3-VL-4B prompt, and the result from our trained MAPE.

ImgEdit

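Image-editing performance across editing categories. MAPE improves the Overall score over both the un-enhanced and the one-shot Qwen3-VL-4B baselines on all three editing models, with the largest gains over one-shot on Extract. Per-category results are mixed: Style, Adjust, and Compose fall slightly below the one-shot baseline on some models.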

I2I Model	Prompt Enhancer	Overall	Extract	Style	Adjust	Action	Compose
FLUX.2-klein-4B	None	3.85	1.94	4.91	4.24	4.69	2.71
	Qwen3-VL-4B (one-shot)	3.87	1.91	4.67	4.17	4.41	2.91
	MAPE (Qwen3-VL-4B)	4.15	3.58	4.81	4.25	4.57	2.77
FLUX.2-klein-9B	None	4.07	2.23	4.94	4.22	4.32	3.14
	Qwen3-VL-4B (one-shot)	4.03	1.98	4.88	4.27	4.41	2.87
	MAPE (Qwen3-VL-4B)	4.32	4.01	4.84	4.21	4.68	3.00
Qwen-Image-Edit	None	3.98	4.04	4.56	3.66	3.90	2.64
	Qwen3-VL-4B (one-shot)	3.93	3.50	4.65	3.85	3.84	2.71
	MAPE (Qwen3-VL-4B)	4.01	3.83	4.58	3.80	4.14	2.64
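
As a quick arithmetic check on the Overall column, the snippet below computes MAPE's gain over the two baselines for each editing model. The scores are copied from the table above; the `gains` helper and the dictionary layout are only illustrative.

```python
# Overall ImgEdit scores copied from the table above:
# (no enhancer, one-shot Qwen3-VL-4B, MAPE).
overall = {
    "FLUX.2-klein-4B": (3.85, 3.87, 4.15),
    "FLUX.2-klein-9B": (4.07, 4.03, 4.32),
    "Qwen-Image-Edit": (3.98, 3.93, 4.01),
}


def gains(scores):
    """Return MAPE's Overall gain over (no enhancer, one-shot), rounded to 2 places."""
    none, one_shot, mape = scores
    return round(mape - none, 2), round(mape - one_shot, 2)


for model, scores in overall.items():
    vs_none, vs_one_shot = gains(scores)
    print(f"{model}: {vs_none:+.2f} vs. none, {vs_one_shot:+.2f} vs. one-shot")
```

The Overall gain over the one-shot baseline is 0.28, 0.29, and 0.08 for the three models, and the gain over no enhancer is 0.30, 0.25, and 0.03.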

Citation

@article{ape2026,
    title={{APE}: Agentic Prompt Enhancer for Image Generation and Editing},
    author={Huang, Zijian and Wu, Jay Zhangjie and Wang, Zian and Cao, Tianshi and Chen, Jiasi and Fidler, Sanja and Ling, Huan and Ren, Xuanchi},
    journal={arXiv preprint},
    year={2026}
}
