APE: Agentic Prompt Enhancer for Image Generation and Editing
TL;DR: APE post-trains small language models (SLMs) as prompt-enhancement agents for image generation and editing. It supports a single-pass rewriter (SAPE) and a structured router–rewriter–composer multi-agent pipeline (MAPE), narrowing the gap to closed-source prompt enhancers without modifying the downstream visual model.
Abstract
Natural language has become a powerful interface for image generation and editing, yet text-guided visual systems remain highly sensitive to prompt formulation. Semantically similar requests can produce different outputs depending on wording, specificity, and how explicitly visual constraints are stated, motivating prompt enhancement as a trainable component rather than a peripheral user choice. Existing strong enhancers often rely on large, proprietary LLMs such as ChatGPT or Gemini, adding cost, latency, and deployment dependence to the visual generation pipeline. We propose Agentic Prompt Enhancer (APE), a lightweight framework that post-trains small language models (SLMs) as prompt-enhancement agents. APE supports both single-agent rewriting and role-specialized multi-agent enhancement. Its single-agent instantiation, SAPE, rewrites the prompt in one pass, while its multi-agent instantiation, MAPE, decomposes enhancement into a router–rewriter–composer process for handling compositional constraints over objects, attributes, spatial relations, and edits. With task-aware rewards and post-training protocols, APE improves visual alignment and prompt following without modifying the downstream visual model. Experiments on challenging image generation and editing benchmarks demonstrate that post-trained small prompt enhancers reliably outperform their base counterparts, narrowing the gap to closed-source prompt enhancers; in addition, MAPE proves particularly strong on complex compositional tasks within these benchmarks.
Method
APE rewrites a user instruction (optionally conditioned on a source image) into a more effective prompt. SAPE performs a single-pass rewrite; MAPE decomposes the rewrite into a router–rewriter–composer pipeline over semantic fields. The image model stays frozen; only APE is post-trained, using GRPO or GDPO.
SAPE
A single small LLM rewrites the prompt in one pass. RL on this lightweight enhancer alone, with a frozen image model, already yields substantial gains in visual alignment.
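The data flow behind this setup can be sketched as follows. This is a minimal illustration, not the authors' training code: `collect_group`, `enhancer_sample`, `generate_image`, and `score_image` are hypothetical names, the toy stand-ins replace the real SLM, image model, and reward model, and scoring the image against the original user prompt (rather than the rewrite) is an assumption.

```python
import itertools
from typing import Callable, List, Tuple


def collect_group(
    prompt: str,
    enhancer_sample: Callable[[str], str],        # trainable SLM: prompt -> rewrite
    generate_image: Callable[[str], object],      # frozen image model: text -> image
    score_image: Callable[[str, object], float],  # reward on (user prompt, image)
    group_size: int = 4,
) -> List[Tuple[str, float]]:
    """Sample several rewrites of one prompt and reward each via its image."""
    group = []
    for _ in range(group_size):
        rewrite = enhancer_sample(prompt)    # only this model receives gradients
        image = generate_image(rewrite)      # the image model is never updated
        reward = score_image(prompt, image)  # assumption: score against the user prompt
        group.append((rewrite, reward))
    return group


# Toy stand-ins so the sketch runs without any model (all hypothetical).
_counter = itertools.count(1)


def toy_enhancer(prompt: str) -> str:
    return f"{prompt}, detail {next(_counter)}"


def toy_generator(text: str) -> dict:
    return {"rendered_from": text}  # placeholder for an image


def toy_scorer(prompt: str, image: dict) -> float:
    # Placeholder reward: number of characters the rewrite added.
    return float(len(image["rendered_from"]) - len(prompt))


group = collect_group("a cat", toy_enhancer, toy_generator, toy_scorer, group_size=3)
```

Each `(rewrite, reward)` pair in `group` would then feed a group-relative policy update on the enhancer.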
MAPE
MAPE is a router–rewriter–composer pipeline over predefined semantic fields (subject, appearance, background, composition, lighting, style, edit operation, locality …). This structure provides a stronger inductive bias for compositional and constraint-heavy prompts.
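One way to picture the router–rewriter–composer control flow is the sketch below. It is an illustration under stated assumptions, not the released implementation: the class `MAPEPipeline`, the callable signatures, and the toy agents are hypothetical, and in the real system each role would be played by a post-trained SLM rather than a string function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Semantic fields named in the text; the source list is open-ended.
FIELDS = ["subject", "appearance", "background", "composition",
          "lighting", "style", "edit operation", "locality"]


@dataclass
class MAPEPipeline:
    router: Callable[[str], List[str]]            # prompt -> selected fields
    rewriter: Callable[[str, str], str]           # (prompt, field) -> refined field text
    composer: Callable[[str, Dict[str, str]], str]  # (prompt, refined fields) -> final prompt

    def enhance(self, prompt: str) -> str:
        selected = [f for f in self.router(prompt) if f in FIELDS]
        if not selected:
            # The router may decide that no rewriting is needed.
            return prompt
        refined = {f: self.rewriter(prompt, f) for f in selected}
        return self.composer(prompt, refined)


# Toy agents (hypothetical) so the control flow can be run end to end.
def toy_router(prompt: str) -> List[str]:
    return ["subject", "lighting"] if len(prompt.split()) > 3 else []


def toy_rewriter(prompt: str, field: str) -> str:
    return f"[{field}: refined from '{prompt}']"


def toy_composer(prompt: str, refined: Dict[str, str]) -> str:
    return f"{prompt}. " + "; ".join(refined.values())


pipe = MAPEPipeline(toy_router, toy_rewriter, toy_composer)
short_out = pipe.enhance("a cat")                        # router selects nothing
long_out = pipe.enhance("a red cube on a blue sphere")   # two fields are refined
```

Keeping the three roles as separate callables mirrors the role specialization described above and lets each role be swapped or trained independently.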
GRPO + GDPO
Reward signals are computed from downstream images, not reference rewrites. GRPO handles scalar rewards; GDPO decouples normalization across reward dimensions before aggregation for stable multi-reward training.
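The difference between aggregating rewards before normalization and normalizing each reward dimension first can be sketched as below. This is a simplified reading of the description above, not the exact objective used in APE: the function names, the unit weights, and the sample reward values are illustrative assumptions, and any further normalization or clipping steps are omitted.

```python
from statistics import mean, pstdev
from typing import List, Optional

EPS = 1e-8  # avoids division by zero when all rewards in a group are equal


def normalize(values: List[float]) -> List[float]:
    """Standardize values within one group of samples."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def grpo_advantages(rewards: List[List[float]]) -> List[float]:
    """Aggregate first: sum reward dimensions into a scalar, then normalize."""
    return normalize([sum(r) for r in rewards])


def gdpo_advantages(rewards: List[List[float]],
                    weights: Optional[List[float]] = None) -> List[float]:
    """Decouple first: normalize each reward dimension, then aggregate."""
    n_dims = len(rewards[0])
    weights = weights or [1.0] * n_dims
    per_dim = [normalize([r[k] for r in rewards]) for k in range(n_dims)]
    return [sum(w * per_dim[k][i] for k, w in enumerate(weights))
            for i in range(len(rewards))]


# Made-up group of 4 samples with two rewards on very different scales
# (think of a PickScore-like value near 22 and a CLIPScore-like value near 0.25).
rewards = [[22.0, 0.30], [23.0, 0.20], [22.5, 0.25], [22.5, 0.25]]
summed = grpo_advantages(rewards)
decoupled = gdpo_advantages(rewards)
```

In this toy group the large-scale reward dominates the summed advantages, so sample 1 ranks above sample 0 even though sample 0 is better on the second reward. With per-dimension normalization the two samples come out roughly equal, which is the kind of scale imbalance that decoupled normalization is meant to remove.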
Image Generation
SAPE
We post-train Qwen3-0.6B / 1.7B as SAPE on Pick-a-Pic with GRPO and evaluate on DrawBench. The downstream image model (Z-Image-turbo) is kept frozen throughout. Below we visualize image pairs generated before (left) vs. after (right) GRPO post-training of the SAPE prompt enhancer.
DrawBench
GRPO uses PickScore + CLIPScore + HPSv2.1 during training on Pick-a-Pic. Higher is better. Bold indicates the better of each base/SAPE pair.
| Prompt Enhancer | PickScore | CLIPScore | HPSv2.1 | Aesthetic | ImgRwd | UniRwd |
|---|---|---|---|---|---|---|
| None | 23.05 | 0.2831 | 0.2998 | 5.367 | 1.062 | 3.416 |
| Qwen3-0.6B (one-shot) | 22.23 | 0.2524 | 0.2801 | **5.623** | 0.604 | 3.068 |
| SAPE (Qwen3-0.6B) | **23.18** | **0.2814** | **0.3128** | 5.481 | **1.192** | **3.472** |
| Qwen3-1.7B (one-shot) | 22.91 | 0.2718 | 0.3015 | **5.564** | 0.930 | 3.374 |
| SAPE (Qwen3-1.7B) | **23.05** | **0.2768** | **0.3051** | 5.495 | **1.043** | **3.480** |
MAPE
For harder, compositional prompts, single-pass rewriting is insufficient. MAPE decomposes enhancement into a router that selects relevant semantic fields, specialized rewriters that refine each selected field, and a composer that assembles the final natural-language prompt. We instantiate the language-side enhancer with Qwen3-1.7B and Qwen3-4B and evaluate on UniGenBench across multiple image generators (Qwen-Image-2512, Z-Image-turbo, FLUX.2-klein-4B/9B). Below, each card compares a one-shot Qwen3-4B prompt (left) against our trained MAPE (Qwen3-4B, right) on UniGenBench prompts.
UniGenBench
MAPE compared with the un-enhanced baseline, one-shot SLMs, and the strong closed-source baseline Gemini-3.1-Pro with our router–rewriter–composer prompting (MSP).
| T2I Model | Prompt Enhancer | UniGen Short | UniGen Long |
|---|---|---|---|
| Qwen-Image-2512 | None | 0.7493 | 0.8869 |
| Qwen-Image-2512 | Gemini-3.1-Pro (MSP) | 0.8669 | 0.8758 |
| Qwen-Image-2512 | MAPE (Qwen3-1.7B) | 0.8334 | 0.8624 |
| Qwen-Image-2512 | MAPE (Qwen3-4B) | 0.8539 | 0.8923 |
| Z-Image-turbo | None | 0.6931 | 0.8170 |
| Z-Image-turbo | Gemini-3.1-Pro (MSP) | 0.7501 | 0.7674 |
| Z-Image-turbo | MAPE (Qwen3-1.7B) | 0.7716 | 0.8405 |
| Z-Image-turbo | MAPE (Qwen3-4B) | 0.8356 | 0.8512 |
| FLUX.2-klein-4B | None | 0.7489 | 0.8464 |
| FLUX.2-klein-4B | Gemini-3.1-Pro (MSP) | 0.8263 | 0.8392 |
| FLUX.2-klein-4B | MAPE (Qwen3-1.7B) | 0.7710 | 0.8275 |
| FLUX.2-klein-4B | MAPE (Qwen3-4B) | 0.8042 | 0.8539 |
| FLUX.2-klein-9B | None | 0.8058 | 0.8792 |
| FLUX.2-klein-9B | Gemini-3.1-Pro (MSP) | 0.8553 | 0.8704 |
| FLUX.2-klein-9B | MAPE (Qwen3-1.7B) | 0.8268 | 0.8578 |
| FLUX.2-klein-9B | MAPE (Qwen3-4B) | 0.8460 | 0.8641 |
Image Editing
MAPE
MAPE generalizes beyond generation. For editing, the router may choose not to rewrite short, local instructions, while triggering multi-agent enhancement for tasks such as Extract, Style, or complex composition that benefit from richer grounding and explicit preservation constraints. Each example is shown as a triplet: the source image, the result from a one-shot Qwen3-VL-4B prompt, and the result from our trained MAPE.
ImgEdit
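Image-editing performance across editing categories. MAPE improves the Overall score over the one-shot Qwen3-VL-4B baseline on all three editing models, with the largest gains on Extract; results on individual categories such as Compose are mixed.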
| I2I Model | Prompt Enhancer | Overall | Extract | Style | Adjust | Action | Compose |
|---|---|---|---|---|---|---|---|
| FLUX.2-klein-4B | None | 3.85 | 1.94 | 4.91 | 4.24 | 4.69 | 2.71 |
| FLUX.2-klein-4B | Qwen3-VL-4B (one-shot) | 3.87 | 1.91 | 4.67 | 4.17 | 4.41 | 2.91 |
| FLUX.2-klein-4B | MAPE (Qwen3-VL-4B) | 4.15 | 3.58 | 4.81 | 4.25 | 4.57 | 2.77 |
| FLUX.2-klein-9B | None | 4.07 | 2.23 | 4.94 | 4.22 | 4.32 | 3.14 |
| FLUX.2-klein-9B | Qwen3-VL-4B (one-shot) | 4.03 | 1.98 | 4.88 | 4.27 | 4.41 | 2.87 |
| FLUX.2-klein-9B | MAPE (Qwen3-VL-4B) | 4.32 | 4.01 | 4.84 | 4.21 | 4.68 | 3.00 |
| Qwen-Image-Edit | None | 3.98 | 4.04 | 4.56 | 3.66 | 3.90 | 2.64 |
| Qwen-Image-Edit | Qwen3-VL-4B (one-shot) | 3.93 | 3.50 | 4.65 | 3.85 | 3.84 | 2.71 |
| Qwen-Image-Edit | MAPE (Qwen3-VL-4B) | 4.01 | 3.83 | 4.58 | 3.80 | 4.14 | 2.64 |
Citation
@article{ape2026,
title={{APE}: Agentic Prompt Enhancer for Image Generation and Editing},
author={Huang, Zijian and Wu, Jay Zhangjie and Wang, Zian and Cao, Tianshi and Chen, Jiasi and Fidler, Sanja and Ling, Huan and Ren, Xuanchi},
journal={arXiv preprint},
year={2026}
}