APE: Agentic Prompt Enhancer for Image Generation and Editing

1 NVIDIA     2 University of Michigan

TL;DR: APE post-trains small language models (SLMs) as prompt-enhancement agents for image generation and editing. It supports a single-pass rewriter (SAPE) and a structured router–rewriter–composer multi-agent pipeline (MAPE), narrowing the gap to closed-source prompt enhancers without modifying the downstream visual model.

Prompt Matters


Text-guided image generators are highly sensitive to prompt formulation. Short, ambiguous prompts routinely produce wrong object counts, broken spatial relations, or out-of-distribution attributes. APE rewrites the user instruction into a richer, more visually realizable prompt. Each card shows the image from the original prompt (left) vs. the enhanced prompt (right).

Image model: Z-Image-turbo  ·  Prompt enhancer: Qwen3-4B after GRPO.

Abstract


Natural language has become a powerful interface for image generation and editing, yet text-guided visual systems remain highly sensitive to prompt formulation. Semantically similar requests can produce different outputs depending on wording, specificity, and how explicitly visual constraints are stated, motivating prompt enhancement as a trainable component rather than a peripheral user choice. Existing strong enhancers often rely on large, proprietary LLMs such as ChatGPT or Gemini, adding cost, latency, and deployment dependence to the visual generation pipeline. We propose Agentic Prompt Enhancer (APE), a lightweight framework that post-trains small language models (SLMs) as prompt-enhancement agents. APE supports both single-agent rewriting and role-specialized multi-agent enhancement. Its single-agent instantiation, SAPE, rewrites the prompt in one pass, while its multi-agent instantiation, MAPE, decomposes enhancement into a router–rewriter–composer process for handling compositional constraints over objects, attributes, spatial relations, and edits. With task-aware rewards and post-training protocols, APE improves visual alignment and prompt following without modifying the downstream visual model. Experiments on challenging image generation and editing benchmarks demonstrate that post-trained small prompt enhancers reliably outperform their base counterparts, narrowing the gap to closed-source prompt enhancers; in addition, MAPE proves particularly strong on complex compositional tasks within these benchmarks.

Method


APE rewrites a user instruction (and an optional source image) into a more effective prompt. SAPE performs the rewrite in a single pass; MAPE decomposes the rewrite into a router–rewriter–composer pipeline over semantic fields. The image model stays frozen; only the APE enhancer is post-trained, with GRPO or GDPO.

SAPE

A single small LLM rewrites the prompt in one pass. RL on this lightweight enhancer alone, with a frozen image model, already yields substantial gains in visual alignment.
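The training signal for this setup can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Enhancer`, `ImageModel`, and `Scorer` interfaces and the function `image_reward` are hypothetical names, and both the equal-weight sum of scorers and the choice to score the image against the original user prompt are assumptions.

```python
from typing import Callable, Sequence

# Hypothetical interfaces; none of these names come from a real library.
Enhancer = Callable[[str], str]          # user prompt -> enhanced prompt
ImageModel = Callable[[str], object]     # prompt -> image (frozen, never updated)
Scorer = Callable[[str, object], float]  # (user prompt, image) -> score


def image_reward(
    user_prompt: str,
    enhancer: Enhancer,
    image_model: ImageModel,
    scorers: Sequence[Scorer],
) -> float:
    """Score one rewrite by the image it produces, not by the rewrite text."""
    enhanced = enhancer(user_prompt)
    image = image_model(enhanced)  # the frozen model only runs forward
    # Assumption: each scorer compares the image with the ORIGINAL prompt,
    # and scores are summed with equal weight.
    return sum(score(user_prompt, image) for score in scorers)
```

Only the enhancer receives updates from this reward; the image model and the scorers act as a fixed environment.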

MAPE

A router–rewriter–composer pipeline over predefined semantic fields (subject, appearance, background, composition, lighting, style, edit operation, locality …). This structure gives a stronger inductive bias for compositional and constraint-heavy prompts.
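The control flow of such a pipeline might look like the sketch below. It is illustrative only: the class `MAPEPipeline`, the text-in/text-out `LLM` interface, the comma-separated router output, and the prompt templates are all assumptions, and the field tuple covers only the names listed above (the original list is elided).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Field names from the list above; the source list is elided, so this is not exhaustive.
SEMANTIC_FIELDS = (
    "subject", "appearance", "background", "composition",
    "lighting", "style", "edit operation", "locality",
)

# Hypothetical interface: each agent is a text-in/text-out callable wrapping an SLM.
LLM = Callable[[str], str]


@dataclass
class MAPEPipeline:
    router: LLM    # assumed to return a comma-separated list of field names
    rewriter: LLM  # refines one selected field at a time
    composer: LLM  # merges the field rewrites into one natural-language prompt

    def route(self, prompt: str) -> List[str]:
        raw = self.router(f"Fields needed for: {prompt}")
        picked = [name.strip() for name in raw.split(",")]
        # Keep only known fields so malformed router output cannot add new ones.
        return [name for name in picked if name in SEMANTIC_FIELDS]

    def enhance(self, prompt: str) -> str:
        fields = self.route(prompt)
        if not fields:
            return prompt  # router selected nothing: pass the prompt through
        rewrites: Dict[str, str] = {
            name: self.rewriter(f"[{name}] {prompt}") for name in fields
        }
        notes = "\n".join(f"{name}: {text}" for name, text in rewrites.items())
        return self.composer(f"Original: {prompt}\n{notes}")
```

Because each agent is just a callable, the three roles can be stubbed out for testing or backed by the same SLM with different instructions.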

GRPO + GDPO

Reward signals are computed from downstream images, not reference rewrites. GRPO handles scalar rewards; GDPO decouples normalization across reward dimensions before aggregation for stable multi-reward training.
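The difference between the two normalization schemes can be illustrated with a small sketch. It assumes one group of sampled rewrites per prompt, an unweighted sum as the aggregation, and a small `EPS` for numerical safety; the function names are ours, and the actual GDPO recipe may include further steps not described here.

```python
from statistics import mean, pstdev
from typing import List, Sequence

EPS = 1e-6  # avoids division by zero when all rewards in a group are equal


def normalize(xs: Sequence[float]) -> List[float]:
    """Zero-mean, unit-variance normalization within one sampling group."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / (sigma + EPS) for x in xs]


def grpo_advantages(rewards: Sequence[Sequence[float]]) -> List[float]:
    """GRPO-style: collapse reward dimensions to a scalar, then normalize once."""
    return normalize([sum(r) for r in rewards])


def gdpo_advantages(rewards: Sequence[Sequence[float]]) -> List[float]:
    """Decoupled: normalize each reward dimension separately, then sum."""
    per_dim = [normalize(col) for col in zip(*rewards)]
    return [sum(vals) for vals in zip(*per_dim)]


# Three rewrites, two reward dimensions on very different scales.
group = [(23.0, 0.10), (22.0, 0.30), (22.5, 0.20)]
print(grpo_advantages(group))  # ordering follows the large-scale dimension
print(gdpo_advantages(group))  # both dimensions contribute equally
```

In this toy group the two dimensions rank the rewrites in opposite order. Summing first lets the large-scale reward decide the advantage, while normalizing per dimension first makes the two cancel out.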

Image Generation


SAPE

We post-train Qwen3-0.6B / 1.7B as SAPE on Pick-a-Pic with GRPO and evaluate on DrawBench. The downstream image model (Z-Image-turbo) is kept frozen throughout. Below we visualize image pairs generated before (left) vs. after (right) GRPO post-training of the SAPE prompt enhancer.

DrawBench

GRPO uses PickScore + CLIPScore + HPSv2.1 during training on Pick-a-Pic. Higher is better. Bold indicates the better of each base/SAPE pair.

Prompt Enhancer        PickScore  CLIPScore  HPSv2.1  Aesthetic  ImgRwd  UniRwd
None                   23.05      0.2831     0.2998   5.367      1.062   3.416
Qwen3-0.6B (one-shot)  22.23      0.2524     0.2801   5.623      0.604   3.068
SAPE (Qwen3-0.6B)      23.18      0.2814     0.3128   5.481      1.192   3.472
Qwen3-1.7B (one-shot)  22.91      0.2718     0.3015   5.564      0.930   3.374
SAPE (Qwen3-1.7B)      23.05      0.2768     0.3051   5.495      1.043   3.480

MAPE

For harder, compositional prompts, single-pass rewriting is insufficient. MAPE decomposes enhancement into a router that selects relevant semantic fields, specialized rewriters that refine each selected field, and a composer that assembles the final natural-language prompt. We instantiate the language-side enhancer with Qwen3-1.7B and Qwen3-4B and evaluate on UniGenBench across multiple image generators (Qwen-Image-2512, Z-Image-turbo, FLUX.2-klein-4B/9B). Below, each card compares a one-shot Qwen3-4B prompt (left) against our trained MAPE (Qwen3-4B, right) on UniGenBench prompts.

UniGenBench

MAPE compared with the un-enhanced baseline, one-shot SLMs, and the strong closed-source baseline Gemini-3.1-Pro with our router–rewriter–composer prompting (MSP).

T2I Model        Prompt Enhancer       UniGen Short  UniGen Long
Qwen-Image-2512  None                  0.7493        0.8869
Qwen-Image-2512  Gemini-3.1-Pro (MSP)  0.8669        0.8758
Qwen-Image-2512  MAPE (Qwen3-1.7B)     0.8334        0.8624
Qwen-Image-2512  MAPE (Qwen3-4B)       0.8539        0.8923
Z-Image-turbo    None                  0.6931        0.8170
Z-Image-turbo    Gemini-3.1-Pro (MSP)  0.7501        0.7674
Z-Image-turbo    MAPE (Qwen3-1.7B)     0.7716        0.8405
Z-Image-turbo    MAPE (Qwen3-4B)       0.8356        0.8512
FLUX.2-klein-4B  None                  0.7489        0.8464
FLUX.2-klein-4B  Gemini-3.1-Pro (MSP)  0.8263        0.8392
FLUX.2-klein-4B  MAPE (Qwen3-1.7B)     0.7710        0.8275
FLUX.2-klein-4B  MAPE (Qwen3-4B)       0.8042        0.8539
FLUX.2-klein-9B  None                  0.8058        0.8792
FLUX.2-klein-9B  Gemini-3.1-Pro (MSP)  0.8553        0.8704
FLUX.2-klein-9B  MAPE (Qwen3-1.7B)     0.8268        0.8578
FLUX.2-klein-9B  MAPE (Qwen3-4B)       0.8460        0.8641

Image Editing


MAPE

MAPE generalizes beyond generation. For editing, the router may leave short, local instructions unrewritten, while triggering multi-agent enhancement for tasks such as Extract, Style, or complex composition that benefit from richer grounding and explicit preservation constraints. Each example is a triplet: the source image, the result from a one-shot Qwen3-VL-4B prompt, and the result from our trained MAPE.

ImgEdit

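Image-editing performance across editing categories. MAPE improves the Overall score over the one-shot Qwen3-VL-4B baseline on all three editing models, with the largest gains on Extract; individual categories such as Style and Compose do not always improve.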

I2I Model        Prompt Enhancer         Overall  Extract  Style  Adjust  Action  Compose
FLUX.2-klein-4B  None                    3.85     1.94     4.91   4.24    4.69    2.71
FLUX.2-klein-4B  Qwen3-VL-4B (one-shot)  3.87     1.91     4.67   4.17    4.41    2.91
FLUX.2-klein-4B  MAPE (Qwen3-VL-4B)      4.15     3.58     4.81   4.25    4.57    2.77
FLUX.2-klein-9B  None                    4.07     2.23     4.94   4.22    4.32    3.14
FLUX.2-klein-9B  Qwen3-VL-4B (one-shot)  4.03     1.98     4.88   4.27    4.41    2.87
FLUX.2-klein-9B  MAPE (Qwen3-VL-4B)      4.32     4.01     4.84   4.21    4.68    3.00
Qwen-Image-Edit  None                    3.98     4.04     4.56   3.66    3.90    2.64
Qwen-Image-Edit  Qwen3-VL-4B (one-shot)  3.93     3.50     4.65   3.85    3.84    2.71
Qwen-Image-Edit  MAPE (Qwen3-VL-4B)      4.01     3.83     4.58   3.80    4.14    2.64
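
The per-category differences between MAPE and the one-shot baseline can be read off the table with a few lines of Python. The scores are copied from the rows above; the names `SCORES` and `gains` are ours and purely illustrative.

```python
CATEGORIES = ["Overall", "Extract", "Style", "Adjust", "Action", "Compose"]

# (one-shot Qwen3-VL-4B, MAPE) score rows per editing model, from the ImgEdit table.
SCORES = {
    "FLUX.2-klein-4B": ([3.87, 1.91, 4.67, 4.17, 4.41, 2.91],
                        [4.15, 3.58, 4.81, 4.25, 4.57, 2.77]),
    "FLUX.2-klein-9B": ([4.03, 1.98, 4.88, 4.27, 4.41, 2.87],
                        [4.32, 4.01, 4.84, 4.21, 4.68, 3.00]),
    "Qwen-Image-Edit": ([3.93, 3.50, 4.65, 3.85, 3.84, 2.71],
                        [4.01, 3.83, 4.58, 3.80, 4.14, 2.64]),
}


def gains(model: str) -> dict:
    """MAPE score minus one-shot score for each category, rounded to 2 decimals."""
    one_shot, mape = SCORES[model]
    return {c: round(m - o, 2) for c, o, m in zip(CATEGORIES, one_shot, mape)}


for model in SCORES:
    print(model, gains(model))
```

The Overall difference is positive for all three models and Extract shows the largest gain in each case, while Style, Adjust, and Compose move in either direction depending on the model.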

Citation

@article{ape2026,
    title={{APE}: Agentic Prompt Enhancer for Image Generation and Editing},
    author={Huang, Zijian and Wu, Jay Zhangjie and Wang, Zian and Cao, Tianshi and Chen, Jiasi and Fidler, Sanja and Ling, Huan and Ren, Xuanchi},
    journal={arXiv preprint},
    year={2026}
}