Bootstrap Your Generator:
Unpaired Visual Editing with Flow Matching

¹NVIDIA, ²Tel Aviv University
*Equal contribution
ICML 2026
TL;DR We propose ByG (pronounced “Big”), a framework for unpaired image and video editing using only the base model’s internal knowledge — no paired data, no external reward models.
[Figure: method overview diagram with sample edits]
Pineapple on grass → Pineapple in desert: “Change the setting to a desert”
Tennis player → Tennis player with ball: “Add a tennis ball”
“Make the video photorealistic”
“Change the stripes to deep violet”

Bootstrap Your Generator. Left: Supervised training requires paired source–target samples to provide explicit editing supervision. External model guidance uses a frozen external model to provide semantic feedback. Our intrinsic signal enables training using only the generator itself, removing the need for paired data or external supervision. Right & below: Sample image and video editing results produced by our unpaired approach.

Method overview.

Top: Supervised training for image editing. Given a source image x, target image y, and editing instruction c, the target is noised and fed to the network along with x and c.
Bottom: We finetune a pretrained text-to-image model into an editing model without paired supervision. Given source x and instruction c, a frozen EMA copy generates a noisy pseudo-target via multi-step sampling. The trainable model then predicts the edit, supervised by: (1) a prior loss aligning the edit direction with the base T2I model, and (2) a cycle loss reconstructing x from the edit using the reverse instruction.
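
Two generic building blocks named in the caption, noising a target for flow-matching training and keeping an EMA copy of the weights, can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the common linear (rectified-flow) interpolation between data and Gaussian noise, which the page does not specify, and the names `noise_target`, `ema_update`, and the `decay` value are hypothetical.

```python
import random


def noise_target(y, t, rng):
    """Linearly interpolate between a clean target y (t=0) and Gaussian noise (t=1).

    Returns the noisy sample y_t and the velocity a network would regress.
    The linear path is an assumption; the source does not state the schedule.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in y]
    y_t = [(1.0 - t) * yi + t * ei for yi, ei in zip(y, eps)]
    velocity = [ei - yi for yi, ei in zip(y, eps)]  # d(y_t)/dt along the path
    return y_t, velocity


def ema_update(ema_params, params, decay=0.999):
    """Exponential moving average of weights (the decay value is hypothetical)."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


rng = random.Random(0)
y = [0.5, -1.0, 2.0]  # toy "target image" as a flat list of pixel values
t = 0.3
y_t, v = noise_target(y, t, rng)

# Sanity check: stepping back along the velocity by t recovers the clean target.
recovered = [yt - t * vi for yt, vi in zip(y_t, v)]
```

In the supervised setting the network would receive `y_t` together with the source x and instruction c and be trained to predict `v`. In the unpaired setting described above, the target is not given and is instead produced by the EMA copy.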

Video Editing

Our framework naturally extends to video editing by applying it to a text-to-video model (Wan2.2). Despite using no paired video data, our method significantly outperforms Ditto, a supervised baseline trained on one million video editing pairs, achieving a 70.0% win rate for cartoon targets and 80.5% for photorealistic targets, averaging 75.3% overall. On out-of-distribution 3D-CGI inputs, it wins 85% of comparisons.

[Video comparison. Columns: Source, Ditto (Supervised), Ours (Unpaired)]

Target style          Ours (Unpaired)    Ditto (Supervised)
To photorealistic     80.5%              19.5%
To cartoon            70.0%              30.0%

User study results: our method is preferred in both editing directions.
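
The overall 75.3% figure appears consistent with an unweighted mean of the two per-direction win rates. This is an assumption, since the page does not say how the average was computed or whether both directions had the same number of comparisons. A quick check, using `Decimal` so that the half-way value rounds up rather than to even:

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-direction win rates reported for our method against Ditto.
win_rates = {
    "to_cartoon": Decimal("70.0"),
    "to_photorealistic": Decimal("80.5"),
}

# Unweighted mean (assumed), then round half up to one decimal place.
mean = sum(win_rates.values()) / len(win_rates)
overall = mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
```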

Long-Tail Style Editing

We evaluate on six unusual stylization targets not represented in common editing benchmarks: GTA V, Minecraft, American comic, low-poly 3D, voxel, and Lego. Our method is not trained on any of these styles, yet it outperforms supervised baselines (FLUX-Kontext, Qwen-Image-Edit) trained on millions of paired edits.

General Image Editing

On the GEdit-Bench general editing benchmark, our unpaired method is competitive with FLUX-Kontext across most categories, notably outperforming it on motion changes, human-centric edits, and style changes — categories where paired supervision may be limited.

[Qualitative comparison. Columns: Input, Ours, FLUX-Kontext, FlowEdit]
Example instructions:
“Adjust the image style to a watercolor effect”
“Replace with jade”
“Adjust the background to a garden”

Ablation Study

Each component of our framework is critical. Removing gradient routing or the cycle loss leads to stronger edits but degrades source preservation. Without bootstrapping, edits exhibit artifacts. Without the prior loss, the model collapses to an identity mapping.

Gradient Routing

A key insight: one-step predictions used for cycle consistency are blurry and miss fine details, creating a train-test mismatch. Our gradient routing (adapted from Straight-Through Estimation) conditions the reverse pass on clean multi-step predictions while routing gradients through the one-step estimate, bridging this gap.
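
The routing idea can be illustrated with a toy stop-gradient example. This is a sketch of the general straight-through pattern and not the authors' code: the `Dual` class, `detach`, `route`, and all numeric values are hypothetical, and derivatives are tracked with respect to a single scalar parameter to keep the example dependency-free.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value paired with its derivative w.r.t. one trainable parameter."""
    val: float
    grad: float = 0.0

    def __add__(self, other):
        return Dual(self.val + other.val, self.grad + other.grad)

    def __sub__(self, other):
        return Dual(self.val - other.val, self.grad - other.grad)


def detach(x):
    """Stop-gradient: keep the value, drop the derivative."""
    return Dual(x.val, 0.0)


def route(clean_multi_step, one_step):
    """Forward value of the clean sample, derivative of the one-step estimate."""
    return detach(clean_multi_step) + (one_step - detach(one_step))


theta = 2.0
one_step = Dual(val=0.8 * theta, grad=0.8)  # blurry estimate, differentiable in theta
clean = Dual(val=1.9, grad=0.0)             # multi-step sample, treated as a constant
routed = route(clean, one_step)             # what the reverse (cycle) pass would see
```

Here `routed.val` equals the clean multi-step value while `routed.grad` equals the one-step derivative, so the reverse pass is conditioned on the sharp sample while learning signal still reaches the trainable model. In an autodiff framework such as PyTorch the same pattern is commonly written as `clean.detach() + (one_step - one_step.detach())`.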

Additional Results

[Video comparison. Columns: Source, Ditto (Supervised), Ours (Unpaired)]
Additional general editing results.

BibTeX

@article{tewel2026byg,
  title={Bootstrap Your Generator: Unpaired Visual Editing with Flow Matching},
  author={Tewel, Yoad and Atzmon, Yuval and Chechik, Gal and Wolf, Lior},
  journal={arXiv preprint},
  year={2026}
}