Bootstrap Your Generator: Unpaired Visual Editing with Flow Matching

Tewel, Yoad; Atzmon, Yuval; Chechik, Gal; Wolf, Lior

Yoad Tewel¹,²*, Yuval Atzmon¹*, Gal Chechik¹, Lior Wolf²

¹NVIDIA, ²Tel Aviv University

*Equal contribution

ICML 2026

TL;DR: We propose ByG (pronounced “Big”), a framework for unpaired image and video editing that uses only the base model’s internal knowledge: no paired data and no external reward models.

“Change the setting to a desert”

“Add a tennis ball”

“Make the video photorealistic”

“Change the stripes to deep violet”

Bootstrap Your Generator. Left: Supervised training requires paired source–target samples to provide explicit editing supervision. External model guidance uses a frozen external model to provide semantic feedback. Our intrinsic signal enables training using only the generator itself, removing the need for paired data or external supervision. Right & below: Sample image and video editing results produced by our unpaired approach.

Top: Supervised training for image editing. Given a source image x, target image y, and editing instruction c, the target is noised and fed to the network along with x and c.
Bottom: We finetune a pretrained text-to-image model into an editing model without paired supervision. Given source x and instruction c, a frozen EMA copy generates a noisy pseudo-target via multi-step sampling. The trainable model then predicts the edit, supervised by: (1) a prior loss aligning the edit direction with the base T2I model, and (2) a cycle loss reconstructing x from the edit using the reverse instruction.
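
The two training setups in the figure can be sketched in a few lines of plain Python. This is an illustrative toy, not the paper's implementation: it assumes a linear (rectified-flow) noising path, velocity-predicting models, and an unweighted sum of the two losses. The helper names (`interpolate`, `ema_pseudo_target`, `unpaired_loss`), the model call signatures, and the weight `w_cycle` are all hypothetical.

```python
def interpolate(noise, clean, t):
    """Assumed linear path: z_t = (1 - t) * noise + t * clean."""
    return [(1.0 - t) * n + t * y for n, y in zip(noise, clean)]

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def velocity(clean, noise):
    """Regression target for the assumed linear path: d z_t / dt = clean - noise."""
    return [y - n for y, n in zip(clean, noise)]

def supervised_loss(model, x, y, c, noise, t):
    """Top panel: noise the real target y and feed it to the network with x and c."""
    y_t = interpolate(noise, y, t)
    return mse(model(y_t, t, x, c), velocity(y, noise))

def ema_pseudo_target(ema_model, x, c, noise, t, steps=4):
    """Bottom panel: a frozen EMA copy integrates from pure noise up to time t
    (Euler steps here; the actual sampler is not specified on this page)."""
    z, dt = list(noise), t / steps
    for k in range(steps):
        v = ema_model(z, k * dt, x, c)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z  # noisy pseudo-target at time t

def one_step_estimate(z_t, v, t):
    """Clean-sample estimate implied by a velocity prediction on the linear path."""
    return [zi + (1.0 - t) * vi for zi, vi in zip(z_t, v)]

def unpaired_loss(model, base_model, x, c, c_rev, y_t, noise, t, w_cycle=1.0):
    v_edit = model(y_t, t, x, c)
    # (1) Prior loss (assumed form): match the frozen base T2I model's direction.
    prior = mse(v_edit, base_model(y_t, t, c))
    # (2) Cycle loss (assumed form): from the predicted edit, recover x
    #     using the reverse instruction c_rev.
    y_hat = one_step_estimate(y_t, v_edit, t)
    x_t = interpolate(noise, x, t)
    cycle = mse(model(x_t, t, y_hat, c_rev), velocity(x, noise))
    return prior + w_cycle * cycle
```

In this sketch the cycle term conditions the reverse pass on a one-step estimate of the edit; the Gradient Routing section below explains why that choice alone is not ideal.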

Video Editing

Our framework extends naturally to video editing by applying it to a text-to-video model (Wan2.2). Despite using no paired video data, our method significantly outperforms Ditto, a supervised baseline trained on one million video editing pairs, in a user study: it achieves a 70.0% win rate for cartoon targets and 80.5% for photo-realistic targets, averaging 75.3% overall. On out-of-distribution 3D-CGI inputs, it wins 85% of comparisons.

Video comparison (columns): Source, Ditto (Supervised), Ours (Unpaired)

To photo-realistic: Ours 80.5%, Ditto 19.5%

To cartoon: Ours 70.0%, Ditto 30.0%

User study results: our method is preferred in both editing directions.
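
As a quick check on the reported numbers, the overall figure is consistent with an unweighted mean of the two per-direction win rates. The snippet below assumes both directions received the same number of comparisons (not stated on this page) and uses half-up rounding; the variable names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# Win rates of our method against Ditto, as reported in the user study.
win_rates = {"cartoon": Decimal("70.0"), "photo-realistic": Decimal("80.5")}

# Unweighted mean over the two editing directions (assumes equal sample sizes).
average = sum(win_rates.values()) / len(win_rates)
reported = average.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

# The baseline's share is the complement of each win rate.
ditto_rates = {k: Decimal("100") - v for k, v in win_rates.items()}

print(average, reported, ditto_rates)
```

The exact mean is 75.25%, which rounds to the 75.3% quoted above.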

Long-Tail Style Editing

We evaluate on six unusual stylization targets not represented in common editing benchmarks: GTA V, Minecraft, American comic, low-poly 3D, voxel, and Lego. Our method is not trained on any of these styles, yet it outperforms supervised baselines (FLUX-Kontext, Qwen-Image-Edit) trained on millions of paired edits.

General Image Editing

On the GEdit-Bench general editing benchmark, our unpaired method is competitive with FLUX-Kontext across most categories, notably outperforming it on motion changes, human-centric edits, and style changes — categories where paired supervision may be limited.

Image comparison (columns): Input, Ours, FLUX-Kontext, FlowEdit

“Adjust the image style to a watercolor effect”

“Replace with jade”

“Adjust the background to a garden”

Ablation Study

Each component of our framework is critical. Removing gradient routing or the cycle loss leads to stronger edits but degrades source preservation. Without bootstrapping, edits produce artifacts. Without the prior loss, the model collapses to an identity mapping.

Gradient Routing

A key insight: one-step predictions used for cycle consistency are blurry and miss fine details, creating a train-test mismatch. Our gradient routing (adapted from Straight-Through Estimation) conditions the reverse pass on clean multi-step predictions while routing gradients through the one-step estimate, bridging this gap.
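
The straight-through idea behind gradient routing can be illustrated with a tiny forward-mode autodiff toy. This is a sketch of the generic trick (forward value from one tensor, gradient from another), not the paper's exact formulation. The `Dual` class, `stop_gradient`, and `route` are hypothetical helpers, and the scalar "predictions" stand in for images.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dual:
    """A value plus its derivative with respect to one trainable parameter."""
    val: float
    grad: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.grad + other.grad)

    def __sub__(self, other):
        other = _lift(other)
        return Dual(self.val - other.val, self.grad - other.grad)

    def __mul__(self, other):
        other = _lift(other)
        return Dual(self.val * other.val,
                    self.val * other.grad + self.grad * other.val)

    __radd__ = __add__
    __rmul__ = __mul__

def _lift(x):
    return x if isinstance(x, Dual) else Dual(float(x))

def stop_gradient(d):
    """Keep the value, drop the derivative."""
    return Dual(d.val, 0.0)

def route(one_step, multi_step):
    """Forward value of multi_step, gradient of one_step (straight-through)."""
    return one_step + stop_gradient(multi_step - one_step)

theta = Dual(2.0, 1.0)                          # trainable parameter, d(theta)/d(theta) = 1
one_step = 3.0 * theta                          # blurry but differentiable estimate: 6.0
multi_step = stop_gradient(theta * theta + 3.5) # clean multi-step sample, no gradient: 7.5

routed = route(one_step, multi_step)            # value 7.5, gradient 3.0
diff = routed - 7.0                             # compare against a toy source value
loss = diff * diff                              # value 0.25, gradient 3.0
```

The reverse pass sees the clean value (7.5) that it would also see at test time, while the loss gradient still reaches `theta` through the one-step estimate. In an autodiff framework the same pattern is typically written as `one_step + (multi_step - one_step).detach()` in PyTorch, or with `jax.lax.stop_gradient` in JAX.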

BibTeX

@misc{tewel2026byg,
  title={Bootstrap Your Generator: Unpaired Visual Editing with Flow Matching},
  author={Yoad Tewel and Yuval Atzmon and Gal Chechik and Lior Wolf},
  year={2026},
  eprint={2606.03911},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2606.03911}
}