How does it work?
LoRWeB Overview: We first encode a and a', a pair of images that defines a visual transformation (e.g., adding a hat to the man), together with b, the image that should be edited analogously (e.g., adding a hat to the woman), using CLIP and a small learned projection module.
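As a rough sketch of this encoding step, the three CLIP image embeddings could be concatenated and passed through a small MLP. The CLIP checkpoint, the concatenation, and the projection architecture below are illustrative assumptions, not the exact design:

```python
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

class ProjectionModule(nn.Module):
    """Small learned projection on top of the three CLIP embeddings (assumed MLP)."""
    def __init__(self, clip_dim=768, out_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * clip_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, e_a, e_a_prime, e_b):
        # Concatenating the embeddings is an assumption; the paper may combine them differently.
        return self.mlp(torch.cat([e_a, e_a_prime, e_b], dim=-1))

def encode_images(images):
    """Encode a list of PIL images with CLIP's image tower."""
    inputs = processor(images=images, return_tensors="pt")
    return clip.get_image_features(pixel_values=inputs["pixel_values"])  # (len(images), 768)

# Placeholder file names for the analogy triplet.
a, a_prime, b = (Image.open(p) for p in ("a.png", "a_prime.png", "b.png"))
e_a, e_ap, e_b = encode_images([a, a_prime, b])

proj = ProjectionModule()
query = proj(e_a, e_ap, e_b)  # the encoded vector compared against the learned keys
```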
The similarity between this encoded vector and a set of learned keys determines the linear coefficients used to combine a bank of learned LoRAs into a single mixed LoRA, which is then injected into a conditional flow model (e.g., Flux.1-Kontext).
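A minimal sketch of the mixing step, assuming the coefficients come from a softmax over cosine similarities and that mixing means a per-layer weighted sum of the low-rank updates B_i A_i (both are assumptions, as are `lora_bank` and the layer naming):

```python
import torch
import torch.nn.functional as F

def mix_loras(query, keys, lora_bank):
    """
    query:     (d,) encoded vector from the projection module
    keys:      (N, d) learned keys, one per LoRA in the bank
    lora_bank: list of N dicts mapping layer name -> (A, B) low-rank factors
    Returns the coefficients and a dict mapping layer name -> mixed weight update.
    """
    sims = F.cosine_similarity(query.unsqueeze(0), keys, dim=-1)  # (N,)
    coeffs = F.softmax(sims, dim=-1)                              # linear coefficients (assumed softmax)
    mixed = {}
    for name in lora_bank[0]:
        # Weighted sum of the low-rank updates B_i @ A_i across the bank.
        mixed[name] = sum(c * (B @ A) for c, (A, B) in zip(coeffs, (l[name] for l in lora_bank)))
    return coeffs, mixed

# Toy usage with random tensors (shapes are illustrative).
d, N, r = 512, 4, 8
keys = torch.randn(N, d)
lora_bank = [{"attn.to_q": (torch.randn(r, 1024), torch.randn(1024, r))} for _ in range(N)]
coeffs, mixed = mix_loras(torch.randn(d), keys, lora_bank)
# Injection, conceptually: W_layer <- W_layer + scale * mixed[layer_name]
```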
Next, we build a 2×2 composite image from {a, a', b}. The conditional flow model receives this composite as input, along with a guiding edit prompt, and produces a composite image whose bottom-right quadrant contains the edited result b'.
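A sketch of the composite-and-edit step is below. The text only fixes b' to the bottom-right quadrant, so the placement of a, a', and b is an assumption, and `edit_with_flow_model` is a placeholder rather than Flux.1-Kontext's real API:

```python
from PIL import Image

TILE = 512  # quadrant size; an arbitrary choice for illustration

def make_composite(a, a_prime, b):
    """Assumed layout: a top-left, a' top-right, b bottom-left, bottom-right left blank."""
    canvas = Image.new("RGB", (2 * TILE, 2 * TILE), "white")
    canvas.paste(a.resize((TILE, TILE)), (0, 0))
    canvas.paste(a_prime.resize((TILE, TILE)), (TILE, 0))
    canvas.paste(b.resize((TILE, TILE)), (0, TILE))
    return canvas

a, a_prime, b = (Image.open(p) for p in ("a.png", "a_prime.png", "b.png"))
composite = make_composite(a, a_prime, b)

# The mixed LoRA is injected into the flow model's weights before inference.
# `edit_with_flow_model` is a hypothetical call standing in for the conditioned generator:
# edited = edit_with_flow_model(image=composite, prompt="add a hat, following the top row")
# b_prime = edited.crop((TILE, TILE, 2 * TILE, 2 * TILE))  # keep only the bottom-right quadrant
```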