Spatial Intelligence Lab

Align Your Flow:

Scaling Continuous-Time Flow Map Distillation

1 NVIDIA
2 University of Toronto
3 Vector Institute

Figure 1. Four-step samples generated by our distilled FLUX.1-dev flow map model.



Abstract


Diffusion and flow-based models have emerged as state-of-the-art generative modeling approaches, but they require many sampling steps. Consistency models can distill these models into efficient one-step generators; however, unlike flow- and diffusion-based methods, their performance inevitably degrades when increasing the number of steps, which we show both analytically and empirically. Flow maps generalize these approaches by connecting any two noise levels in a single step and remain effective across all step counts. In this paper, we introduce two new continuous-time objectives for training flow maps, along with additional novel training techniques, generalizing existing consistency and flow matching objectives. We further demonstrate that autoguidance can improve performance, using a low-quality model for guidance during distillation, and an additional boost can be achieved by adversarial finetuning, with minimal loss in sample diversity. We extensively validate our flow map models, called Align Your Flow, on challenging image generation benchmarks and achieve state-of-the-art few-step generation performance on both ImageNet 64x64 and 512x512, using small and efficient neural networks. Finally, we show text-to-image flow map models that outperform all existing non-adversarially trained few-step samplers in text-conditioned synthesis.

Figure 2. Flow maps generalize both consistency models and flow matching by connecting any two noise levels \(s, t\) in a single step. When \(s=0\), flow maps reduce to consistency models; when \(s \to t\) they are equivalent to standard flow matching models. Our proposed AYF-EMD objective similarly generalizes the continuous-time consistency and flow matching losses.

Consistency Models are Flawed Multi-Step Generators


Diffusion and flow-based models have revolutionized generative modeling, but they rely on slow iterative sampling. Consistency models are a popular approach to distill these models into efficient one-step generators. They learn to map samples that lie on deterministic noise-to-data paths to the same, consistent clean output in a single step. However, they have been empirically shown to degrade in performance as the number of sampling steps increases. In this work, we analytically explain this phenomenon and show that consistency models are inherently incompatible with multi-step sampling.

Specifically, we analyze a simple case where the data distribution is an isotropic Gaussian with standard deviation \(c\), i.e., \(p_{\textrm{data}}(\mathbf{x}) = \mathcal{N}(\mathbf{0}, c^2\mathbf{I})\). Let \(\mathbf{f}^*(\mathbf{x}_t, t)\) denote the optimal consistency model. We show that no matter how small an error \(\epsilon\) we allow, there exists an imperfect consistency model \(\mathbf{f}(\mathbf{x}_t, t)\) such that \(\mathbb{E}_{\mathbf{x}_t \sim p_t}\!\bigl[\| \mathbf{f}(\mathbf{x}_t, t) - \mathbf{f}^*(\mathbf{x}_t, t) \|_2^2\bigr] < \epsilon\) for all \(t \in [0, 1]\), yet using \(\mathbf{f}\) for multi-step sampling leads to error accumulation beyond a certain step count. In other words, even for arbitrarily small modeling errors, multi-step sampling with consistency models eventually diverges as the number of sampling steps increases (see the paper for the proof and derivation).
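To make this concrete, below is a small, self-contained simulation of the Gaussian toy setting. It is our own illustrative sketch, not the construction from the paper's proof: the multiplicative perturbation, the uniform time schedule, and the constants are our choices. It perturbs the analytically optimal consistency model by a small factor and tracks how far the statistics of multi-step samples drift from the data distribution as the step count grows.

import numpy as np

# Toy setting from above: p_data = N(0, c^2), linear path x_t = (1 - t) x_0 + t * eps,
# so the marginal at time t is N(0, m(t)) with m(t) = (1 - t)^2 c^2 + t^2, and the
# optimal consistency model is the rescaling f*(x_t, t) = c * x_t / sqrt(m(t)).
# We perturb it multiplicatively by (1 + delta) and measure how far the standard
# deviation of multi-step samples drifts from c as the number of steps grows.
# (Illustrative sketch: delta, c, and the schedule are our own choices.)

c, delta = 1.0, 0.02
rng = np.random.default_rng(0)

def m(t):
    return (1.0 - t) ** 2 * c ** 2 + t ** 2

def f_imperfect(x, t):
    return (1.0 + delta) * c * x / np.sqrt(m(t))

def multistep_sample(n_steps, n_samples=200_000):
    ts = np.linspace(1.0, 0.0, n_steps + 1)       # t_N = 1 > ... > t_0 = 0
    x = rng.standard_normal(n_samples)            # marginal at t = 1 is N(0, 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x0 = f_imperfect(x, t_cur)                # denoise all the way to t = 0
        if t_next > 0.0:                          # CM-style: re-add fresh noise
            x = (1.0 - t_next) * x0 + t_next * rng.standard_normal(n_samples)
        else:
            x = x0
    return x

for n in [1, 2, 4, 8, 16, 32]:
    print(f"{n:2d} steps: sample std = {multistep_sample(n).std():.4f}  (data std = {c})")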

Intuitively, when doing multi-step sampling with consistency models, we first denoise by removing all noise from a noisy image to obtain a clean one, and then re-add a smaller amount of noise. But because the model is not perfect, the denoised image drifts slightly off the true data manifold. When noise is added back in, the resulting image is slightly off the noisy manifold the model was trained on. This mismatch compounds: each denoising step starts from a slightly worse input, pushing the sample further off-manifold over time. As a result, errors accumulate with more sampling steps, and image quality degrades beyond a certain point.

To overcome this issue, we advocate for the Flow Map framework, which unifies flow matching and consistency models by connecting any two noise levels in a single step. Flow maps offer the best of both worlds: they achieve high-quality samples with just a few sampling steps while remaining naturally compatible with multi-step sampling. Figure 3 shows a comparison between these methods.

Figure 3. FID versus number of sampling steps on ImageNet 512x512 (lower is better). Diffusion models require dozens of steps to reach good quality, and consistency models deteriorate after only a few, whereas our AYF flow maps maintain low FID across the board. AYF is slightly weaker at single-step generation, but a brief adversarial fine-tuning stage closes this gap and improves quality for all numbers of sampling steps with minimal loss in diversity.

Flow Maps: Generalizing Flow Matching and Consistency Models


Flow maps are neural networks \(\mathbf{f}_\theta(\mathbf{x}_t, t, s)\) that map a noisy input \(\mathbf{x}_t\) directly to any other point \(\mathbf{x}_s\) by following the probability flow ODE (PF-ODE) from time \(t\) to \(s\). When \(s = 0\), they reduce to standard CMs. When performing many small steps, i.e. \(s \to t\), they reduce to regular flow matching models (see Figure 2). In this work, we propose two continuous-time objectives for training flow maps, as well as training techniques to improve sample quality. For proofs, derivations and more, please refer to the paper.
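To illustrate how such a model is used at inference time, the following PyTorch-style sketch chains flow map jumps along a decreasing sequence of noise levels. This is our own illustrative pseudocode; the actual sampler and time schedule used in the paper may differ.

import torch

@torch.no_grad()
def flow_map_sample(f_theta, shape, n_steps, device="cuda"):
    # Few-step sampling with a trained flow map f_theta(x_t, t, s): start from
    # pure noise at t = 1 and repeatedly jump to the next (lower) noise level.
    # n_steps = 1 recovers one-step generation; many small steps approach an
    # ordinary flow matching / PF-ODE sampler.  (Illustrative sketch.)
    ts = torch.linspace(1.0, 0.0, n_steps + 1, device=device)
    x = torch.randn(shape, device=device)                     # x_1 ~ N(0, I)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        t = torch.full((shape[0],), float(t_cur), device=device)
        s = torch.full((shape[0],), float(t_next), device=device)
        x = f_theta(x, t, s)                                  # jump from level t to s
    return x                                                  # approximate sample at t = 0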

Our first objective, which we call the AYF-Eulerian Map Distillation (AYF-EMD) objective, is similar to the objective used to train continuous-time consistency models. This objective aims to ensure that for a fixed \(s\), the output of the flow map remains constant as we move \((\mathbf{x}_t, t)\) along the PF-ODE. This objective reduces to the continuous-time consistency model objective when \(s = 0\), and reduces to the flow matching objective when \(s \to t\). Note that the time derivative contains \(\theta^- = \text{stopgrad}(\theta)\), meaning that this objective does not require backpropagating through the time derivative, avoiding unstable second-order derivatives during training. This objective is compatible with both distillation and training from scratch.

\[ \text{AYF}_{\text{EMD}} = \nabla_\theta \mathbb{E}_{\mathbf{x}_t, t, s} \left[ w(t, s) \mathrm{sign}(t - s) \cdot \mathbf{f}_\theta^{\top}(\mathbf{x}_t, t, s) \cdot \frac{\mathrm{d} \mathbf{f}_{\theta^-}(\mathbf{x}_t, t, s)}{\mathrm{d}t} \right] \]
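In practice, one typically optimizes a surrogate loss whose gradient matches this expression. The sketch below is our own PyTorch-style pseudocode, not the paper's released implementation: the function names, tensor shapes, and weighting are assumptions. It computes the total time derivative of the stop-gradient flow map along the PF-ODE with a forward-mode Jacobian-vector product, using the teacher velocity as the tangent for \(\mathbf{x}_t\).

import torch
from torch.func import jvp

def ayf_emd_loss(f_theta, f_theta_minus, v_teacher, x_t, t, s, w):
    # Surrogate loss whose gradient matches the AYF-EMD expression above.
    # Illustrative sketch: f_theta_minus is a stop-gradient copy of the student,
    # v_teacher is the pretrained PF-ODE velocity, x_t is a (B, C, H, W) batch,
    # and t, s, w are (B,) tensors.  The total derivative d f_{theta^-}/dt along
    # the PF-ODE is a Jacobian-vector product with tangent (v(x_t, t), 1).
    v = v_teacher(x_t, t).detach()                       # dx_t/dt along the PF-ODE
    _, dfdt = jvp(
        lambda x, tt: f_theta_minus(x, tt, s),           # s is held fixed
        (x_t, t),
        (v, torch.ones_like(t)),
    )
    f = f_theta(x_t, t, s)
    inner = (f * dfdt.detach()).flatten(1).sum(dim=1)    # f_theta^T (d f_{theta^-}/dt)
    return (w * torch.sign(t - s) * inner).mean()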

Our second objective, which we call the AYF-Lagrangian Map Distillation (AYF-LMD) objective, is similar to the objective proposed in Flow Map Matching. It encourages that, for a fixed \((\mathbf{x}_t, t)\), the trajectory \(\mathbf{f}_\theta(\mathbf{x}_t, t, \cdot)\) is aligned with the PF-ODE at all points. However, like AYF-EMD and unlike the original Flow Map Matching objective, it does not require backpropagating through the time derivative. This objective is only compatible with distillation, as it assumes access to a pretrained flow-based model \(\mathbf{v}_\phi(\mathbf{x}_t, t)\).

\[ \text{AYF}_{\text{LMD}} = \nabla_\theta \mathbb{E}_{\mathbf{x}_t, t, s} \left[ w(t, s) \, \mathrm{sign}(s - t) \cdot \mathbf{f}_\theta^{\top}(\mathbf{x}_t, t, s) \cdot \left( \frac{\mathrm{d} \mathbf{f}_{\theta^-}(\mathbf{x}_t, t, s)}{\mathrm{d}s} - \mathbf{v}_\phi(\mathbf{f}_{\theta^-}(\mathbf{x}_t, t, s), s) \right) \right] \]
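The AYF-LMD objective admits an analogous surrogate. In the hedged sketch below (same illustrative conventions and caveats as the EMD sketch), the derivative is taken with respect to \(s\), and the pretrained velocity \(\mathbf{v}_\phi\) is evaluated at the flow map's own output.

import torch
from torch.func import jvp

def ayf_lmd_loss(f_theta, f_theta_minus, v_phi, x_t, t, s, w):
    # Surrogate loss whose gradient matches the AYF-LMD expression above
    # (illustrative sketch).  The derivative is taken with respect to s, and the
    # pretrained velocity v_phi is evaluated at f_{theta^-}(x_t, t, s).
    out, dfds = jvp(
        lambda ss: f_theta_minus(x_t, t, ss),            # x_t and t are held fixed
        (s,),
        (torch.ones_like(s),),
    )
    target = (dfds - v_phi(out, s)).detach()             # d f_{theta^-}/ds - v_phi(f, s)
    f = f_theta(x_t, t, s)
    inner = (f * target).flatten(1).sum(dim=1)
    return (w * torch.sign(s - t) * inner).mean()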

In this work, we focus on distillation. We assume access to a pretrained diffusion or flow matching model, and aim to distill it into a flow map. Following prior works, we distill guidance into the flow map, and show that this can be improved further by using autoguidance as opposed to classifier-free guidance. A detailed training algorithm is provided in the paper.
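As a rough illustration of how the guided teacher could be formed (our own sketch, not the paper's exact algorithm; the names and the guidance weight are assumptions): instead of combining conditional and unconditional outputs as in classifier-free guidance, autoguidance extrapolates a strong teacher away from a deliberately weakened one, and the resulting guided velocity is what the flow map is distilled against.

def autoguided_velocity(v_strong, v_weak, x_t, t, cond, w_guid=1.5):
    # Autoguided teacher velocity (illustrative sketch).  v_strong is the
    # pretrained teacher and v_weak a deliberately degraded version of it
    # (e.g. a smaller or under-trained checkpoint).  With w_guid > 1 the strong
    # model is extrapolated away from the weak one, analogous to classifier-free
    # guidance but without an unconditional branch.  The result can be used in
    # place of the teacher velocity in the distillation objectives above.
    vs = v_strong(x_t, t, cond)
    vw = v_weak(x_t, t, cond)
    return vw + w_guid * (vs - vw)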

Generation Results


To evaluate the effectiveness of our AYF flow maps, we perform rigorous quantitative experiments on standard image generation benchmarks (ImageNet 64x64, ImageNet 512x512). We show that our flow map models effectively solve the multi-step sampling problem of CMs and are effective image generators at all step counts, as measured by FID; see Figure 3. We also show that, given a pretrained flow map model, a short finetuning stage with an adversarial loss can significantly boost performance for all numbers of sampling steps with minimal impact on sample diversity (measured by recall scores); see Table 1 in the paper for details. Putting everything together, we distill state-of-the-art flow maps using small networks that achieve the best speed-quality trade-off across the board, as seen in Figure 5. Extensive comparisons to baselines can be found in the tables in the paper.

Figure 5. By combining flow maps with our proposed training techniques (autoguidance and adversarial finetuning), we distill state-of-the-art flow maps using small and highly efficient networks (using the S-sized architecture from EDM2) that deliver the best speed-quality trade-off across the board. These small models outperform much larger consistency baselines at every runtime budget, even when they sample in a single step. sCD and sCT refer to the distilled and from-scratch continuous-time consistency models from sCM, respectively.

ImageNet 512x512

In this section, we show some one- and two-step samples generated by our distilled flow map model on ImageNet 512x512.

One-step samples generated by our AYF flow map model on ImageNet 512x512.

Two-step samples generated by our AYF flow map model on ImageNet 512x512.

Text-to-Image

We also evaluate our method on text-to-image generation by distilling the FLUX.1-dev model. Following prior work, we add a lightweight LoRA to the base model and finetune it using the AYF-EMD objective. This finetuning process is quick, taking only about 4 hours on 8x A100 GPUs.
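For readers who want a starting point, the sketch below shows how a lightweight LoRA adapter could be attached to the FLUX.1-dev transformer with the diffusers and peft libraries before optimizing the AYF-EMD objective. The rank, alpha, target modules, and learning rate here are illustrative guesses, not the paper's settings.

import torch
from diffusers import FluxPipeline
from peft import LoraConfig

# Illustrative setup sketch: load FLUX.1-dev, freeze the base transformer, and
# attach a LoRA adapter to its attention projections; only the LoRA weights
# would then be trained with the AYF-EMD objective.  The hyperparameters below
# are our guesses, not the paper's settings.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.transformer.requires_grad_(False)
lora_cfg = LoraConfig(
    r=64,
    lora_alpha=64,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
pipe.transformer.add_adapter(lora_cfg)

trainable = [p for p in pipe.transformer.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)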

Four-step samples from the distilled text-to-image flow map model are shown in Figure 1, and below we show further samples and compare them to those from LCM and TCD, two prior LoRA-based consistency-distilled models built on SDXL. Our model generates images with more visual detail and better text alignment, as confirmed by a user study reported in the paper.

Side-by-side comparison: LCM, TCD, and AYF (ours).

Text prompt: "Cute jumping spider in pirate hat on a yellow daisy"

Text prompt: "Red squirrel drumming on tiny twig and acorn drums in autumn woods"

Text prompt: "Misty mountain lake at dawn, with calm water and lush pine trees"

Text prompt: "Neon-blue glowing gecko on finger in cyberpunk city at night"

Text prompt: "A lone samurai faces a crimson forest at dusk. Leaves swirl, a bridge crosses a koi pond, and a distant temple glows in the fading light"

Text prompt: "A floating carnival drifts above the clouds neon lit rides, surreal Ferris cabins, and balloons gliding through caramel-scented air"

Text prompt: "A weathered yellow robot cat plays with a white rabbit with a flower on its back in a rainy, misty alley, with rusted buildings looming overhead."

Paper


Align Your Flow: Scaling Continuous-Time Flow Map Distillation

Amirmojtaba Sabour, Sanja Fidler, Karsten Kreis

arXiv, 2025

Paper
BibTeX

Citation



      @misc{sabour2025align,
        title={Align Your Flow: Scaling Continuous-Time Flow Map Distillation},
        author={Amirmojtaba Sabour and Sanja Fidler and Karsten Kreis},
        year={2025},
        eprint={2506.14603},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
      }