CARV: Variance Reduction for Expectations with Diffusion Teachers

International Conference on Machine Learning (ICML) 2026
Structured Probabilistic Inference & Generative Modeling (SPIGM) Workshop

TL;DR

We propose CARV, a compute-aware variance-reduction framework for diffusion-teacher gradients, which gives a 2 to 3 times effective compute multiplier on diffusion-guided optimization and data attribution without changing the objective.

Abstract

Pretrained diffusion models serve as frozen teachers feeding downstream pipelines such as 3D generation, single-step distillation, and data attribution. The teacher gradients these pipelines consume are Monte Carlo expectations over noise levels and Gaussian noise samples; their estimator variance dominates compute cost because each draw requires expensive upstream work (rendering, simulation, encoding). We introduce CARV, a compute-aware variance-accounting framework that motivates a hierarchical Monte Carlo estimator: amortize the expensive upstream computation over cheap diffusion-noise resamples, sharpened by timestep importance sampling and a stratified inverse-CDF construction. In 3D generation and attribution experiments, CARV delivers 2 to 3 times effective compute multipliers (most from amortized reuse, with about 25% additional from importance sampling and stratification) without changing the objective. In single-step distillation, the same techniques cut gradient variance by an order of magnitude but do not improve downstream FID, indicating that MC variance is no longer the bottleneck in that setting.

Same per-step compute, sharper text-to-3D.

Both columns run at the same per-step compute: standard SDS on the left, CARV on the right.

Figure: baseline vs. CARV renders for the prompts "A castle-shaped sandcastle" and "A crumpled soda can".

Why MC variance dominates compute

Teacher-guided pipelines all look the same on paper: form a stochastic gradient estimate, take a step, repeat. But the expensive part is rarely the diffusion model. It is the upstream render, encode, or simulate that runs before the diffusion teacher sees a sample.

One render, many noise samples

SDS renders a 3D scene, encodes it, then asks the teacher for a denoising direction. Render and encode are the bottleneck; noising and denoising are cheap. So we make the cheap part work harder: resample $(t, \epsilon)$ many times per render and reuse the encoded latent.

Variance, not bias, sets the budget

Most pipelines inherit a timestep distribution, average more samples, and tune compute blindly. CARV measures gradient variance against compute and tells you which axis to attack: timestep allocation, noise resamples, or render reuse.
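
One way to make "which axis to attack" concrete is the standard nested Monte Carlo variance decomposition, sketched below. This is a textbook identity written in this page's notation, not necessarily the exact accounting used in the paper. It assumes $R$ independent renders $x^{(r)}$, each paired with $K$ conditionally independent draws $(t^{(r,k)}, \epsilon^{(r,k)})$; the symbols $\sigma_x^2$, $\sigma_{t,\epsilon}^2$, $c_x$ (cost of one render plus encode) and $c_d$ (cost of one teacher denoise) are introduced here for illustration. For vector gradients, read each variance as the trace of the covariance.

```latex
\begin{align}
\hat{g}_{R,K} &= \frac{1}{RK}\sum_{r=1}^{R}\sum_{k=1}^{K}
    g\big(x^{(r)}, t^{(r,k)}, \epsilon^{(r,k)}\big), \\
\operatorname{Var}\big[\hat{g}_{R,K}\big]
    &= \frac{\sigma_x^2}{R} + \frac{\sigma_{t,\epsilon}^2}{RK},
\qquad
\sigma_x^2 = \operatorname{Var}_x\big(\mathbb{E}[\,g \mid x\,]\big),
\quad
\sigma_{t,\epsilon}^2 = \mathbb{E}_x\big[\operatorname{Var}(g \mid x)\big], \\
\mathrm{cost}(R,K) &= R\,(c_x + K c_d),
\qquad
\operatorname{Var}\big[\hat{g}_{R,K}\big]\cdot \mathrm{cost}(R,K)
    = \Big(\sigma_x^2 + \frac{\sigma_{t,\epsilon}^2}{K}\Big)\big(c_x + K c_d\big).
\end{align}
```

Read this way, the product of variance and cost does not depend on $R$, so the split between renders and resamples is what matters. If $\sigma_{t,\epsilon}^2$ dominates and $c_d \ll c_x$, raising $K$ is cheap variance reduction; if $\sigma_x^2$ dominates, only more renders help. Better timestep allocation acts on $\sigma_{t,\epsilon}^2$ itself.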

Method: three drop-in fixes

All three fixes are unbiased under the stated sampling, replace a few lines of the per-step sampling logic, and stack on top of any SDS, DMD, or TRAK pipeline that already calls a diffusion teacher.

1. Amortized compute reuse: the largest contribution
Hold the expensive render and encode fixed and resample the cheap diffusion noise. Cache the latent of one render and pair it with $K$ different $(t, \epsilon)$ draws. The estimator stays unbiased, the per-step variance drops, and most of the freed budget goes to actual optimization signal instead of repeated rendering.
Figure: compute-reuse strategy. Left: the standard baseline, one noise sample per render. Right: re-use (ours), one render paired with multiple noise resamples.
Render once, resample diffusion noise many times. Both strategies render, encode, noise, denoise, and backpropagate; re-use yields $R\times K$ gradient vectors $g(x^{(r)}, t^{(r,k)}, \epsilon^{(r,k)})$ at the cost of one extra denoise per draw. It helps when $(t, \epsilon)$ drives the variance and denoising is cheaper than rendering.
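
The reuse idea can be sketched on a toy problem. The code below is a minimal illustration under stated assumptions, not the paper's implementation: `render` and `teacher_grad` are hypothetical scalar stand-ins for the real render/encode and teacher call, and `RENDER_COST` and `DENOISE_COST` are made-up cost units chosen so that $(R{=}2, K{=}1)$ and $(R{=}1, K{=}10)$ cost the same.

```python
import random
import statistics

RENDER_COST = 8.0   # hypothetical cost of one render + encode
DENOISE_COST = 1.0  # hypothetical cost of one teacher denoise


def render(rng):
    """Stand-in for the expensive upstream step (render + encode)."""
    return rng.gauss(0.0, 0.3)


def teacher_grad(latent, rng):
    """Stand-in for one teacher call at a fresh (t, eps) draw."""
    t = rng.random()
    eps = rng.gauss(0.0, 1.0)
    return 1.0 + latent + 2.0 * t * eps  # true expectation is 1.0


def hierarchical_estimate(R, K, rng):
    """Average R*K gradient samples: R renders, K noise draws per render."""
    total = 0.0
    for _ in range(R):
        latent = render(rng)                    # expensive, once per render
        for _ in range(K):
            total += teacher_grad(latent, rng)  # cheap, repeated K times
    return total / (R * K)


def step_cost(R, K):
    """Per-step cost under the toy cost model above."""
    return R * (RENDER_COST + K * DENOISE_COST)


def empirical_variance(R, K, trials=4000, seed=0):
    """Variance of the estimator across independent optimization steps."""
    rng = random.Random(seed)
    values = [hierarchical_estimate(R, K, rng) for _ in range(trials)]
    return statistics.pvariance(values)


if __name__ == "__main__":
    for R, K in [(2, 1), (1, 10)]:
        print(R, K, step_cost(R, K), empirical_variance(R, K))
```

In this toy the noise draw carries most of the variance, so the reuse configuration comes out well ahead at equal cost. If the render term carried most of the variance instead, the same code would show little or no benefit.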
2. Timestep importance sampling: weight by where the gradient is
The variance-minimizing proposal is $q^\star(t) \propto p(t)\sqrt{\mathbb{E}[\,\|\mathbf{f}(t,\epsilon)\|^2 \mid t]}$. The renderer/encoder Jacobian reshapes the timestep dependence: latent gradients peak mid-schedule, but the parameter gradients we actually backpropagate track $w_{\mathrm{SDS}}(t)=\sigma_t^2$. So $w_{\mathrm{SDS}}$ is a free, accurate surrogate for the oracle.
Figure (left), toy oracle vs. surrogates: a test function with a sharp peak near t=0.75 is integrated under three proposals: uniform (variance 0.77, 1.0× compute), Gaussian-at-peak surrogate (variance 0.35, 2.2×), and the intractable oracle proportional to $\sqrt{\mathbb{E}[\|\mathbf{f}\|^2 \mid t]}$ (variance 0.19, 4.1×).
Figure (middle), sampled latent gradient norm vs. timestep index $t$ (latent space, non-monotonic): the density and mean curve peak near t=700, then drop.
Figure (right), sampled parameter gradient norm vs. timestep index $t$ (parameter space, tracks $w_{\mathrm{SDS}}(t)$): the density and mean rise monotonically with $t$, and the dotted green $w(t)=\sigma_t^2$ curve overlays the empirical mean.
From toy to real data. Left: a toy illustration of the oracle importance proposal $q^\star(t) \propto p(t)\sqrt{\mathbb{E}[\|\mathbf{f}(t,\epsilon)\|^2 \mid t]}$, variance-equivalent to about $4\times$ uniform compute on this test function; a Gaussian-at-peak surrogate is worth ${\sim}2.2\times$. Middle: on real SDS runs the latent-space gradient norm is non-monotonic and peaks mid-schedule, so $w_{\mathrm{SDS}}$ alone is a poor proposal in latent space. Right: after backprop through the renderer and encoder, the parameter gradient is monotonic in $t$ and closely tracks $w_{\mathrm{SDS}}(t)=\sigma_t^2$ (dotted green), so $w_{\mathrm{SDS}}$ is a faithful zero-cost surrogate at the level we actually update.
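
A minimal sketch of timestep importance sampling with the $w_{\mathrm{SDS}}$ surrogate follows. Everything here is a toy under stated assumptions: the linear `sigma_t` schedule behind `W_SDS`, the uniform prior `P`, and the scalar `grad_sample` are hypothetical stand-ins, chosen so that the gradient magnitude scales with $w_{\mathrm{SDS}}(t)$ as the parameter-space plot suggests. The one essential step is the reweighting by $p(t)/q(t)$, which keeps the estimator unbiased.

```python
import random
import statistics

T = 1000
# Hypothetical schedule: sigma_t grows linearly, so w_SDS(t) = sigma_t^2.
W_SDS = [((t + 1) / T) ** 2 for t in range(T)]
P = [1.0 / T] * T                           # inherited prior p(t), uniform here
Z = sum(p * w for p, w in zip(P, W_SDS))    # normalizer; also the true mean below
Q = [p * w / Z for p, w in zip(P, W_SDS)]   # proposal q(t) proportional to p(t) * w_SDS(t)


def grad_sample(i, rng):
    """Toy scalar gradient whose magnitude scales with w_SDS at index i."""
    return W_SDS[i] * (1.0 + rng.gauss(0.0, 1.0))


def uniform_estimate(n, rng):
    """Plain Monte Carlo: draw t from p and average."""
    idx = rng.choices(range(T), weights=P, k=n)
    return sum(grad_sample(i, rng) for i in idx) / n


def importance_estimate(n, rng):
    """Draw t from q and reweight by p(t) / q(t) to stay unbiased."""
    idx = rng.choices(range(T), weights=Q, k=n)
    return sum(grad_sample(i, rng) * P[i] / Q[i] for i in idx) / n


def mean_and_variance(estimator, n=8, trials=3000, seed=0):
    """Empirical mean and variance of an estimator over repeated steps."""
    rng = random.Random(seed)
    values = [estimator(n, rng) for _ in range(trials)]
    return statistics.fmean(values), statistics.pvariance(values)


if __name__ == "__main__":
    print("uniform   ", mean_and_variance(uniform_estimate))
    print("importance", mean_and_variance(importance_estimate))
```

Both estimators target the same mean `Z`; only the variance differs. In this toy the surrogate happens to match the oracle shape exactly, so the remaining variance comes entirely from the noise draw. On a real pipeline the match is only approximate and the gain would be smaller.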
3. Stratified inverse-CDF: spread your samples
Partition the unit interval into $B$ equal-width bins, draw one uniform sample per bin, then map each draw through the inverse CDF of the importance-sampling proposal. Unbiased, never higher variance than IID at the same per-step cost, and stacks on top of step (2).
Figure, stratified vs. IID samples: across three batches each, IID samples cluster while stratified samples hit every bin.
Figure, inverse-CDF mapping to $q(t) \propto w_{\mathrm{SDS}}(t)$: stratified $u \in [0,1]$ is mapped through the inverse CDF to non-uniform $t$ samples that match the SDS importance weighting.
Stratifying $u \in [0,1]$ guarantees one sample per bin; pushing through the inverse-CDF then maps those uniform draws into stratified samples from the importance-weighted proposal $q(t) \propto w_{\mathrm{SDS}}(t)$. Combines step (2) and step (3) without extra compute.
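
The construction above can be sketched in a few lines. This is an illustrative version over a discrete timestep grid, assuming a hypothetical linear `sigma_t` schedule for `W_SDS`; the helper names are invented for this example. `iid_timesteps` is included only as the comparison baseline.

```python
import bisect
import itertools
import random
import statistics

T = 1000
# Hypothetical schedule: sigma_t grows linearly, so w_SDS(t) = sigma_t^2.
W_SDS = [((t + 1) / T) ** 2 for t in range(T)]


def make_inverse_cdf(weights):
    """Build the discrete inverse CDF of q(t) proportional to weights[t]."""
    cdf = list(itertools.accumulate(weights))
    total = cdf[-1]
    cdf = [c / total for c in cdf]

    def inverse_cdf(u):
        # Smallest index whose cumulative mass reaches u.
        return min(bisect.bisect_left(cdf, u), len(cdf) - 1)

    return inverse_cdf


def stratified_uniforms(B, rng):
    """One uniform draw inside each of B equal-width bins of [0, 1)."""
    return [(b + rng.random()) / B for b in range(B)]


def stratified_timesteps(inverse_cdf, B, rng):
    """Stratify u, then push each draw through the inverse CDF of q."""
    return [inverse_cdf(u) for u in stratified_uniforms(B, rng)]


def iid_timesteps(inverse_cdf, B, rng):
    """IID baseline with the same number of samples."""
    return [inverse_cdf(rng.random()) for _ in range(B)]


if __name__ == "__main__":
    rng = random.Random(0)
    inv = make_inverse_cdf(W_SDS)
    print("stratified:", stratified_timesteps(inv, 8, rng))
    print("iid:       ", sorted(iid_timesteps(inv, 8, rng)))
```

Each stratified draw is still marginally uniform within its bin, so the batch as a whole follows $q$ and the same $p(t)/q(t)$ reweighting applies unchanged. The only difference from IID sampling is that no region of the proposal can be missed by chance.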

Results: effective compute multipliers

  • 2.6 to 3.3×: SDS effective compute multiplier (text-to-3D gradient variance, equal-cost baseline).
  • 2 to 3×: data-attribution multiplier (influence-score variance on Wan2.1 video attribution).
  • 10×: DMD gradient variance reduction (FID does not follow; see DMD section below).

Text-to-3D distillation (SDS)

Three views of the same SDS sweep: CLIP score over training, qualitative renders along the same training axis, and the swept effective-compute frontier.

Figure (top), CLIP score vs. training step (log x axis), averaged over 30 prompts and 3 seeds, multi-view: IW+Strat (1,16) sits above Uniform (4,1) for the entire ramp before both saturate at ~0.335.
Figure (bottom), qualitative SDS renders at training iterations 0, 500, 1000, 2000, 4000 for two prompts (castle, apple), comparing baseline (4,1) to IW+Strat (1,16): IW+Strat resolves coherent geometry several thousand iterations earlier than the baseline at the same per-iteration cost.
Matched per-iteration cost (about 300 to 400 ms per iteration). Top: CLIP score across training; CARV (red, IW+Strat, 1,16) sits above the uniform baseline (blue, 4,1) for the entire ramp before both saturate. Bottom: renders along the same iteration axis. CARV reaches near-converged geometry at $\sim$1k iterations while the baseline is still a coarse blob; both prompts show roughly $2\times$ wall-clock speedup to comparable quality.

Effective compute multiplier (ECM) vs. equal-cost uniform baseline, swept over $(R, K)$ pairs at fixed total budget. The IW + stratified frontier (red) sits above uniform-only re-use (blue) at every compute point, peaking at 3.3× at $(R{=}4, K{=}8)$.

Figure: SDS effective compute multiplier vs. total time per iteration.
Each marker is an $(R, K)$ pair (renders, re-noisings); the annotation shows the count. Red diamonds = IW + Stratified, blue circles = uniform. The dashed lines trace $(R{=}1, K)$ as $K$ grows. Hierarchical re-use drives most of the lift; IW + stratification add a free 20-25% margin on top.

Qualitative still renders

Final renders at the matched-budget end of the SDS runs. Baseline left, CARV right. Same prompts as the autoplay turntables above; these are the static frames at the end of training.

Figure: baseline (left) vs. CARV (right) final renders for six prompts: "A red apple", "A jeweled beetle", "An emerald gemstone", "A pocket watch", "A crumpled soda can", "An ice cream cone".

Video data attribution (MOTIVE / Wan2.1)

The same idea drops into MOTIVE-style video data attribution: gradient computations on Wan2.1-T2V become much cheaper to estimate accurately. Stratified sampling reaches the same correlation with ground-truth rankings using far fewer per-clip gradient samples than the IID baseline. Same 2 to 3 times multiplier story; full sweeps in the paper.

Figure: mean correlation between estimator-based and ground-truth influence rankings (y axis) vs. gradient samples per data point (x axis). Stratified sampling reaches 1.0 by ~400 samples; the IID baseline is still climbing past 700.
Mean correlation of limited-sample influence rankings with ground-truth gradients on Wan2.1 video attribution (VIDGEN-1M, MOTIVE setup, leave-one-out over 11 queries). Stratified sampling (red) saturates near 1.0 well before the IID baseline (blue): a 1.3 to 3.8 times compute multiplier across budgets, over 2 times at practical budgets.

When MC variance is not the bottleneck: DMD

The same techniques cut DMD gradient variance by an order of magnitude, yet downstream FID barely moves. We treat this as a feature: DMD marks where MC variance stops being the bottleneck, and where the next gain must come from auxiliary stabilizers or input diversity.

Figure: DMD student FID over training, IID baseline variants vs. Stratified ($R{=}8$, $K{=}16$). All four IID curves and the Stratified curve track within noise; variance fell ~10× but FID did not follow.
Student FID across training. Stratified ($R{=}8$, $K{=}16$) cuts gradient variance by an order of magnitude over the matched IID setting (orange vs. blue dashed); FID stays inside the noise band of the IID curves. CARV is still unbiased and correct here, but lower variance does not unlock a better student in this regime.

When CARV helps, and when it does not

Use CARV when…

  • The MC gradient (not its target) dominates wall-clock cost.
  • One render or encode is much more expensive than one diffusion noising step.
  • Gradient variance is plausibly limiting convergence (early training, low classifier-free guidance, sparse signal).
  • Pipelines: SDS (text-to-3D, audio-SDS, materials), data attribution (TRAK, MOTIVE), influence scoring, anything where rendering or encoding is the per-step cost.

Be sceptical when…

  • Convergence is bottlenecked by other dynamics (auxiliary stabilizers, input diversity, optimizer state).
  • Render and denoise costs are comparable, so the multiplier collapses.
  • Pipelines: DMD-style single-step distillation, where 10 times variance reduction did not improve FID. CARV remains correct, but it does not reduce FID in this regime.
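
The cost-ratio criteria in the two lists above can be checked with a back-of-the-envelope calculator. The sketch below assumes the usual nested Monte Carlo model, in which the estimator variance per render is `var_outer + var_inner / K` and the cost per render is `cost_render + K * cost_denoise`; the function names and all numbers are hypothetical, and the "multiplier" here is simply an equal-cost variance ratio, which may differ from the paper's exact ECM definition.

```python
import math


def work(var_outer, var_inner, cost_render, cost_denoise, K):
    """Variance times cost for one render paired with K noise draws."""
    return (var_outer + var_inner / K) * (cost_render + K * cost_denoise)


def reuse_multiplier(var_outer, var_inner, cost_render, cost_denoise, K):
    """Equal-cost variance ratio of the K=1 baseline to K-fold reuse."""
    baseline = work(var_outer, var_inner, cost_render, cost_denoise, 1)
    return baseline / work(var_outer, var_inner, cost_render, cost_denoise, K)


def best_K(var_outer, var_inner, cost_render, cost_denoise):
    """Continuous minimizer of work over K, clipped to at least 1."""
    k = math.sqrt((var_inner * cost_render) / (var_outer * cost_denoise))
    return max(1.0, k)


if __name__ == "__main__":
    # Render 8x the cost of a denoise, noise variance 10x render variance.
    print(reuse_multiplier(0.1, 1.0, 8.0, 1.0, K=9))
    # Same variances, but render and denoise cost the same.
    print(reuse_multiplier(0.1, 1.0, 1.0, 1.0, K=9))
    print(best_K(0.1, 1.0, 8.0, 1.0))
```

With these made-up inputs, the first configuration gives a multiplier of roughly 2.8 and the second roughly 1.04, which is the collapse described in the second list. A model like this only speaks to gradient variance; it says nothing about regimes such as DMD where variance is not what limits the final metric.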

Takeaways

The framing

  • Diffusion-teacher gradients are Monte Carlo expectations.
  • Their variance per compute is the right axis to optimise, not raw step count.
  • The expensive randomness is the render or encode, not the noise.

The fixes

  • Compute reuse: render once, resample noise $K$ times.
  • Importance sampling: $q \propto p\,w_{\mathrm{SDS}}$, no extra compute.
  • Stratification: one sample per bin via inverse-CDF.
  • All three are unbiased and replace a few lines of code.

The limits

  • 2 to 3 times ECM on SDS and attribution.
  • 10 times variance reduction on DMD, but no FID gain.
  • Variance was the bottleneck on SDS, but not on DMD.

Citation

@misc{bettencourt2026carv,
  title={Variance Reduction for Expectations with Diffusion Teachers},
  author={Bettencourt, Jesse and Wu, Xindi and Atzmon, Matan and Lucas, James and Lorraine, Jonathan},
  year={2026},
  eprint={2605.21489},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2605.21489},
}

Acknowledgements

We thank the Fundamental Generative AI Research (GenAIR) group at NVIDIA for the reference codebase and many helpful conversations during integration. We thank Sanja Fidler and the Spatial Intelligence Lab (SIL) for hosting the internship that made this collaboration possible. The text-to-3D experiments build on the open-source threestudio framework; the single-step distillation experiments build on the NVIDIA FastGen reference implementation.