Toronto AI Lab

Score-Based Generative Modeling
with Critically-Damped Langevin Diffusion

1 NVIDIA 2 University of Waterloo 3 Vector Institute
ICLR 2022 (spotlight)

In critically-damped Langevin diffusion, the data \(\mathbf{x}_t\) is augmented with a velocity \(\mathbf{v}_t\). A diffusion coupling \(\mathbf{x}_t\) and \(\mathbf{v}_t\) is run in the joint data-velocity space (probabilities in red). Noise is injected only into \(\mathbf{v}_t\). This leads to smooth diffusion trajectories (green) for the data \(\mathbf{x}_t\). Denoising only requires \(\nabla_{\mathbf{v}_t} \log p_t(\mathbf{v}_t \,|\, \mathbf{x}_t)\).

News


[Mar 2022] Our code has been released.
[Feb 2022] Karsten presented our work at Deep Learning: Classics and Trends by ML Collective (slides).
[Jan 2022] Our paper was accepted at the International Conference on Learning Representations (ICLR) as a spotlight presentation! It received an average reviewer rating of 8.5, placing it among the top 0.4% of submissions!
[Jan 2022] Tim presented our work at the Vector Institute.
[Dec 2021] Twitter thread explaining the work in detail.
[Dec 2021] Project page released!
[Dec 2021] Draft released on arXiv!

Abstract


Score-based generative models (SGMs) have demonstrated remarkable synthesis quality. SGMs rely on a diffusion process that gradually perturbs the data towards a tractable distribution, while the generative model learns to denoise. The complexity of this denoising task is, apart from the data distribution itself, uniquely determined by the diffusion process. We argue that current SGMs employ overly simplistic diffusions, leading to unnecessarily complex denoising processes, which limit generative modeling performance. Based on connections to statistical mechanics, we propose a novel critically-damped Langevin diffusion (CLD) and show that CLD-based SGMs achieve superior performance. CLD can be interpreted as running a joint diffusion in an extended space, where the auxiliary variables can be considered "velocities" that are coupled to the data variables as in Hamiltonian dynamics. We derive a novel score matching objective for CLD and show that the model only needs to learn the score function of the conditional distribution of the velocity given data, an easier task than learning scores of the data directly. We also derive a new sampling scheme for efficient synthesis from CLD-based diffusion models. We find that CLD outperforms previous SGMs in synthesis quality for similar network architectures and sampling compute budgets. We show that our novel sampler for CLD significantly outperforms solvers such as Euler–Maruyama. Our framework provides new insights into score-based denoising diffusion models and can be readily used for high-resolution image synthesis.

Score-Based Generative Modeling
with Critically-Damped Langevin Diffusion


Score-based generative models (SGMs) and denoising diffusion probabilistic models have emerged as a promising class of generative models. SGMs offer high quality synthesis and sample diversity, do not require adversarial objectives, and have found applications in image, speech, and music synthesis, image editing, super-resolution, image-to-image translation, and 3D shape generation. SGMs use a diffusion process to gradually add noise to the data, transforming a complex data distribution to an analytically tractable prior distribution. A neural network is then utilized to learn the score function—the gradient of the log probability density—of the perturbed data. The learnt scores can be used to solve a stochastic differential equation (SDE) to synthesize new samples. This corresponds to an iterative denoising process, inverting the forward diffusion.
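Concretely, writing the forward diffusion as \(d\mathbf{x}_t = \mathbf{f}(\mathbf{x}_t, t)\,dt + g(t)\,d\mathbf{w}_t\), new samples are obtained by simulating the well-known reverse-time SDE

\[
d\mathbf{x}_t = \left[\mathbf{f}(\mathbf{x}_t, t) - g(t)^2\, \nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t)\right] dt + g(t)\, d\bar{\mathbf{w}}_t
\]

backward in time, from the prior to the data, with the intractable score \(\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t)\) replaced by the learnt neural network approximation.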

It has been shown that the score function that needs to be learnt by the neural network is uniquely determined by the forward diffusion process. Consequently, the complexity of the learning problem depends, apart from the data itself, only on the diffusion. Hence, the diffusion process is the key component to revisit in order to further improve SGMs, for example, in terms of synthesis quality or sampling speed.

Inspired by statistical mechanics, we propose a novel forward diffusion process, the critically-damped Langevin diffusion (CLD). In CLD, the data variable \(\mathbf{x}_t\) (time \(t\) along the diffusion) is augmented with an additional "velocity" variable \(\mathbf{v}_t\), and a diffusion process is run in the joint data-velocity space. Data and velocity are coupled to each other as in Hamiltonian dynamics, and noise is injected only into the velocity variable. As in Hamiltonian Monte Carlo, the Hamiltonian component helps to efficiently traverse the joint data-velocity space and to transform the data distribution into the prior distribution more smoothly. We derive the corresponding score matching objective and show that for CLD the neural network is tasked with learning only the score of the conditional distribution of velocity given data, \(\nabla_{\mathbf{v}_t} \log p_t(\mathbf{v}_t \,|\, \mathbf{x}_t)\), which is arguably easier than learning the score of the diffused data distribution directly. Using techniques from molecular dynamics, we also derive a novel SDE integrator tailored to CLD's reverse-time synthesis SDE.
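The joint forward SDE of CLD can be sketched as follows (see the paper for the precise formulation and hyperparameters; here \(\beta\) is a time rescaling, \(M\) a mass parameter, and \(\Gamma\) the friction coefficient):

\[
\begin{aligned}
d\mathbf{x}_t &= M^{-1} \mathbf{v}_t\, \beta\, dt, \\
d\mathbf{v}_t &= \left(-\mathbf{x}_t - \Gamma M^{-1} \mathbf{v}_t\right) \beta\, dt + \sqrt{2\Gamma\beta}\, d\mathbf{w}_t .
\end{aligned}
\]

The first terms implement the Hamiltonian coupling, friction and noise act only on \(\mathbf{v}_t\), and critical damping corresponds to \(\Gamma^2 = 4M\). Since the diffusion coefficient is zero in the data coordinates, the reverse-time SDE involves the score only through \(\nabla_{\mathbf{v}_t} \log p_t(\mathbf{x}_t, \mathbf{v}_t) = \nabla_{\mathbf{v}_t} \log p_t(\mathbf{v}_t \,|\, \mathbf{x}_t)\).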

Schematic visualization of CLD's forward diffusion and reverse-time synthesis process: At the top, we visualize how a one-dimensional data distribution (a mixture of three Normals), together with the velocity, diffuses towards the prior in the joint data-velocity space and how generation proceeds in the reverse direction. We sample three different diffusion trajectories (in green) and also show the projections onto the data and velocity spaces on the right. We can see smooth diffusion trajectories for the data variables. At the bottom, we visualize a similar diffusion and synthesis process for (high-dimensional) image generation. We see that the velocities "encode" the data at intermediate times \(t\).
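A setting like the one in the schematic above can be illustrated with a few lines of code. The sketch below simulates CLD's forward diffusion for a one-dimensional mixture of three Normals using a simple Euler–Maruyama discretization; the constants (\(\beta\), \(M\), \(\gamma\)), the step count, and the mixture parameters are illustrative choices and not necessarily the paper's settings.

import numpy as np

# Illustrative hyperparameters (not necessarily the paper's settings).
beta, M = 4.0, 0.25
Gamma = 2.0 * np.sqrt(M)        # critical damping: Gamma^2 = 4 * M
gamma0 = 0.04                   # scale of the initial velocity variance (assumption)
T, n_steps, n_samples = 1.0, 1000, 5000
dt = T / n_steps
rng = np.random.default_rng(0)

# 1D data: mixture of three Normals, as in the schematic above.
means = np.array([-2.0, 0.0, 2.0])
x = means[rng.integers(0, 3, n_samples)] + 0.1 * rng.standard_normal(n_samples)
# Initial velocities are drawn from a narrow Normal.
v = np.sqrt(gamma0 * M) * rng.standard_normal(n_samples)

trajectory = [x.copy()]
for _ in range(n_steps):
    # Hamiltonian coupling: x is updated deterministically, driven by v.
    dx = (v / M) * beta * dt
    # Velocity: coupling to x, friction, and the only noise injection.
    dv = (-x - Gamma * v / M) * beta * dt \
         + np.sqrt(2.0 * Gamma * beta * dt) * rng.standard_normal(n_samples)
    x, v = x + dx, v + dv
    trajectory.append(x.copy())

# The data trajectories are smooth because no noise acts on x directly;
# at t = T the joint distribution is close to the Normal prior.
print(x.mean(), x.std(), v.std())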

Technical Contributions


We make the following technical contributions:

  • We propose CLD, a novel diffusion process for SGMs.
  • We derive a score matching objective for CLD which requires only the score of the conditional distribution of velocity given data (a sketch follows this list).
  • We propose hybrid denoising score matching, a new type of denoising score matching ideally suited for scalable training of CLD-based SGMs.
  • We derive a tailored SDE integrator that enables efficient sampling from CLD-based models.
  • Overall, we provide novel insights into SGMs and point out important new connections to statistical mechanics.
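As a rough sketch of the score matching objective from the list above (see the paper for the exact weighting \(\lambda(t)\) and the parameterization of the score model): writing \(\mathbf{u}_t = (\mathbf{x}_t, \mathbf{v}_t)\), CLD is a linear SDE, so the perturbation kernel \(p_t(\mathbf{u}_t \,|\, \mathbf{u}_0)\) is Normal and denoising score matching takes the form

\[
\min_\theta\; \mathbb{E}_{t \sim \mathcal{U}[0,T]}\, \mathbb{E}_{\mathbf{u}_0}\, \mathbb{E}_{\mathbf{u}_t \sim p_t(\mathbf{u}_t | \mathbf{u}_0)} \left[ \lambda(t) \left\| \mathbf{s}_\theta(\mathbf{u}_t, t) - \nabla_{\mathbf{v}_t} \log p_t(\mathbf{u}_t \,|\, \mathbf{u}_0) \right\|_2^2 \right],
\]

where the target involves only the gradient with respect to \(\mathbf{v}_t\) because noise enters the diffusion only through the velocity. Hybrid denoising score matching additionally marginalizes the initial velocity \(\mathbf{v}_0 \sim \mathcal{N}(\mathbf{0}, \gamma M \mathbf{I})\) in closed form, so the expectation conditions only on the data \(\mathbf{x}_0\).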

Experimental Results


We extensively validate CLD and the new SDE solver:

  • We show that the neural networks learnt in CLD-based SGMs are smoother than those of previous SGMs. We attribute this to the Hamiltonian component in the diffusion and to CLD's easier score function target, the conditional score of the velocity given the data, \(\nabla_{\mathbf{v}_t} \log p_t(\mathbf{v}_t \,|\, \mathbf{x}_t)\).
  • On the CIFAR-10 image modeling benchmark, we demonstrate that CLD-based models outperform previous diffusion models in synthesis quality for similar network architectures and sampling compute budgets. Our CLD-based SGMs achieve FID scores of 2.25 and 2.23 using probability flow ODE sampling and generative SDE sampling, respectively.
  • We show that our novel SDE integrator for CLD is well suited for synthesis with limited neural network calls and significantly outperforms the popular Euler–Maruyama method (a plain Euler–Maruyama baseline is sketched after this list).
  • We perform ablations on various aspects of CLD and find that CLD does not have difficult-to-tune hyperparameters.
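For reference, a plain Euler–Maruyama baseline for CLD's reverse-time SDE can be sketched as follows. Here score_fn stands for a hypothetical trained network approximating \(\nabla_{\mathbf{v}_t} \log p_t(\mathbf{v}_t \,|\, \mathbf{x}_t)\), and all constants and the step count are illustrative; this is the baseline that the tailored integrator improves upon, not the tailored integrator itself.

import numpy as np

def em_sample_cld(score_fn, dim, n_samples=16, n_steps=150,
                  beta=4.0, M=0.25, T=1.0, eps=1e-5, seed=0):
    """Baseline Euler-Maruyama sampler for the CLD reverse-time SDE (sketch)."""
    Gamma = 2.0 * np.sqrt(M)                   # critical damping
    rng = np.random.default_rng(seed)
    # Start from the (approximate) Normal prior of the joint space.
    x = rng.standard_normal((n_samples, dim))
    v = np.sqrt(M) * rng.standard_normal((n_samples, dim))
    ts = np.linspace(T, eps, n_steps + 1)
    for i in range(n_steps):
        t, dt = ts[i], ts[i] - ts[i + 1]       # dt > 0; we integrate backward in time
        s = score_fn(x, v, t)                  # approximate velocity score
        # Reverse drift = forward drift minus (G G^T) times the score;
        # G G^T is nonzero only in the velocity block.
        drift_x = (v / M) * beta
        drift_v = (-x - Gamma * v / M) * beta - 2.0 * Gamma * beta * s
        noise_v = np.sqrt(2.0 * Gamma * beta * dt) * rng.standard_normal(v.shape)
        x = x - dt * drift_x
        v = v - dt * drift_v + noise_v
    return x

# Shape check with a dummy score (the score of the prior velocity marginal).
print(em_sample_cld(lambda x, v, t: -v / 0.25, dim=2).shape)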
Samples from our CLD-based SGMs as well as latent space traversals and sample generation paths are visualized below.

Generated samples for CIFAR-10 (left) and CelebA-HQ-256 (right). CLD-based SGMs generate sharp, high-quality, and diverse samples.


The sequence above is generated by randomly traversing the latent space of our CLD-SGM model (using the probability flow ODE formulation).



Visualization of the generation paths of samples from our CelebA-HQ-256 model (synthesis uses only 150 steps). Odd and even rows visualize data and velocity variables, respectively. The eight columns correspond to times \(t \in \{1.0, 0.5, 0.3, 0.2, 0.1, 10^{-2}, 10^{-3}, 10^{-5}\}\) (from left to right). The velocity distribution converges to a Normal (different variances) for both \(t \to 0\) and \(t \to 1\). See Appendix F.3 in our paper for visualization details and discussion.

Paper


Score-Based Generative Modeling with Critically-Damped Langevin Diffusion

Tim Dockhorn, Arash Vahdat, Karsten Kreis

International Conference on Learning Representations (ICLR), 2022 (spotlight)

arXiv version
BibTeX
Code

Citation


@inproceedings{dockhorn2022score,
    title={Score-Based Generative Modeling with Critically-Damped Langevin Diffusion},
    author={Tim Dockhorn and Arash Vahdat and Karsten Kreis},
    booktitle={International Conference on Learning Representations (ICLR)},
    year={2022}
}