Understanding Diffusion Models: A Unified Perspective

Calvin Luo
Google Research, Brain Team
calvinluo@google.com

August 26, 2022

Introduction: Generative Models
Background: ELBO, VAE, and Hierarchical VAE
Variational Diffusion Models
- Learning Diffusion Noise Parameters
- Three Equivalent Interpretations
Score-based Generative Models
Guidance
- Classifier Guidance
- Classifier-Free Guidance
Closing

Introduction: Generative Models

Given observed samples $x$ from a distribution of interest, the goal of a generative model is to learn to model its true data distribution $p (x)$ . Once learned, we can generate new samples from our approximate model at will. Furthermore, under some formulations, we are able to use the learned model to evaluate the likelihood of observed or sampled data as well.

There are several well-known directions in the current literature, which we introduce only briefly and at a high level. Generative Adversarial Networks (GANs) model the sampling procedure of a complex distribution, which is learned in an adversarial manner. Another class of generative models, termed "likelihood-based", seeks to learn a model that assigns a high likelihood to the observed data samples. This includes autoregressive models, normalizing flows, and Variational Autoencoders (VAEs). Another similar approach is energy-based modeling, in which a distribution is learned as an arbitrarily flexible energy function that is then normalized. Score-based generative models are highly related; instead of learning to model the energy function itself, they learn the score of the energy-based model as a neural network. In this work we explore and review diffusion models, which, as we will demonstrate, have both likelihood-based and score-based interpretations. We showcase the math behind such models in excruciating detail, with the aim that anyone can follow along and understand what diffusion models are and how they work.

Background: ELBO, VAE, and Hierarchical VAE

For many modalities, we can think of the data we observe as represented or generated by an associated unseen latent variable, which we can denote by random variable $z$ . The best intuition for expressing this idea is through Plato’s Allegory of the Cave. In the allegory, a group of people are chained inside a cave their entire life and can only see the two-dimensional shadows projected onto a wall in front of them, which are generated by unseen three-dimensional objects passed before a fire. To such people, everything they observe is actually determined by higher-dimensional abstract concepts that they can never behold.

Analogously, the objects that we encounter in the actual world may also be generated as a function of some higher-level representations; for example, such representations may encapsulate abstract properties such as color, size, shape, and more. Then, what we observe can be interpreted as a three-dimensional projection or instantiation of such abstract concepts, just as what the cave people observe is actually a two-dimensional projection of three-dimensional objects. Whereas the cave people can never see (or even fully comprehend) the hidden objects, they can still reason and draw inferences about them; in a similar way, we can approximate latent representations that describe the data we observe.

Whereas Plato’s Allegory illustrates the idea behind latent variables as potentially unobservable representations that determine observations, a caveat of this analogy is that in generative modeling, we generally seek to learn lower-dimensional latent representations rather than higher-dimensional ones. This is because trying to learn a representation of higher dimension than the observation is a fruitless endeavor without strong priors. On the other hand, learning lower-dimensional latents can also be seen as a form of compression, and can potentially uncover semantically meaningful structure describing observations.

Evidence Lower Bound

Mathematically, we can imagine the latent variables and the data we observe as modeled by a joint distribution $p (x, z)$ . Recall one approach of generative modeling, termed "likelihood-based", is to learn a model to maximize the likelihood $p (x)$ of all observed $x$ . There are two ways we can manipulate this joint distribution to recover the likelihood of purely our observed data $p (x)$ ; we can explicitly marginalize out the latent variable $z$ :

p (x) = \int p (x, z) d z

(1)

or, we could also appeal to the chain rule of probability:

p (x) = \frac{p (x, z)}{p (z | x)}

(2)

Directly computing and maximizing the likelihood $p (x)$ is difficult because it either involves integrating out all latent variables $z$ in Equation 1, which is intractable for complex models, or it involves having access to a ground truth latent encoder $p (z | x)$ in Equation 2. However, using these two equations, we can derive a term called the Evidence Lower Bound (ELBO), which, as its name suggests, is a lower bound of the evidence. The evidence is quantified in this case as the log likelihood of the observed data. Then, maximizing the ELBO becomes a proxy objective with which to optimize a latent variable model; in the best case, when the ELBO is powerfully parameterized and perfectly optimized, it becomes exactly equivalent to the evidence. Formally, the equation of the ELBO is:

E_{q_{ϕ} (z | x)} [log \frac{p (x, z)}{q_{ϕ} (z | x)}]

(3)

To make the relationship with the evidence explicit, we can mathematically write:

log p (x) \geq E_{q_{ϕ} (z | x)} [log \frac{p (x, z)}{q_{ϕ} (z | x)}]

(4)

Here, $q_{ϕ} (z | x)$ is a flexible approximate variational distribution with parameters $ϕ$ that we seek to optimize. Intuitively, it can be thought of as a parameterizable model that is learned to estimate the true distribution over latent variables for given observations $x$ ; in other words, it seeks to approximate true posterior $p (z | x)$ . As we will see when exploring the Variational Autoencoder, as we increase the lower bound by tuning the parameters $ϕ$ to maximize the ELBO, we gain access to components that can be used to model the true data distribution and sample from it, thus learning a generative model. For now, let us try to dive deeper into why the ELBO is an objective we would like to maximize.

Let us begin by deriving the ELBO, using Equation 1:

$log p (x)$	$= log \int p (x, z) d z$	(Apply Equation 1)	(5)
	$= log \int \frac{p (x, z) q_{ϕ} (z | x)}{q_{ϕ} (z | x)} d z$	(Multiply by $1 = \frac{q_{ϕ} (z | x)}{q_{ϕ} (z | x)}$ )	(6)
	$= log E_{q_{ϕ} (z | x)} [\frac{p (x, z)}{q_{ϕ} (z | x)}]$	(Definition of Expectation)	(7)
	$\geq E_{q_{ϕ} (z | x)} [log \frac{p (x, z)}{q_{ϕ} (z | x)}]$	(Apply Jensen’s Inequality)	(8)

In this derivation, we directly arrive at our lower bound by applying Jensen’s Inequality. However, this does not supply us much useful information about what is actually going on underneath the hood; crucially, this proof gives no intuition on exactly why the ELBO is actually a lower bound of the evidence, as Jensen’s Inequality handwaves it away. Furthermore, simply knowing that the ELBO is truly a lower bound of the data does not really tell us why we want to maximize it as an objective. To better understand the relationship between the evidence and the ELBO, let us perform another derivation, this time using Equation 2:

$log p (x)$	$= log p (x) \int q_{ϕ} (z | x) d z$	(Multiply by $1 = \int q_{ϕ} (z | x) d z$ )	(9)
	$= \int q_{ϕ} (z | x) (log p (x)) d z$	(Bring evidence into integral)	(10)
	$= E_{q_{ϕ} (z | x)} [log p (x)]$	(Definition of Expectation)	(11)
	$= E_{q_{ϕ} (z | x)} [log \frac{p (x, z)}{p (z | x)}]$	(Apply Equation 2)	(12)
	$= E_{q_{ϕ} (z | x)} [log \frac{p (x, z) q_{ϕ} (z | x)}{p (z | x) q_{ϕ} (z | x)}]$	(Multiply by $1 = \frac{q_{ϕ} (z | x)}{q_{ϕ} (z | x)}$ )	(13)
	$= E_{q_{ϕ} (z | x)} [log \frac{p (x, z)}{q_{ϕ} (z | x)}] + E_{q_{ϕ} (z | x)} [log \frac{q_{ϕ} (z | x)}{p (z | x)}]$	(Split the Expectation)	(14)
	$= E_{q_{ϕ} (z | x)} [log \frac{p (x, z)}{q_{ϕ} (z | x)}] + D_{KL} (q_{ϕ} (z | x) ∥ p (z | x))$	(Definition of KL Divergence)	(15)
	$\geq E_{q_{ϕ} (z | x)} [log \frac{p (x, z)}{q_{ϕ} (z | x)}]$	(KL Divergence always $\geq 0$ )	(16)

From this derivation, we clearly observe from Equation 15 that the evidence is equal to the ELBO plus the KL Divergence between the approximate posterior $q_{ϕ} (z | x)$ and the true posterior $p (z | x)$ . In fact, it was this KL Divergence term that was magically removed by Jensen’s Inequality in Equation 8 of the first derivation. Understanding this term is the key to understanding not only the relationship between the ELBO and the evidence, but also the reason why optimizing the ELBO is an appropriate objective at all.

Firstly, we now know why the ELBO is indeed a lower bound: the difference between the evidence and the ELBO is a strictly non-negative KL term, thus the value of the ELBO can never exceed the evidence.

Secondly, we explore why we seek to maximize the ELBO. Having introduced latent variables $z$ that we would like to model, our goal is to learn this underlying latent structure that describes our observed data. In other words, we want to optimize the parameters of our variational posterior $q_{ϕ} (z | x)$ to exactly match the true posterior distribution $p (z | x)$ , which is achieved by minimizing their KL Divergence (ideally to zero). Unfortunately, it is intractable to minimize this KL Divergence term directly, as we do not have access to the ground truth $p (z | x)$ distribution. However, notice that on the left hand side of Equation 15, the likelihood of our data (and therefore our evidence term $log p (x)$ ) is always a constant with respect to $ϕ$ , as it is computed by marginalizing out all latents $z$ from the joint distribution $p (x, z)$ and does not depend on $ϕ$ whatsoever. Since the ELBO and KL Divergence terms sum up to a constant, any maximization of the ELBO term with respect to $ϕ$ necessarily invokes an equal minimization of the KL Divergence term. Thus, the ELBO can be maximized as a proxy for learning how to perfectly model the true latent posterior distribution; the more we optimize the ELBO, the closer our approximate posterior gets to the true posterior. Additionally, once trained, the ELBO can be used to estimate the likelihood of observed or generated data as well, since it is learned to approximate the model evidence $log p (x)$ .
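To make the identity in Equation 15 concrete, the following minimal sketch checks it numerically on a toy discrete model. The three-state latent and the probability values are hypothetical, chosen purely for illustration; any valid joint and any valid $q$ would serve equally well.

```python
import numpy as np

# Hypothetical toy model: one fixed observation x and a latent z with three states.
p_joint = np.array([0.10, 0.25, 0.05])   # p(x, z) for z = 0, 1, 2 at the observed x
q = np.array([0.5, 0.3, 0.2])            # an arbitrary approximate posterior q(z | x)

evidence = np.log(p_joint.sum())         # log p(x), by marginalizing z (Equation 1)
posterior = p_joint / p_joint.sum()      # true posterior p(z | x), from Equation 2

elbo = np.sum(q * np.log(p_joint / q))   # E_q[log p(x, z) / q(z | x)]
kl = np.sum(q * np.log(q / posterior))   # D_KL(q(z | x) || p(z | x))

# When q equals the true posterior, the KL term vanishes and the bound is tight.
elbo_at_optimum = np.sum(posterior * np.log(p_joint / posterior))

print(evidence, elbo + kl, elbo_at_optimum)  # all three agree up to rounding
```

Because the evidence does not depend on $q$, any change to `q` that raises `elbo` must lower `kl` by the same amount.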

Variational Autoencoders

Figure 1: A Variational Autoencoder graphically represented. Here, encoder $q (z | x)$ defines a distribution over latent variables $z$ for observations $x$ , and $p (x | z)$ decodes latent variables into observations.

In the default formulation of the Variational Autoencoder (VAE) [7], we directly maximize the ELBO. This approach is variational, because we optimize for the best $q_{ϕ} (z | x)$ amongst a family of potential posterior distributions parameterized by $ϕ$ . It is called an autoencoder because it is reminiscent of a traditional autoencoder model, where input data is trained to predict itself after undergoing an intermediate bottlenecking representation step. To make this connection explicit, let us dissect the ELBO term further:

$E_{q_{ϕ} (z | x)} [log \frac{p (x, z)}{q_{ϕ} (z | x)}]$	$= E_{q_{ϕ} (z | x)} [log \frac{p_{θ} (x | z) p (z)}{q_{ϕ} (z | x)}]$	(Chain Rule of Probability)	(17)
	$= E_{q_{ϕ} (z | x)} [log p_{θ} (x | z)] + E_{q_{ϕ} (z | x)} [log \frac{p (z)}{q_{ϕ} (z | x)}]$	(Split the Expectation)	(18)
	$= \underbrace{E_{q_{ϕ} (z | x)} [log p_{θ} (x | z)]}_{\text{reconstruction term}} - \underbrace{D_{KL} (q_{ϕ} (z | x) ∥ p (z))}_{\text{prior matching term}}$	(Definition of KL Divergence)	(19)

In this case, we learn an intermediate bottlenecking distribution $q_{ϕ} (z | x)$ that can be treated as an encoder; it transforms inputs into a distribution over possible latents. Simultaneously, we learn a deterministic function $p_{θ} (x | z)$ to convert a given latent vector $z$ into an observation $x$ , which can be interpreted as a decoder.

The two terms in Equation 19 each have intuitive descriptions: the first term measures the reconstruction likelihood of the decoder from our variational distribution; this ensures that the learned distribution is modeling effective latents that the original data can be regenerated from. The second term measures how similar the learned variational distribution is to a prior belief held over latent variables. Minimizing this term encourages the encoder to actually learn a distribution rather than collapse into a Dirac delta function. Maximizing the ELBO is thus equivalent to maximizing its first term and minimizing its second term.

A defining feature of the VAE is how the ELBO is optimized jointly over parameters $ϕ$ and $θ$ . The encoder of the VAE is commonly chosen to model a multivariate Gaussian with diagonal covariance, and the prior is often selected to be a standard multivariate Gaussian:

	$q_{ϕ} (z | x)$	$= N (z; μ_{ϕ} (x), σ_{ϕ}^{2} (x) I)$		(20)
	$p (z)$	$= N (z; 0, I)$		(21)

Then, the KL divergence term of the ELBO can be computed analytically, and the reconstruction term can be approximated using a Monte Carlo estimate. Our objective can then be rewritten as:

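\arg\max_{ϕ, θ} E_{q_{ϕ} (z | x)} [log p_{θ} (x | z)] - D_{KL} (q_{ϕ} (z | x) ∥ p (z)) \approx \arg\max_{ϕ, θ} \frac{1}{L} \sum_{l = 1}^{L} log p_{θ} (x | z^{(l)}) - D_{KL} (q_{ϕ} (z | x) ∥ p (z))

(22)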

where latents $\{z^{(l)}\}_{l = 1}^{L}$ are sampled from $q_{ϕ} (z | x)$ , for every observation $x$ in the dataset. However, a problem arises in this default setup: each $z^{(l)}$ that our loss is computed on is generated by a stochastic sampling procedure, which is generally non-differentiable. Fortunately, this can be addressed via the reparameterization trick when $q_{ϕ} (z | x)$ is designed to model certain distributions, including the multivariate Gaussian.
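As an illustration of the analytically computable KL term, the sketch below evaluates the standard closed form for the divergence between a diagonal Gaussian and a standard Gaussian, and compares it against a Monte Carlo estimate. The function names and the example values of the mean and standard deviation are our own assumptions, not part of any particular implementation.

```python
import numpy as np

def kl_diag_gaussian_to_standard(mu, sigma):
    """Closed-form D_KL(N(mu, diag(sigma^2)) || N(0, I))."""
    var = sigma ** 2
    return 0.5 * np.sum(mu ** 2 + var - 1.0 - np.log(var))

def log_normal_diag(z, mu, sigma):
    """Log-density of N(mu, diag(sigma^2)) at each row of z."""
    return np.sum(
        -0.5 * np.log(2.0 * np.pi * sigma ** 2) - 0.5 * ((z - mu) / sigma) ** 2,
        axis=-1,
    )

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])      # hypothetical encoder mean for one observation
sigma = np.array([0.8, 1.5])    # hypothetical encoder standard deviation

# Monte Carlo check: average log q(z | x) - log p(z) over samples z ~ q(z | x).
z = mu + sigma * rng.standard_normal((200_000, 2))
kl_mc = np.mean(
    log_normal_diag(z, mu, sigma) - log_normal_diag(z, np.zeros(2), np.ones(2))
)
kl_exact = kl_diag_gaussian_to_standard(mu, sigma)

print(kl_exact, kl_mc)  # the estimate fluctuates around the closed-form value
```

The closed form needs no sampling at all, which is why only the reconstruction term requires a Monte Carlo estimate.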

The reparameterization trick rewrites a random variable as a deterministic function of a noise variable; this allows for the optimization of the non-stochastic terms through gradient descent. For example, samples from a normal distribution $x \sim N (x; μ, σ^{2})$ with arbitrary mean $μ$ and variance $σ^{2}$ can be rewritten as:

x = μ + σ ϵ with ϵ \sim N (ϵ; 0, I)

In other words, arbitrary Gaussian distributions can be interpreted as standard Gaussians (of which $ϵ$ is a sample) that have their mean shifted from zero to the target mean $μ$ by addition, and their variance stretched by the target variance $σ^{2}$ . Therefore, by the reparameterization trick, sampling from an arbitrary Gaussian distribution can be performed by sampling from a standard Gaussian, scaling the result by the target standard deviation, and shifting it by the target mean.

In a VAE, each $z$ is thus computed as a deterministic function of input $x$ and auxiliary noise variable $ϵ$ :

z = μ_{ϕ} (x) + σ_{ϕ} (x) ⊙ ϵ with ϵ \sim N (ϵ; 0, I)

where $⊙$ represents an element-wise product. Under this reparameterized version of $z$ , gradients can then be computed with respect to $ϕ$ as desired, to optimize $μ_{ϕ}$ and $σ_{ϕ}$ . The VAE therefore utilizes the reparameterization trick and Monte Carlo estimates to optimize the ELBO jointly over $ϕ$ and $θ$ .
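A minimal sketch of the reparameterized sampling step is given below. The arrays `mu` and `sigma` are hypothetical stand-ins for the encoder outputs $μ_{ϕ} (x)$ and $σ_{ϕ} (x)$ ; in an actual VAE they would be produced by a neural network and differentiated by an automatic differentiation library rather than by the finite difference used here.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, sigma, eps):
    """z = mu + sigma * eps (element-wise): deterministic once eps is fixed."""
    return mu + sigma * eps

mu = np.array([1.0, -2.0])      # stand-in for the encoder output mu_phi(x)
sigma = np.array([0.5, 2.0])    # stand-in for the encoder output sigma_phi(x)
eps = rng.standard_normal((100_000, 2))   # eps ~ N(0, I), independent of mu, sigma

z = reparameterize(mu, sigma, eps)
print(z.mean(axis=0), z.std(axis=0))      # close to mu and sigma respectively

# With eps held fixed, z is a smooth function of the parameters:
# dz/dmu = 1 and dz/dsigma = eps, checked here by a finite difference.
h = 1e-6
dz_dsigma = (reparameterize(mu, sigma + h, eps) - z) / h
```

All randomness is isolated in `eps`, so gradients with respect to the distribution parameters are well defined for every sample.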

After training a VAE, generating new data can be performed by sampling directly from the latent space $p (z)$ and then running it through the decoder. Variational Autoencoders are particularly interesting when the dimensionality of $z$ is less than that of input $x$ , as we might then be learning compact, useful representations. Furthermore, when a semantically meaningful latent space is learned, latent vectors can be edited before being passed to the decoder to more precisely control the data generated.

Hierarchical Variational Autoencoders

A Hierarchical Variational Autoencoder (HVAE) [9, 15] is a generalization of a VAE that extends to multiple hierarchies over latent variables. Under this formulation, latent variables themselves are interpreted as generated from other higher-level, more abstract latents. Intuitively, just as we treat our three-dimensional observed objects as generated from a higher-level abstract latent, the people in Plato’s cave treat three-dimensional objects as latents that generate their two-dimensional observations. Therefore, from the perspective of Plato’s cave dwellers, their observations can be treated as modeled by a latent hierarchy of depth two (or more).

Whereas in the general HVAE with $T$ hierarchical levels, each latent is allowed to condition on all previous latents, in this work we focus on a special case which we call a Markovian HVAE (MHVAE). In a MHVAE, the generative process is a Markov chain; that is, each transition down the hierarchy is Markovian, where decoding each latent $z_{t}$ only conditions on previous latent $z_{t + 1}$ . Intuitively, and visually, this can be seen as simply stacking VAEs on top of each other, as depicted in Figure 2; another appropriate term describing this model is a Recursive VAE. Mathematically, we represent the joint distribution and the posterior of a Markovian HVAE as:

	$p (x, z_{1 : T})$	$= p (z_{T}) p_{θ} (x | z_{1}) \prod_{t = 2}^{T} p_{θ} (z_{t - 1} | z_{t})$		(23)
	$q_{ϕ} (z_{1 : T} | x)$	$= q_{ϕ} (z_{1} | x) \prod_{t = 2}^{T} q_{ϕ} (z_{t} | z_{t - 1})$		(24)

Then, we can easily extend the ELBO to be:

$log p (x)$	$= log \int p (x, z_{1 : T}) d z_{1 : T}$	(Apply Equation 1)	(25)
	$= log \int \frac{p (x, z_{1 : T}) q_{ϕ} (z_{1 : T} | x)}{q_{ϕ} (z_{1 : T} | x)} d z_{1 : T}$	(Multiply by 1 = $\frac{q_{ϕ} (z_{1 : T} | x)}{q_{ϕ} (z_{1 : T} | x)}$ )	(26)
	$= log E_{q_{ϕ} (z_{1 : T} | x)} [\frac{p (x, z_{1 : T})}{q_{ϕ} (z_{1 : T} | x)}]$	(Definition of Expectation)	(27)
	$\geq E_{q_{ϕ} (z_{1 : T} | x)} [log \frac{p (x, z_{1 : T})}{q_{ϕ} (z_{1 : T} | x)}]$	(Apply Jensen’s Inequality)	(28)

Figure 2: A Markovian Hierarchical Variational Autoencoder with $T$ hierarchical latents. The generative process is modeled as a Markov chain, where each latent $z_{t}$ is generated only from the previous latent $z_{t + 1}$ .

We can then plug our joint distribution (Equation 23) and posterior (Equation 24) into Equation 28 to produce an alternate form:

E_{q_{ϕ} (z_{1 : T} | x)} [log \frac{p (x, z_{1 : T})}{q_{ϕ} (z_{1 : T} | x)}]

= E_{q_{ϕ} (z_{1 : T} | x)} [log \frac{p (z_{T}) p_{θ} (x | z_{1}) \prod_{t = 2}^{T} p_{θ} (z_{t - 1} | z_{t})}{q_{ϕ} (z_{1} | x) \prod_{t = 2}^{T} q_{ϕ} (z_{t} | z_{t - 1})}]

(29)

As we will show below, when we investigate Variational Diffusion Models, this objective can be further decomposed into interpretable components.

Figure 3: A visual representation of a Variational Diffusion Model; $x_{0}$ represents true data observations such as natural images, $x_{T}$ represents pure Gaussian noise, and $x_{t}$ is an intermediate noisy version of $x_{0}$ . Each $q (x_{t} | x_{t - 1})$ is modeled as a Gaussian distribution that uses the output of the previous state as its mean.

Variational Diffusion Models

The easiest way to think of a Variational Diffusion Model (VDM) [14, 3, 8] is simply as a Markovian Hierarchical Variational Autoencoder with three key restrictions:

- The latent dimension is exactly equal to the data dimension.
- The structure of the latent encoder at each timestep is not learned; it is pre-defined as a linear Gaussian model. In other words, it is a Gaussian distribution centered around the output of the previous timestep.
- The Gaussian parameters of the latent encoders vary over time in such a way that the distribution of the latent at final timestep $T$ is a standard Gaussian.

Furthermore, we explicitly maintain the Markov property between hierarchical transitions from a standard Markovian Hierarchical Variational Autoencoder.

Let us expand on the implications of these assumptions. From the first restriction, with some abuse of notation, we can now represent both true data samples and latent variables as $x_{t}$ , where $t = 0$ represents true data samples and $t \in [1, T]$ represents a corresponding latent with hierarchy indexed by $t$ . The VDM posterior is the same as the MHVAE posterior (Equation 24), but can now be rewritten as:

q (x_{1 : T} | x_{0}) = \prod_{t = 1}^{T} q (x_{t} | x_{t - 1})

(30)

From the second assumption, we know that the distribution of each latent variable in the encoder is a Gaussian centered around its previous hierarchical latent. Unlike a Markovian HVAE, the structure of the encoder at each timestep $t$ is not learned; it is fixed as a linear Gaussian model, where the mean and standard deviation can be set beforehand as hyperparameters [3], or learned as parameters [8]. We parameterize the Gaussian encoder with mean $μ_{t} (x_{t}) = \sqrt{α_{t}} x_{t - 1}$ , and variance $Σ_{t} (x_{t}) = (1 - α_{t}) I$ , where the form of the coefficients is chosen such that the variance of the latent variables stays at a similar scale; in other words, the encoding process is variance-preserving. Note that alternate Gaussian parameterizations are allowed, and lead to similar derivations. The main takeaway is that $α_{t}$ is a (potentially learnable) coefficient that can vary with the hierarchical depth $t$ , for flexibility. Mathematically, encoder transitions are denoted as:

q (x_{t} | x_{t - 1}) = N (x_{t}; \sqrt{α_{t}} x_{t - 1}, (1 - α_{t}) I)

(31)

From the third assumption, we know that $α_{t}$ evolves over time according to a fixed or learnable schedule structured such that the distribution of the final latent $p (x_{T})$ is a standard Gaussian. We can then update the joint distribution of a Markovian HVAE (Equation 23) to write the joint distribution for a VDM as:

$p (x_{0 : T})$	$= p (x_{T}) \prod_{t = 1}^{T} p_{θ} (x_{t - 1} | x_{t})$	(32)
where,
$p (x_{T})$	$= N (x_{T}; 0, I)$	(33)

Collectively, what this set of assumptions describes is a steady noisification of an image input over time; we progressively corrupt an image by adding Gaussian noise until eventually it becomes completely identical to pure Gaussian noise. Visually, this process is depicted in Figure 3.
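The steady noisification described above can be simulated directly from Equation 31. The sketch below is illustrative only: the linear schedule for $1 - α_{t}$ , the number of steps, and the toy scalar "data" are assumptions we make for the example, not choices prescribed by the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
# Assumed schedule for illustration: 1 - alpha_t grows linearly from 1e-4 to 0.02.
alphas = 1.0 - np.linspace(1e-4, 0.02, T)

# Toy "data": scalar samples concentrated near 3, far from a standard Gaussian.
x = rng.normal(loc=3.0, scale=0.1, size=10_000)

for alpha_t in alphas:
    eps = rng.standard_normal(x.shape)
    # One encoder transition q(x_t | x_{t-1}) from Equation 31.
    x = np.sqrt(alpha_t) * x + np.sqrt(1.0 - alpha_t) * eps

print(x.mean(), x.std())  # approximately 0 and 1 after T steps
```

Under this schedule the original signal is scaled down to almost nothing by the final step, so the samples are statistically close to draws from $N (0, I)$ .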

Note that our encoder distributions $q (x_{t} | x_{t - 1})$ are no longer parameterized by $ϕ$ , as they are completely modeled as Gaussians with defined mean and variance parameters at each timestep. Therefore, in a VDM, we are only interested in learning conditionals $p_{θ} (x_{t - 1} | x_{t})$ , so that we can simulate new data. After optimizing the VDM, the sampling procedure is as simple as sampling Gaussian noise from $p (x_{T})$ and iteratively running the denoising transitions $p_{θ} (x_{t - 1} | x_{t})$ for $T$ steps to generate a novel $x_{0}$ .
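The sampling procedure just described can be sketched as a short loop. Here `reverse_step` is a hypothetical callable standing in for a draw from the learned transition $p_{θ} (x_{t - 1} | x_{t})$ ; the toy version below is not a trained model and only demonstrates the control flow.

```python
import numpy as np

def sample(reverse_step, shape, T, rng):
    """Draw x_T ~ N(0, I), then apply T denoising transitions to reach x_0."""
    x = rng.standard_normal(shape)       # x_T ~ p(x_T) = N(0, I)
    for t in range(T, 0, -1):            # t = T, T-1, ..., 1
        x = reverse_step(x, t, rng)      # x_{t-1} ~ p_theta(x_{t-1} | x_t)
    return x                             # the generated x_0

# Hypothetical stand-in for a trained denoiser: shrink and add a little noise.
def toy_reverse_step(x_t, t, rng):
    return 0.9 * x_t + 0.01 * rng.standard_normal(x_t.shape)

x0 = sample(toy_reverse_step, shape=(4, 2), T=50, rng=np.random.default_rng(0))
print(x0.shape)
```

Passing the transition as a callable keeps the loop independent of how $p_{θ}$ is parameterized.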

Like any HVAE, the VDM can be optimized by maximizing the ELBO, which can be derived as:

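$log p (x)$	$= log \int p (x_{0 : T}) d x_{1 : T}$	(34)
	$= log \int \frac{p (x_{0 : T}) q (x_{1 : T} | x_{0})}{q (x_{1 : T} | x_{0})} d x_{1 : T}$	(35)
	$= log E_{q (x_{1 : T} | x_{0})} [\frac{p (x_{0 : T})}{q (x_{1 : T} | x_{0})}]$	(36)
	$\geq E_{q (x_{1 : T} | x_{0})} [log \frac{p (x_{0 : T})}{q (x_{1 : T} | x_{0})}]$	(37)
	$= E_{q (x_{1 : T} | x_{0})} [log \frac{p (x_{T}) \prod_{t = 1}^{T} p_{θ} (x_{t - 1} | x_{t})}{\prod_{t = 1}^{T} q (x_{t} | x_{t - 1})}]$	(38)
	$= E_{q (x_{1 : T} | x_{0})} [log \frac{p (x_{T}) p_{θ} (x_{0} | x_{1}) \prod_{t = 2}^{T} p_{θ} (x_{t - 1} | x_{t})}{q (x_{T} | x_{T - 1}) \prod_{t = 1}^{T - 1} q (x_{t} | x_{t - 1})}]$	(39)
	$= E_{q (x_{1 : T} | x_{0})} [log \frac{p (x_{T}) p_{θ} (x_{0} | x_{1}) \prod_{t = 1}^{T - 1} p_{θ} (x_{t} | x_{t + 1})}{q (x_{T} | x_{T - 1}) \prod_{t = 1}^{T - 1} q (x_{t} | x_{t - 1})}]$	(40)
	$= E_{q (x_{1 : T} | x_{0})} [log \frac{p (x_{T}) p_{θ} (x_{0} | x_{1})}{q (x_{T} | x_{T - 1})}] + E_{q (x_{1 : T} | x_{0})} [log \prod_{t = 1}^{T - 1} \frac{p_{θ} (x_{t} | x_{t + 1})}{q (x_{t} | x_{t - 1})}]$	(41)
	$= E_{q (x_{1 : T} | x_{0})} [log p_{θ} (x_{0} | x_{1})] + E_{q (x_{1 : T} | x_{0})} [log \frac{p (x_{T})}{q (x_{T} | x_{T - 1})}] + E_{q (x_{1 : T} | x_{0})} [\sum_{t = 1}^{T - 1} log \frac{p_{θ} (x_{t} | x_{t + 1})}{q (x_{t} | x_{t - 1})}]$	(42)
	$= E_{q (x_{1 : T} | x_{0})} [log p_{θ} (x_{0} | x_{1})] + E_{q (x_{1 : T} | x_{0})} [log \frac{p (x_{T})}{q (x_{T} | x_{T - 1})}] + \sum_{t = 1}^{T - 1} E_{q (x_{1 : T} | x_{0})} [log \frac{p_{θ} (x_{t} | x_{t + 1})}{q (x_{t} | x_{t - 1})}]$	(43)
	$= E_{q (x_{1} | x_{0})} [log p_{θ} (x_{0} | x_{1})] + E_{q (x_{T - 1}, x_{T} | x_{0})} [log \frac{p (x_{T})}{q (x_{T} | x_{T - 1})}] + \sum_{t = 1}^{T - 1} E_{q (x_{t - 1}, x_{t}, x_{t + 1} | x_{0})} [log \frac{p_{θ} (x_{t} | x_{t + 1})}{q (x_{t} | x_{t - 1})}]$	(44)
	$= \underbrace{E_{q (x_{1} | x_{0})} [log p_{θ} (x_{0} | x_{1})]}_{\text{reconstruction term}} - \underbrace{E_{q (x_{T - 1} | x_{0})} [D_{KL} (q (x_{T} | x_{T - 1}) ∥ p (x_{T}))]}_{\text{prior matching term}} - \underbrace{\sum_{t = 1}^{T - 1} E_{q (x_{t - 1}, x_{t + 1} | x_{0})} [D_{KL} (q (x_{t} | x_{t - 1}) ∥ p_{θ} (x_{t} | x_{t + 1}))]}_{\text{consistency term}}$	(45)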

The derived form of the ELBO can be interpreted in terms of its individual components:

Figure 4: Under our first derivation, a VDM can be optimized by ensuring that for every intermediate $x_{t}$ , the posterior from the latent above it $p_{θ} (x_{t} | x_{t + 1})$ matches the Gaussian corruption of the latent before it $q (x_{t} | x_{t - 1})$ . In this figure, for each intermediate $x_{t}$ , we minimize the difference between the distributions represented by the pink and green arrows.

$E_{q (x_{1} | x_{0})} [log p_{θ} (x_{0} | x_{1})]$ can be interpreted as a reconstruction term, predicting the log probability of the original data sample given the first-step latent. This term also appears in a vanilla VAE, and can be trained similarly.
$E_{q (x_{T - 1} | x_{0})} [D_{KL} (q (x_{T} | x_{T - 1}) ∥ p (x_{T}))]$ is a prior matching term; it is minimized when the final latent distribution matches the Gaussian prior. This term requires no optimization, as it has no trainable parameters; furthermore, as we have assumed a large enough $T$ such that the final distribution is Gaussian, this term effectively becomes zero.
$E_{q (x_{t - 1}, x_{t + 1} | x_{0})} [D_{KL} (q (x_{t} | x_{t - 1}) ∥ p_{θ} (x_{t} | x_{t + 1}))]$ is a consistency term; it endeavors to make the distribution at $x_{t}$ consistent, from both forward and backward processes. That is, a denoising step from a noisier image should match the corresponding noising step from a cleaner image, for every intermediate timestep; this is reflected mathematically by the KL Divergence. This term is minimized when we train $p_{θ} (x_{t} | x_{t + 1})$ to match the Gaussian distribution $q (x_{t} | x_{t - 1})$ , which is defined in Equation 31.

Visually, this interpretation of the ELBO is depicted in Figure 4. The cost of optimizing a VDM is primarily dominated by the third term, since we must optimize over all timesteps $t$ .

Under this derivation, all terms of the ELBO are computed as expectations, and can therefore be approximated using Monte Carlo estimates. However, actually optimizing the ELBO using the terms we just derived might be suboptimal; because the consistency term is computed as an expectation over two random variables $\{x_{t - 1}, x_{t + 1}\}$ for every timestep, the variance of its Monte Carlo estimate could potentially be higher than a term that is estimated using only one random variable per timestep. As it is computed by summing up $T - 1$ consistency terms, the final estimated value of the ELBO may have high variance for large $T$ values.

Let us instead try to derive a form for our ELBO where each term is computed as an expectation over only one random variable at a time. The key insight is that we can rewrite encoder transitions as $q (x_{t} | x_{t - 1}) = q (x_{t} | x_{t - 1}, x_{0})$ , where the extra conditioning term is superfluous due to the Markov property. Then, according to Bayes rule, we can rewrite each transition as:

q (x_{t} | x_{t - 1}, x_{0}) = \frac{q (x_{t - 1} | x_{t}, x_{0}) q (x_{t} | x_{0})}{q (x_{t - 1} | x_{0})}

(46)

Armed with this new equation, we can retry the derivation resuming from the ELBO in Equation 37:

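$log p (x)$	$\geq E_{q (x_{1 : T} | x_{0})} [log \frac{p (x_{0 : T})}{q (x_{1 : T} | x_{0})}]$	(47)
	$= E_{q (x_{1 : T} | x_{0})} [log \frac{p (x_{T}) \prod_{t = 1}^{T} p_{θ} (x_{t - 1} | x_{t})}{\prod_{t = 1}^{T} q (x_{t} | x_{t - 1})}]$	(48)
	$= E_{q (x_{1 : T} | x_{0})} [log \frac{p (x_{T}) p_{θ} (x_{0} | x_{1}) \prod_{t = 2}^{T} p_{θ} (x_{t - 1} | x_{t})}{q (x_{1} | x_{0}) \prod_{t = 2}^{T} q (x_{t} | x_{t - 1})}]$	(49)
	$= E_{q (x_{1 : T} | x_{0})} [log \frac{p (x_{T}) p_{θ} (x_{0} | x_{1}) \prod_{t = 2}^{T} p_{θ} (x_{t - 1} | x_{t})}{q (x_{1} | x_{0}) \prod_{t = 2}^{T} q (x_{t} | x_{t - 1}, x_{0})}]$	(50)
	$= E_{q (x_{1 : T} | x_{0})} [log \frac{p (x_{T}) p_{θ} (x_{0} | x_{1})}{q (x_{1} | x_{0})} + log \prod_{t = 2}^{T} \frac{p_{θ} (x_{t - 1} | x_{t})}{q (x_{t} | x_{t - 1}, x_{0})}]$	(51)
	$= E_{q (x_{1 : T} | x_{0})} [log \frac{p (x_{T}) p_{θ} (x_{0} | x_{1})}{q (x_{1} | x_{0})} + log \prod_{t = 2}^{T} \frac{p_{θ} (x_{t - 1} | x_{t})}{\frac{q (x_{t - 1} | x_{t}, x_{0}) q (x_{t} | x_{0})}{q (x_{t - 1} | x_{0})}}]$	(52)
	$= E_{q (x_{1 : T} | x_{0})} [log \frac{p (x_{T}) p_{θ} (x_{0} | x_{1})}{q (x_{1} | x_{0})} + log \prod_{t = 2}^{T} \frac{p_{θ} (x_{t - 1} | x_{t})}{\frac{q (x_{t - 1} | x_{t}, x_{0}) q (x_{t} | x_{0})}{q (x_{t - 1} | x_{0})}}]$	(The ratios $\frac{q (x_{t} | x_{0})}{q (x_{t - 1} | x_{0})}$ telescope across the product)	(53)
	$= E_{q (x_{1 : T} | x_{0})} [log \frac{p (x_{T}) p_{θ} (x_{0} | x_{1})}{q (x_{1} | x_{0})} + log \frac{q (x_{1} | x_{0})}{q (x_{T} | x_{0})} + log \prod_{t = 2}^{T} \frac{p_{θ} (x_{t - 1} | x_{t})}{q (x_{t - 1} | x_{t}, x_{0})}]$	(54)
	$= E_{q (x_{1 : T} | x_{0})} [log \frac{p (x_{T}) p_{θ} (x_{0} | x_{1})}{q (x_{T} | x_{0})} + \sum_{t = 2}^{T} log \frac{p_{θ} (x_{t - 1} | x_{t})}{q (x_{t - 1} | x_{t}, x_{0})}]$	(55)
	$= E_{q (x_{1 : T} | x_{0})} [log p_{θ} (x_{0} | x_{1})] + E_{q (x_{1 : T} | x_{0})} [log \frac{p (x_{T})}{q (x_{T} | x_{0})}] + \sum_{t = 2}^{T} E_{q (x_{1 : T} | x_{0})} [log \frac{p_{θ} (x_{t - 1} | x_{t})}{q (x_{t - 1} | x_{t}, x_{0})}]$	(56)
	$= E_{q (x_{1} | x_{0})} [log p_{θ} (x_{0} | x_{1})] + E_{q (x_{T} | x_{0})} [log \frac{p (x_{T})}{q (x_{T} | x_{0})}] + \sum_{t = 2}^{T} E_{q (x_{t}, x_{t - 1} | x_{0})} [log \frac{p_{θ} (x_{t - 1} | x_{t})}{q (x_{t - 1} | x_{t}, x_{0})}]$	(57)
	$= \underbrace{E_{q (x_{1} | x_{0})} [log p_{θ} (x_{0} | x_{1})]}_{\text{reconstruction term}} - \underbrace{D_{KL} (q (x_{T} | x_{0}) ∥ p (x_{T}))}_{\text{prior matching term}} - \underbrace{\sum_{t = 2}^{T} E_{q (x_{t} | x_{0})} [D_{KL} (q (x_{t - 1} | x_{t}, x_{0}) ∥ p_{θ} (x_{t - 1} | x_{t}))]}_{\text{denoising matching term}}$	(58)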

We have therefore successfully derived an interpretation for the ELBO that can be estimated with lower variance, as each term is computed as an expectation of at most one random variable at a time. This formulation also has an elegant interpretation, which is revealed when inspecting each individual term:

$E_{q (x_{1} | x_{0})} [log p_{θ} (x_{0} | x_{1})]$ can be interpreted as a reconstruction term; like its analogue in the ELBO of a vanilla VAE, this term can be approximated and optimized using a Monte Carlo estimate.
$D_{KL} (q (x_{T} | x_{0}) ∥ p (x_{T}))$ represents how close the distribution of the final noisified input is to the standard Gaussian prior. It has no trainable parameters, and is also equal to zero under our assumptions.
$E_{q (x_{t} | x_{0})} [D_{KL} (q (x_{t - 1} | x_{t}, x_{0}) ∥ p_{θ} (x_{t - 1} | x_{t}))]$ is a denoising matching term. We learn desired denoising transition step $p_{θ} (x_{t - 1} | x_{t})$ as an approximation to tractable, ground-truth denoising transition step $q (x_{t - 1} | x_{t}, x_{0})$ . The $q (x_{t - 1} | x_{t}, x_{0})$ transition step can act as a ground-truth signal, since it defines how to denoise a noisy image $x_{t}$ with access to what the final, completely denoised image $x_{0}$ should be. This term is therefore minimized when the two denoising steps match as closely as possible, as measured by their KL Divergence.

As a side note, one observes that in the process of both ELBO derivations (Equation 45 and Equation 58), only the Markov assumption is used; as a result these formulae will hold true for any arbitrary Markovian HVAE. Furthermore, when we set $T = 1$ , both of the ELBO interpretations for a VDM exactly recreate the ELBO equation of a vanilla VAE, as written in Equation 19.
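As a quick worked check of the $T = 1$ claim, the lines below specialize Equation 58; this is a sketch of the substitution rather than a new result. With $T = 1$ the sum over $t = 2, \dots, T$ is empty, and renaming $x_{0}$ as $x$ and $x_{1}$ as $z$ gives the form of Equation 19, with the encoder $q$ fixed rather than learned in the case of a VDM.

```latex
\begin{align*}
% Equation 58 with T = 1: the denoising matching sum has no terms.
\log p(x_0)
  &\geq E_{q(x_1 | x_0)}\left[\log p_\theta(x_0 | x_1)\right]
      - D_{KL}\left(q(x_1 | x_0) \,\|\, p(x_1)\right) \\
% Rename x_0 -> x and x_1 -> z to match the notation of Equation 19.
  &= E_{q(z | x)}\left[\log p_\theta(x | z)\right]
      - D_{KL}\left(q(z | x) \,\|\, p(z)\right)
\end{align*}
```

The same reduction applies to Equation 45: its consistency sum over $t = 1, \dots, T - 1$ is empty, and the outer expectation in the prior matching term is over $x_{T - 1} = x_{0}$ , which is already fixed.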

In this derivation of the ELBO, the bulk of the optimization cost once again lies in the summation term, which dominates the reconstruction term. Whereas each KL Divergence term $D_{KL} (q (x_{t - 1} | x_{t}, x_{0}) ∥ p_{θ} (x_{t - 1} | x_{t}))$ is difficult to minimize for arbitrary posteriors in arbitrarily complex Markovian HVAEs due to the added complexity of simultaneously learning the encoder, in a VDM we can leverage the Gaussian transition assumption to make optimization tractable. By Bayes rule, we have:

q (x_{t - 1} | x_{t}, x_{0}) = \frac{q (x_{t} | x_{t - 1}, x_{0}) q (x_{t - 1} | x_{0})}{q (x_{t} | x_{0})}

Figure 5: Depicted is an alternate, lower-variance method to optimize a VDM; we compute the form of ground-truth denoising step $q (x_{t - 1} | x_{t}, x_{0})$ using Bayes rule, and minimize its KL Divergence with our approximate denoising step $p_{θ} (x_{t - 1} | x_{t})$ . This is once again denoted visually by matching the distributions represented by the green arrows with those of the pink arrows. Artistic liberty is at play here; in the full picture, each pink arrow must also stem from $x_{0}$ , as it is also a conditioning term.

As we already know that $q (x_{t} | x_{t - 1}, x_{0}) = q (x_{t} | x_{t - 1}) = N (x_{t}; \sqrt{α_{t}} x_{t - 1}, (1 - α_{t}) I)$ from our assumption regarding encoder transitions (Equation 31), what remains is deriving the forms of $q (x_{t} | x_{0})$ and $q (x_{t - 1} | x_{0})$ . Fortunately, these are also made tractable by utilizing the fact that the encoder transitions of a VDM are linear Gaussian models. Recall that under the reparameterization trick, samples $x_{t} \sim q (x_{t} | x_{t - 1})$ can be rewritten as:

x_{t} = \sqrt{α_{t}} x_{t - 1} + \sqrt{1 - α_{t}} ϵ with ϵ \sim N (ϵ; 0, I)

(59)

and that similarly, samples $x_{t - 1} \sim q (x_{t - 1} | x_{t - 2})$ can be rewritten as:

x_{t - 1} = \sqrt{α_{t - 1}} x_{t - 2} + \sqrt{1 - α_{t - 1}} ϵ with ϵ \sim N (ϵ; 0, I)

(60)

Then, the form of $q (x_{t} | x_{0})$ can be recursively derived through repeated applications of the reparameterization trick. Suppose that we have access to $2T$ random noise variables $\{ϵ_{t}^{*}, ϵ_{t}\}_{t = 0}^{T} \overset{iid}{\sim} N (ϵ; 0, I)$ . Then, for an arbitrary sample $x_{t} \sim q (x_{t} | x_{0})$ , we can rewrite it as:

$x_{t}$	$= \sqrt{α_{t}} x_{t - 1} + \sqrt{1 - α_{t}} ϵ_{t - 1}^{*}$	(61)
	$= \sqrt{α_{t}} (\sqrt{α_{t - 1}} x_{t - 2} + \sqrt{1 - α_{t - 1}} ϵ_{t - 2}^{*}) + \sqrt{1 - α_{t}} ϵ_{t - 1}^{*}$	(62)
	$= \sqrt{α_{t} α_{t - 1}} x_{t - 2} + \sqrt{α_{t} - α_{t} α_{t - 1}} ϵ_{t - 2}^{*} + \sqrt{1 - α_{t}} ϵ_{t - 1}^{*}$	(63)
	$= \sqrt{α_{t} α_{t - 1}} x_{t - 2} + \sqrt{{\sqrt{α_{t} - α_{t} α_{t - 1}}}^{2} + {\sqrt{1 - α_{t}}}^{2}} ϵ_{t - 2}$	(64)
	$= \sqrt{α_{t} α_{t - 1}} x_{t - 2} + \sqrt{α_{t} - α_{t} α_{t - 1} + 1 - α_{t}} ϵ_{t - 2}$	(65)
	$= \sqrt{α_{t} α_{t - 1}} x_{t - 2} + \sqrt{1 - α_{t} α_{t - 1}} ϵ_{t - 2}$	(66)
	$= \dots$	(67)
	$= \sqrt{\prod_{i = 1}^{t} α_{i}} x_{0} + \sqrt{1 - \prod_{i = 1}^{t} α_{i}} ϵ_{0}$	(68)
	$= \sqrt{{¯ α}_{t}} x_{0} + \sqrt{1 - {¯ α}_{t}} ϵ_{0}$	(69)
	$\sim N (x_{t}; \sqrt{{¯ α}_{t}} x_{0}, (1 - {¯ α}_{t}) I)$	(70)

where in Equation 64 we have utilized the fact that the sum of two independent Gaussian random variables remains a Gaussian with mean being the sum of the two means, and variance being the sum of the two variances. Interpreting $\sqrt{1 - α_{t}} ϵ_{t - 1}^{*}$ as a sample from Gaussian $N (0, (1 - α_{t}) I)$ , and $\sqrt{α_{t} - α_{t} α_{t - 1}} ϵ_{t - 2}^{*}$ as a sample from Gaussian $N (0, (α_{t} - α_{t} α_{t - 1}) I)$ , we can then treat their sum as a random variable sampled from Gaussian $N (0, (1 - α_{t} + α_{t} - α_{t} α_{t - 1}) I) = N (0, (1 - α_{t} α_{t - 1}) I)$ . A sample from this distribution can then be represented using the reparameterization trick as $\sqrt{1 - α_{t} α_{t - 1}} ϵ_{t - 2}$ , as in Equation 66.
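The recursion in Equations 61 to 70 can be checked numerically without sampling. The following is a minimal sketch, assuming a short hypothetical schedule of $α_{t}$ values chosen only for illustration: it tracks the coefficient on $x_{0}$ and the accumulated noise variance through the single-step transitions of Equation 59 and compares them with the closed form of Equation 70.

```python
import math

def closed_form_coeffs(alphas):
    """Return (sqrt(alpha_bar_t), 1 - alpha_bar_t) for t = len(alphas), as in Eq. 70."""
    alpha_bar = math.prod(alphas)
    return math.sqrt(alpha_bar), 1.0 - alpha_bar

def iterated_coeffs(alphas):
    """Apply the single-step transition of Eq. 59 repeatedly, tracking the
    coefficient on x_0 and the variance of the accumulated Gaussian noise."""
    mean_coeff, var = 1.0, 0.0
    for a in alphas:
        mean_coeff *= math.sqrt(a)   # the signal is scaled by sqrt(alpha_t)
        var = a * var + (1.0 - a)    # old noise is scaled, fresh noise is added
    return mean_coeff, var

alphas = [0.99, 0.98, 0.97, 0.95]    # hypothetical schedule, for illustration only
print(closed_form_coeffs(alphas))
print(iterated_coeffs(alphas))
```

The two printed pairs should agree up to floating-point error, which is the variance-addition argument of Equation 64 applied at every step.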

We have therefore derived the Gaussian form of $q (x_{t} | x_{0})$ . This derivation can be modified to also yield the Gaussian parameterization describing $q (x_{t - 1} | x_{0})$ . Now, knowing the forms of both $q (x_{t} | x_{0})$ and $q (x_{t - 1} | x_{0})$ , we can proceed to calculate the form of $q (x_{t - 1} | x_{t}, x_{0})$ by substituting into the Bayes rule expansion:

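$q (x_{t - 1} | x_{t}, x_{0})$	$= \frac{q (x_{t} | x_{t - 1}, x_{0}) q (x_{t - 1} | x_{0})}{q (x_{t} | x_{0})}$	(71)
	$= \frac{N (x_{t}; \sqrt{α_{t}} x_{t - 1}, (1 - α_{t}) I) N (x_{t - 1}; \sqrt{{¯ α}_{t - 1}} x_{0}, (1 - {¯ α}_{t - 1}) I)}{N (x_{t}; \sqrt{{¯ α}_{t}} x_{0}, (1 - {¯ α}_{t}) I)}$	(72)
	$∝ \exp \{ - [\frac{(x_{t} - \sqrt{α_{t}} x_{t - 1})^{2}}{2 (1 - α_{t})} + \frac{(x_{t - 1} - \sqrt{{¯ α}_{t - 1}} x_{0})^{2}}{2 (1 - {¯ α}_{t - 1})} - \frac{(x_{t} - \sqrt{{¯ α}_{t}} x_{0})^{2}}{2 (1 - {¯ α}_{t})}] \}$	(73)
	$= \exp \{ - \frac{1}{2} [\frac{(x_{t} - \sqrt{α_{t}} x_{t - 1})^{2}}{1 - α_{t}} + \frac{(x_{t - 1} - \sqrt{{¯ α}_{t - 1}} x_{0})^{2}}{1 - {¯ α}_{t - 1}} - \frac{(x_{t} - \sqrt{{¯ α}_{t}} x_{0})^{2}}{1 - {¯ α}_{t}}] \}$	(74)
	$= \exp \{ - \frac{1}{2} [\frac{(- 2 \sqrt{α_{t}} x_{t} x_{t - 1} + α_{t} x_{t - 1}^{2})}{1 - α_{t}} + \frac{(x_{t - 1}^{2} - 2 \sqrt{{¯ α}_{t - 1}} x_{t - 1} x_{0})}{1 - {¯ α}_{t - 1}} + C (x_{t}, x_{0})] \}$	(75)
	$∝ \exp \{ - \frac{1}{2} [- \frac{2 \sqrt{α_{t}} x_{t} x_{t - 1}}{1 - α_{t}} + \frac{α_{t} x_{t - 1}^{2}}{1 - α_{t}} + \frac{x_{t - 1}^{2}}{1 - {¯ α}_{t - 1}} - \frac{2 \sqrt{{¯ α}_{t - 1}} x_{t - 1} x_{0}}{1 - {¯ α}_{t - 1}}] \}$	(76)
	$= \exp \{ - \frac{1}{2} [(\frac{α_{t}}{1 - α_{t}} + \frac{1}{1 - {¯ α}_{t - 1}}) x_{t - 1}^{2} - 2 (\frac{\sqrt{α_{t}} x_{t}}{1 - α_{t}} + \frac{\sqrt{{¯ α}_{t - 1}} x_{0}}{1 - {¯ α}_{t - 1}}) x_{t - 1}] \}$	(77)
	$= \exp \{ - \frac{1}{2} [\frac{α_{t} (1 - {¯ α}_{t - 1}) + 1 - α_{t}}{(1 - α_{t}) (1 - {¯ α}_{t - 1})} x_{t - 1}^{2} - 2 (\frac{\sqrt{α_{t}} x_{t}}{1 - α_{t}} + \frac{\sqrt{{¯ α}_{t - 1}} x_{0}}{1 - {¯ α}_{t - 1}}) x_{t - 1}] \}$	(78)
	$= \exp \{ - \frac{1}{2} [\frac{α_{t} - {¯ α}_{t} + 1 - α_{t}}{(1 - α_{t}) (1 - {¯ α}_{t - 1})} x_{t - 1}^{2} - 2 (\frac{\sqrt{α_{t}} x_{t}}{1 - α_{t}} + \frac{\sqrt{{¯ α}_{t - 1}} x_{0}}{1 - {¯ α}_{t - 1}}) x_{t - 1}] \}$	(79)
	$= \exp \{ - \frac{1}{2} [\frac{1 - {¯ α}_{t}}{(1 - α_{t}) (1 - {¯ α}_{t - 1})} x_{t - 1}^{2} - 2 (\frac{\sqrt{α_{t}} x_{t}}{1 - α_{t}} + \frac{\sqrt{{¯ α}_{t - 1}} x_{0}}{1 - {¯ α}_{t - 1}}) x_{t - 1}] \}$	(80)
	$= \exp \{ - \frac{1}{2} (\frac{1 - {¯ α}_{t}}{(1 - α_{t}) (1 - {¯ α}_{t - 1})}) [x_{t - 1}^{2} - 2 \frac{(\frac{\sqrt{α_{t}} x_{t}}{1 - α_{t}} + \frac{\sqrt{{¯ α}_{t - 1}} x_{0}}{1 - {¯ α}_{t - 1}})}{\frac{1 - {¯ α}_{t}}{(1 - α_{t}) (1 - {¯ α}_{t - 1})}} x_{t - 1}] \}$	(81)
	$= \exp \{ - \frac{1}{2} (\frac{1 - {¯ α}_{t}}{(1 - α_{t}) (1 - {¯ α}_{t - 1})}) [x_{t - 1}^{2} - 2 \frac{(\frac{\sqrt{α_{t}} x_{t}}{1 - α_{t}} + \frac{\sqrt{{¯ α}_{t - 1}} x_{0}}{1 - {¯ α}_{t - 1}}) (1 - α_{t}) (1 - {¯ α}_{t - 1})}{1 - {¯ α}_{t}} x_{t - 1}] \}$	(82)
	$= \exp \{ - \frac{1}{2} (\frac{1}{\frac{(1 - α_{t}) (1 - {¯ α}_{t - 1})}{1 - {¯ α}_{t}}}) [x_{t - 1}^{2} - 2 \frac{\sqrt{α_{t}} (1 - {¯ α}_{t - 1}) x_{t} + \sqrt{{¯ α}_{t - 1}} (1 - α_{t}) x_{0}}{1 - {¯ α}_{t}} x_{t - 1}] \}$	(83)
	$∝ N (x_{t - 1}; \underbrace{\frac{\sqrt{α_{t}} (1 - {¯ α}_{t - 1}) x_{t} + \sqrt{{¯ α}_{t - 1}} (1 - α_{t}) x_{0}}{1 - {¯ α}_{t}}}_{μ_{q} (x_{t}, x_{0})}, \underbrace{\frac{(1 - α_{t}) (1 - {¯ α}_{t - 1})}{1 - {¯ α}_{t}} I}_{Σ_{q} (t)})$	(84)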

where in Equation 75, $C (x_{t}, x_{0})$ is a constant term with respect to $x_{t - 1}$ computed as a combination of only $x_{t}$ , $x_{0}$ , and $α$ values; this term is implicitly returned in Equation 84 to complete the square.

We have therefore shown that at each step, $x_{t - 1} \sim q (x_{t - 1} | x_{t}, x_{0})$ is normally distributed, with mean $μ_{q} (x_{t}, x_{0})$ that is a function of $x_{t}$ and $x_{0}$ , and variance $Σ_{q} (t)$ as a function of $α$ coefficients. These $α$ coefficients are known and fixed at each timestep; they are either set permanently when modeled as hyperparameters, or treated as the current inference output of a network that seeks to model them. Following Equation 84, we can rewrite our variance equation as $Σ_{q} (t) = σ_{q}^{2} (t) I$ , where:

σ_{q}^{2} (t) = \frac{(1 - α_{t}) (1 - {¯ α}_{t - 1})}{1 - {¯ α}_{t}}

(85)
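As a sanity check on the completed square, the posterior parameters can be computed in two ways. The sketch below works in the scalar case with illustrative inputs (the function names are our own): the first function evaluates Equations 84 and 85 directly, and the second reads the mean and variance off the quadratic in Equation 77, where $\exp \{- \frac{1}{2} [a x_{t - 1}^{2} - 2 b x_{t - 1}]\}$ corresponds to a Gaussian with variance $1 / a$ and mean $b / a$.

```python
import math

def posterior_params(x_t, x_0, alpha_t, alpha_bar_prev):
    """Mean and variance of q(x_{t-1} | x_t, x_0) from Eq. 84 and Eq. 85 (scalar case)."""
    alpha_bar_t = alpha_t * alpha_bar_prev
    mean = (math.sqrt(alpha_t) * (1 - alpha_bar_prev) * x_t
            + math.sqrt(alpha_bar_prev) * (1 - alpha_t) * x_0) / (1 - alpha_bar_t)
    var = (1 - alpha_t) * (1 - alpha_bar_prev) / (1 - alpha_bar_t)
    return mean, var

def posterior_params_from_quadratic(x_t, x_0, alpha_t, alpha_bar_prev):
    """Same quantities read off the quadratic of Eq. 77: variance 1/a, mean b/a."""
    a = alpha_t / (1 - alpha_t) + 1 / (1 - alpha_bar_prev)
    b = (math.sqrt(alpha_t) * x_t / (1 - alpha_t)
         + math.sqrt(alpha_bar_prev) * x_0 / (1 - alpha_bar_prev))
    return b / a, 1 / a

# Illustrative values only.
print(posterior_params(0.3, 1.2, 0.9, 0.6))
print(posterior_params_from_quadratic(0.3, 1.2, 0.9, 0.6))
```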

In order to match approximate denoising transition step $p_{θ} (x_{t - 1} | x_{t})$ to ground-truth denoising transition step $q (x_{t - 1} | x_{t}, x_{0})$ as closely as possible, we can also model it as a Gaussian. Furthermore, as all $α$ terms are known to be frozen at each timestep, we can immediately construct the variance of the approximate denoising transition step to also be $Σ_{q} (t) = σ_{q}^{2} (t) I$ . We must parameterize its mean $μ_{θ} (x_{t}, t)$ as a function of $x_{t}$ , however, since $p_{θ} (x_{t - 1} | x_{t})$ does not condition on $x_{0}$ .

Recall that the KL Divergence between two Gaussian distributions is:

D_{KL} (N (x; μ_{x}, Σ_{x}) ∥ N (y; μ_{y}, Σ_{y}))

= \frac{1}{2} [log \frac{| Σ_{y} |}{| Σ_{x} |} - d + tr (Σ_{y}^{- 1} Σ_{x}) + (μ_{y} - μ_{x})^{T} Σ_{y}^{- 1} (μ_{y} - μ_{x})]

(86)
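To make Equation 86 concrete, here is a minimal sketch restricted to diagonal covariances, which is an assumption made only to keep the code short. When both Gaussians share the same isotropic variance, the log-determinant, dimension, and trace terms cancel and only the scaled squared distance between the means remains.

```python
import math

def kl_diag_gaussians(mu_x, var_x, mu_y, var_y):
    """Eq. 86 specialised to diagonal covariances given as lists of variances."""
    d = len(mu_x)
    log_det_ratio = sum(math.log(vy / vx) for vx, vy in zip(var_x, var_y))
    trace = sum(vx / vy for vx, vy in zip(var_x, var_y))
    maha = sum((my - mx) ** 2 / vy for mx, my, vy in zip(mu_x, mu_y, var_y))
    return 0.5 * (log_det_ratio - d + trace + maha)

# Illustrative means and a shared variance.
mu_q, mu_theta, sigma2 = [0.0, 1.0], [0.5, -1.0], 0.3
kl = kl_diag_gaussians(mu_q, [sigma2] * 2, mu_theta, [sigma2] * 2)
shortcut = sum((a - b) ** 2 for a, b in zip(mu_q, mu_theta)) / (2 * sigma2)
print(kl, shortcut)
```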

In our case, where we can set the variances of the two Gaussians to match exactly, optimizing the KL Divergence term reduces to minimizing the difference between the means of the two distributions:

	$\arg\min_{θ} D_{KL} (q (x_{t - 1} | x_{t}, x_{0}) ∥ p_{θ} (x_{t - 1} | x_{t}))$
	$= \arg\min_{θ} D_{KL} (N (x_{t - 1}; μ_{q}, Σ_{q} (t)) ∥ N (x_{t - 1}; μ_{θ}, Σ_{q} (t)))$		(87)
	$= \arg\min_{θ} \frac{1}{2} [log \frac{| Σ_{q} (t) |}{| Σ_{q} (t) |} - d + tr (Σ_{q} (t)^{- 1} Σ_{q} (t)) + (μ_{θ} - μ_{q})^{T} Σ_{q} (t)^{- 1} (μ_{θ} - μ_{q})]$		(88)
	$= \arg\min_{θ} \frac{1}{2} [log 1 - d + d + (μ_{θ} - μ_{q})^{T} Σ_{q} (t)^{- 1} (μ_{θ} - μ_{q})]$		(89)
	$= \arg\min_{θ} \frac{1}{2} [(μ_{θ} - μ_{q})^{T} Σ_{q} (t)^{- 1} (μ_{θ} - μ_{q})]$		(90)
	$= \arg\min_{θ} \frac{1}{2} [(μ_{θ} - μ_{q})^{T} {(σ_{q}^{2} (t) I)}^{- 1} (μ_{θ} - μ_{q})]$		(91)
	$= \arg\min_{θ} \frac{1}{2 σ_{q}^{2} (t)} [{∥ μ_{θ} - μ_{q} ∥}_{2}^{2}]$		(92)

where we have written $μ_{q}$ as shorthand for $μ_{q} (x_{t}, x_{0})$ , and $μ_{θ}$ as shorthand for $μ_{θ} (x_{t}, t)$ for brevity. In other words, we want to optimize a $μ_{θ} (x_{t}, t)$ that matches $μ_{q} (x_{t}, x_{0})$ , which from our derived Equation 84, takes the form:

μ_{q} (x_{t}, x_{0}) = \frac{\sqrt{α_{t}} (1 - {¯ α}_{t - 1}) x_{t} + \sqrt{{¯ α}_{t - 1}} (1 - α_{t}) x_{0}}{1 - {¯ α}_{t}}

(93)

As $μ_{θ} (x_{t}, t)$ also conditions on $x_{t}$ , we can match $μ_{q} (x_{t}, x_{0})$ closely by setting it to the following form:

μ_{θ} (x_{t}, t) = \frac{\sqrt{α_{t}} (1 - {¯ α}_{t - 1}) x_{t} + \sqrt{{¯ α}_{t - 1}} (1 - α_{t}) {^x}_{θ} (x_{t}, t)}{1 - {¯ α}_{t}}

(94)

where ${^x}_{θ} (x_{t}, t)$ is parameterized by a neural network that seeks to predict $x_{0}$ from noisy image $x_{t}$ and time index $t$ . Then, the optimization problem simplifies to:


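	$\arg\min_{θ} D_{KL} (q (x_{t - 1} | x_{t}, x_{0}) ∥ p_{θ} (x_{t - 1} | x_{t}))$
	$= \arg\min_{θ} D_{KL} (N (x_{t - 1}; μ_{q}, Σ_{q} (t)) ∥ N (x_{t - 1}; μ_{θ}, Σ_{q} (t)))$		(95)
	$= \arg\min_{θ} \frac{1}{2 σ_{q}^{2} (t)} [{∥ \frac{\sqrt{α_{t}} (1 - {¯ α}_{t - 1}) x_{t} + \sqrt{{¯ α}_{t - 1}} (1 - α_{t}) {^x}_{θ} (x_{t}, t)}{1 - {¯ α}_{t}} - \frac{\sqrt{α_{t}} (1 - {¯ α}_{t - 1}) x_{t} + \sqrt{{¯ α}_{t - 1}} (1 - α_{t}) x_{0}}{1 - {¯ α}_{t}} ∥}_{2}^{2}]$		(96)
	$= \arg\min_{θ} \frac{1}{2 σ_{q}^{2} (t)} [{∥ \frac{\sqrt{{¯ α}_{t - 1}} (1 - α_{t}) {^x}_{θ} (x_{t}, t)}{1 - {¯ α}_{t}} - \frac{\sqrt{{¯ α}_{t - 1}} (1 - α_{t}) x_{0}}{1 - {¯ α}_{t}} ∥}_{2}^{2}]$		(97)
	$= \arg\min_{θ} \frac{1}{2 σ_{q}^{2} (t)} [{∥ \frac{\sqrt{{¯ α}_{t - 1}} (1 - α_{t})}{1 - {¯ α}_{t}} ({^x}_{θ} (x_{t}, t) - x_{0}) ∥}_{2}^{2}]$		(98)
	$= \arg\min_{θ} \frac{1}{2 σ_{q}^{2} (t)} \frac{{¯ α}_{t - 1} (1 - α_{t})^{2}}{(1 - {¯ α}_{t})^{2}} [{∥ {^x}_{θ} (x_{t}, t) - x_{0} ∥}_{2}^{2}]$		(99)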

Therefore, optimizing a VDM boils down to learning a neural network to predict the original ground truth image from an arbitrarily noisified version of it [3]. Furthermore, minimizing the summation term of our derived ELBO objective (Equation 58) across all noise levels can be approximated by minimizing the expectation over all timesteps:

\arg\min_{θ} E_{t \sim U \{2, T\}} [E_{q (x_{t} | x_{0})} [D_{KL} (q (x_{t - 1} | x_{t}, x_{0}) ∥ p_{θ} (x_{t - 1} | x_{t}))]]

(100)

which can then be optimized using stochastic samples over timesteps.
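A single stochastic sample of this objective might be drawn as in the sketch below. This is an illustration under stated assumptions, not a training recipe: `predict_x0` is a hypothetical stand-in for the network ${^x}_{θ} (x_{t}, t)$ , the data are plain Python lists, and the schedule `alphas` is made up. The per-timestep weight is the coefficient from Equation 99.

```python
import math
import random

def sample_vdm_loss(x_0, alphas, predict_x0, rng):
    """One Monte Carlo sample of the per-timestep term in Eq. 100, using the
    x_0-prediction form of Eq. 99. alphas[i] holds alpha_{i+1}."""
    T = len(alphas)
    t = rng.randint(2, T)                        # t ~ U{2, T}, both ends inclusive
    alpha_t = alphas[t - 1]
    alpha_bar_prev = math.prod(alphas[:t - 1])   # alpha_bar_{t-1}
    alpha_bar_t = alpha_bar_prev * alpha_t
    # Draw x_t ~ q(x_t | x_0) with the reparameterization of Eq. 69.
    x_t = [math.sqrt(alpha_bar_t) * x + math.sqrt(1 - alpha_bar_t) * rng.gauss(0.0, 1.0)
           for x in x_0]
    sigma2_q = (1 - alpha_t) * (1 - alpha_bar_prev) / (1 - alpha_bar_t)   # Eq. 85
    weight = alpha_bar_prev * (1 - alpha_t) ** 2 / (2 * sigma2_q * (1 - alpha_bar_t) ** 2)
    x_hat = predict_x0(x_t, t)                   # hypothetical network call
    return weight * sum((a - b) ** 2 for a, b in zip(x_hat, x_0))

rng = random.Random(0)
x_0 = [0.5, -1.0]
alphas = [0.99, 0.97, 0.95, 0.9]                 # hypothetical schedule
# A deliberately poor predictor that ignores x_t and returns zeros.
print(sample_vdm_loss(x_0, alphas, lambda x_t, t: [0.0, 0.0], rng))
```

Averaging many such samples over data points and timesteps gives an unbiased estimate of the expectation, up to the constant factor relating the uniform expectation to the sum.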

Learning Diffusion Noise Parameters

Let us investigate how the noise parameters of a VDM can be jointly learned. One potential approach is to model $α_{t}$ using a neural network ${^α}_{η} (t)$ with parameters $η$ . However, this is inefficient as inference must be performed multiple times at each timestep $t$ to compute ${¯ α}_{t}$ . Whereas caching can mitigate this computational cost, we can also derive an alternate way to learn the diffusion noise parameters. By substituting our variance equation from Equation 85 into our derived per-timestep objective in Equation 99, we can reduce:

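$\frac{1}{2 σ_{q}^{2} (t)} \frac{{¯ α}_{t - 1} (1 - α_{t})^{2}}{(1 - {¯ α}_{t})^{2}} [{∥ {^x}_{θ} (x_{t}, t) - x_{0} ∥}_{2}^{2}]$	$= \frac{1}{2 \frac{(1 - α_{t}) (1 - {¯ α}_{t - 1})}{1 - {¯ α}_{t}}} \frac{{¯ α}_{t - 1} (1 - α_{t})^{2}}{(1 - {¯ α}_{t})^{2}} [{∥ {^x}_{θ} (x_{t}, t) - x_{0} ∥}_{2}^{2}]$	(101)
	$= \frac{1}{2} \frac{1 - {¯ α}_{t}}{(1 - α_{t}) (1 - {¯ α}_{t - 1})} \frac{{¯ α}_{t - 1} (1 - α_{t})^{2}}{(1 - {¯ α}_{t})^{2}} [{∥ {^x}_{θ} (x_{t}, t) - x_{0} ∥}_{2}^{2}]$	(102)
	$= \frac{1}{2} \frac{{¯ α}_{t - 1} (1 - α_{t})}{(1 - {¯ α}_{t - 1}) (1 - {¯ α}_{t})} [{∥ {^x}_{θ} (x_{t}, t) - x_{0} ∥}_{2}^{2}]$	(103)
	$= \frac{1}{2} \frac{{¯ α}_{t - 1} - {¯ α}_{t}}{(1 - {¯ α}_{t - 1}) (1 - {¯ α}_{t})} [{∥ {^x}_{θ} (x_{t}, t) - x_{0} ∥}_{2}^{2}]$	(104)
	$= \frac{1}{2} \frac{{¯ α}_{t - 1} - {¯ α}_{t - 1} {¯ α}_{t} + {¯ α}_{t - 1} {¯ α}_{t} - {¯ α}_{t}}{(1 - {¯ α}_{t - 1}) (1 - {¯ α}_{t})} [{∥ {^x}_{θ} (x_{t}, t) - x_{0} ∥}_{2}^{2}]$	(105)
	$= \frac{1}{2} \frac{{¯ α}_{t - 1} (1 - {¯ α}_{t}) - {¯ α}_{t} (1 - {¯ α}_{t - 1})}{(1 - {¯ α}_{t - 1}) (1 - {¯ α}_{t})} [{∥ {^x}_{θ} (x_{t}, t) - x_{0} ∥}_{2}^{2}]$	(106)
	$= \frac{1}{2} (\frac{{¯ α}_{t - 1} (1 - {¯ α}_{t})}{(1 - {¯ α}_{t - 1}) (1 - {¯ α}_{t})} - \frac{{¯ α}_{t} (1 - {¯ α}_{t - 1})}{(1 - {¯ α}_{t - 1}) (1 - {¯ α}_{t})}) [{∥ {^x}_{θ} (x_{t}, t) - x_{0} ∥}_{2}^{2}]$	(107)
	$= \frac{1}{2} (\frac{{¯ α}_{t - 1}}{1 - {¯ α}_{t - 1}} - \frac{{¯ α}_{t}}{1 - {¯ α}_{t}}) [{∥ {^x}_{θ} (x_{t}, t) - x_{0} ∥}_{2}^{2}]$	(108)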

Recall from Equation 70 that $q (x_{t} | x_{0})$ is a Gaussian of form $N (x_{t}; \sqrt{{¯ α}_{t}} x_{0}, (1 - {¯ α}_{t}) I)$ . Then, following the definition of the signal-to-noise ratio (SNR) as $SNR = \frac{μ^{2}}{σ^{2}}$ , we can write the SNR at each timestep $t$ as:

SNR (t)

= \frac{{¯ α}_{t}}{1 - {¯ α}_{t}}

(109)

Then, our derived Equation 108 (and Equation 99) can be simplified as:

\frac{1}{2 σ_{q}^{2} (t)} \frac{{¯ α}_{t - 1} (1 - α_{t})^{2}}{(1 - {¯ α}_{t})^{2}} [{∥ {^x}_{θ} (x_{t}, t) - x_{0} ∥}_{2}^{2}] = \frac{1}{2} (SNR (t - 1) - SNR (t)) [{∥ {^x}_{θ} (x_{t}, t) - x_{0} ∥}_{2}^{2}]

(110)
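Combining Equation 108 with the SNR definition in Equation 109, the per-timestep weight equals $\frac{1}{2} (SNR (t - 1) - SNR (t))$ . The sketch below checks this numerically against the coefficient of Equation 99 for illustrative values of $α_{t}$ and ${¯ α}_{t - 1}$ ; the function names are our own.

```python
def weight_eq99(alpha_t, alpha_bar_prev):
    """Coefficient in front of the squared error in Eq. 99."""
    alpha_bar_t = alpha_t * alpha_bar_prev
    sigma2_q = (1 - alpha_t) * (1 - alpha_bar_prev) / (1 - alpha_bar_t)   # Eq. 85
    return alpha_bar_prev * (1 - alpha_t) ** 2 / (2 * sigma2_q * (1 - alpha_bar_t) ** 2)

def snr(alpha_bar):
    """Eq. 109: signal-to-noise ratio of q(x_t | x_0)."""
    return alpha_bar / (1 - alpha_bar)

def weight_snr(alpha_t, alpha_bar_prev):
    """Half the drop in SNR between timestep t-1 and timestep t."""
    return 0.5 * (snr(alpha_bar_prev) - snr(alpha_t * alpha_bar_prev))

print(weight_eq99(0.9, 0.8), weight_snr(0.9, 0.8))
```

Since ${¯ α}_{t} < {¯ α}_{t - 1}$ whenever $α_{t} < 1$ , the weight is positive, consistent with the SNR decreasing over timesteps.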

As the name implies, the SNR represents the ratio between the original signal and the amount of noise present; a higher SNR represents more signal and a lower SNR represents more noise. In a diffusion model, we require the SNR to monotonically decrease as timestep $t$ increases; this formalizes the notion that perturbed input $x_{t}$ becomes increasingly noisy over time, until it becomes identical to a standard Gaussian at $t = T$ .

Following the simplification of the objective in Equation 110, we can directly parameterize the SNR at each timestep using a neural network, and learn it jointly along with the diffusion model. As the SNR must monotonically decrease over time, we can represent it as:

SNR (t) = exp (- ω_{η} (t))

(111)

where $ω_{η} (t)$ is modeled as a monotonically increasing neural network with parameters $η$ . Negating $ω_{η} (t)$ results in a monotonically decreasing function, whereas the exponential forces the resulting term to be positive. Note that the objective in Equation 100 must now optimize over $η$ as well. By combining our parameterization of SNR in Equation 111 with our definition of SNR in Equation 109, we can also explicitly derive elegant forms for the value of ${¯ α}_{t}$ as well as for the value of $1 - {¯ α}_{t}$ :

	$\frac{{¯ α}_{t}}{1 - {¯ α}_{t}} = exp (- ω_{η} (t))$		(112)
	$∴ {¯ α}_{t} = sigmoid (- ω_{η} (t))$		(113)
	$∴ 1 - {¯ α}_{t} = sigmoid (ω_{η} (t))$		(114)

These terms are necessary for a variety of computations; for example, during optimization, they are used to create arbitrarily noisy $x_{t}$ from input $x_{0}$ using the reparameterization trick, as derived in Equation 69.
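The sigmoid forms in Equations 113 and 114 are easy to verify. In the sketch below, `omega_toy` is a hypothetical linear stand-in for the monotonically increasing network $ω_{η} (t)$ , with time rescaled to $[0, 1]$ purely for illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def alpha_bar_from_omega(omega):
    """Eq. 113 and Eq. 114: (alpha_bar_t, 1 - alpha_bar_t) from omega_eta(t)."""
    return sigmoid(-omega), sigmoid(omega)

def omega_toy(t):
    """Hypothetical monotonically increasing function standing in for omega_eta(t)."""
    return -5.0 + 10.0 * t

for t in (0.0, 0.5, 1.0):
    alpha_bar, one_minus = alpha_bar_from_omega(omega_toy(t))
    # The ratio recovers SNR(t) = exp(-omega_eta(t)) from Eq. 111.
    print(t, alpha_bar / one_minus, math.exp(-omega_toy(t)))
```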

Three Equivalent Interpretations

As we previously proved, a Variational Diffusion Model can be trained by simply learning a neural network to predict the original natural image $x_{0}$ from an arbitrary noised version $x_{t}$ and its time index $t$ . However, $x_{0}$ has two other equivalent parameterizations, which leads to two further interpretations for a VDM.

Firstly, we can utilize the reparameterization trick. In our derivation of the form of $q (x_{t} | x_{0})$ , we can rearrange Equation 69 to show that:

x_{0}

= \frac{x_{t} - \sqrt{1 - {¯ α}_{t}} ϵ_{0}}{\sqrt{{¯ α}_{t}}}

(115)

Plugging this into our previously derived true denoising transition mean $μ_{q} (x_{t}, x_{0})$ , we can rederive as:

$μ_{q} (x_{t}, x_{0})$	$= \frac{\sqrt{α_{t}} (1 - {¯ α}_{t - 1}) x_{t} + \sqrt{{¯ α}_{t - 1}} (1 - α_{t}) x_{0}}{1 - {¯ α}_{t}}$	(116)
	$= \frac{\sqrt{α_{t}} (1 - {¯ α}_{t - 1}) x_{t} + \sqrt{{¯ α}_{t - 1}} (1 - α_{t}) \frac{x_{t} - \sqrt{1 - {¯ α}_{t}} ϵ_{0}}{\sqrt{{¯ α}_{t}}}}{1 - {¯ α}_{t}}$	(117)
	$= \frac{\sqrt{α_{t}} (1 - {¯ α}_{t - 1}) x_{t} + (1 - α_{t}) \frac{x_{t} - \sqrt{1 - {¯ α}_{t}} ϵ_{0}}{\sqrt{α_{t}}}}{1 - {¯ α}_{t}}$	(118)
	$= \frac{\sqrt{α_{t}} (1 - {¯ α}_{t - 1}) x_{t}}{1 - {¯ α}_{t}} + \frac{(1 - α_{t}) x_{t}}{(1 - {¯ α}_{t}) \sqrt{α_{t}}} - \frac{(1 - α_{t}) \sqrt{1 - {¯ α}_{t}} ϵ_{0}}{(1 - {¯ α}_{t}) \sqrt{α_{t}}}$	(119)
	$= (\frac{\sqrt{α_{t}} (1 - {¯ α}_{t - 1})}{1 - {¯ α}_{t}} + \frac{1 - α_{t}}{(1 - {¯ α}_{t}) \sqrt{α_{t}}}) x_{t} - \frac{(1 - α_{t}) \sqrt{1 - {¯ α}_{t}}}{(1 - {¯ α}_{t}) \sqrt{α_{t}}} ϵ_{0}$	(120)
	$= (\frac{α_{t} (1 - {¯ α}_{t - 1})}{(1 - {¯ α}_{t}) \sqrt{α_{t}}} + \frac{1 - α_{t}}{(1 - {¯ α}_{t}) \sqrt{α_{t}}}) x_{t} - \frac{1 - α_{t}}{\sqrt{1 - {¯ α}_{t}} \sqrt{α_{t}}} ϵ_{0}$	(121)
	$= \frac{α_{t} - {¯ α}_{t} + 1 - α_{t}}{(1 - {¯ α}_{t}) \sqrt{α_{t}}} x_{t} - \frac{1 - α_{t}}{\sqrt{1 - {¯ α}_{t}} \sqrt{α_{t}}} ϵ_{0}$	(122)
	$= \frac{1 - {¯ α}_{t}}{(1 - {¯ α}_{t}) \sqrt{α_{t}}} x_{t} - \frac{1 - α_{t}}{\sqrt{1 - {¯ α}_{t}} \sqrt{α_{t}}} ϵ_{0}$	(123)
	$= \frac{1}{\sqrt{α_{t}}} x_{t} - \frac{1 - α_{t}}{\sqrt{1 - {¯ α}_{t}} \sqrt{α_{t}}} ϵ_{0}$	(124)

Therefore, we can set our approximate denoising transition mean $μ_{θ} (x_{t}, t)$ as:

μ_{θ} (x_{t}, t)

= \frac{1}{\sqrt{α_{t}}} x_{t} - \frac{1 - α_{t}}{\sqrt{1 - {¯ α}_{t}} \sqrt{α_{t}}} {^ϵ}_{θ} (x_{t}, t)

(125)

and the corresponding optimization problem becomes:

	$\arg\min_{θ} D_{KL} (q (x_{t - 1} | x_{t}, x_{0}) ∥ p_{θ} (x_{t - 1} | x_{t}))$
	$= \arg\min_{θ} D_{KL} (N (x_{t - 1}; μ_{q}, Σ_{q} (t)) ∥ N (x_{t - 1}; μ_{θ}, Σ_{q} (t)))$		(126)
	$= \arg\min_{θ} \frac{1}{2 σ_{q}^{2} (t)} [{∥ \frac{1}{\sqrt{α_{t}}} x_{t} - \frac{1 - α_{t}}{\sqrt{1 - {¯ α}_{t}} \sqrt{α_{t}}} {^ϵ}_{θ} (x_{t}, t) - \frac{1}{\sqrt{α_{t}}} x_{t} + \frac{1 - α_{t}}{\sqrt{1 - {¯ α}_{t}} \sqrt{α_{t}}} ϵ_{0} ∥}_{2}^{2}]$		(127)
	$= \arg\min_{θ} \frac{1}{2 σ_{q}^{2} (t)} [{∥ \frac{1 - α_{t}}{\sqrt{1 - {¯ α}_{t}} \sqrt{α_{t}}} ϵ_{0} - \frac{1 - α_{t}}{\sqrt{1 - {¯ α}_{t}} \sqrt{α_{t}}} {^ϵ}_{θ} (x_{t}, t) ∥}_{2}^{2}]$		(128)
	$= \arg\min_{θ} \frac{1}{2 σ_{q}^{2} (t)} [{∥ \frac{1 - α_{t}}{\sqrt{1 - {¯ α}_{t}} \sqrt{α_{t}}} (ϵ_{0} - {^ϵ}_{θ} (x_{t}, t)) ∥}_{2}^{2}]$		(129)
	$= \arg\min_{θ} \frac{1}{2 σ_{q}^{2} (t)} \frac{(1 - α_{t})^{2}}{(1 - {¯ α}_{t}) α_{t}} [{∥ ϵ_{0} - {^ϵ}_{θ} (x_{t}, t) ∥}_{2}^{2}]$		(130)

Here, ${^ϵ}_{θ} (x_{t}, t)$ is a neural network that learns to predict the source noise $ϵ_{0} \sim N (ϵ; 0, I)$ that determines $x_{t}$ from $x_{0}$ . We have therefore shown that learning a VDM by predicting the original image $x_{0}$ is equivalent to learning to predict the noise; empirically, however, some works have found that predicting the noise resulted in better performance [3, 12].
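The equivalence of the two parameterizations of the mean can be checked on a scalar example. The sketch below, using illustrative values only, builds $x_{t}$ from a chosen $x_{0}$ and $ϵ_{0}$ via Equation 69 and confirms that Equation 93 (in terms of $x_{0}$ ) and Equation 124 (in terms of $ϵ_{0}$ ) give the same $μ_{q}$ .

```python
import math

def mu_q_from_x0(x_t, x_0, alpha_t, alpha_bar_prev):
    """Eq. 93: posterior mean written in terms of x_0."""
    alpha_bar_t = alpha_t * alpha_bar_prev
    return (math.sqrt(alpha_t) * (1 - alpha_bar_prev) * x_t
            + math.sqrt(alpha_bar_prev) * (1 - alpha_t) * x_0) / (1 - alpha_bar_t)

def mu_q_from_eps(x_t, eps_0, alpha_t, alpha_bar_prev):
    """Eq. 124: the same mean written in terms of the source noise."""
    alpha_bar_t = alpha_t * alpha_bar_prev
    coeff = (1 - alpha_t) / (math.sqrt(1 - alpha_bar_t) * math.sqrt(alpha_t))
    return x_t / math.sqrt(alpha_t) - coeff * eps_0

x_0, eps_0 = 1.5, -0.7                  # illustrative values
alpha_t, alpha_bar_prev = 0.9, 0.6
alpha_bar_t = alpha_t * alpha_bar_prev
x_t = math.sqrt(alpha_bar_t) * x_0 + math.sqrt(1 - alpha_bar_t) * eps_0   # Eq. 69
print(mu_q_from_x0(x_t, x_0, alpha_t, alpha_bar_prev))
print(mu_q_from_eps(x_t, eps_0, alpha_t, alpha_bar_prev))
```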

To derive the third common interpretation of Variational Diffusion Models, we appeal to Tweedie’s Formula [2]. In English, Tweedie’s Formula states that the true mean of an exponential family distribution, given samples drawn from it, can be estimated by the maximum likelihood estimate of the samples (aka empirical mean) plus some correction term involving the score of the estimate. In the case of just one observed sample, the empirical mean is just the sample itself. It is commonly used to mitigate sample bias; if observed samples all lie on one end of the underlying distribution, then the negative score becomes large and corrects the naive maximum likelihood estimate of the samples towards the true mean.

Mathematically, for a Gaussian variable $z \sim N (z; μ_{z}, Σ_{z})$ , Tweedie’s Formula states that:

E [μ_{z} | z] = z + Σ_{z} \nabla_{z} log p (z)

In this case, we apply it to predict the true posterior mean of $x_{t}$ given its samples. From Equation 70, we know that:

q (x_{t} | x_{0}) = N (x_{t}; \sqrt{{¯ α}_{t}} x_{0}, (1 - {¯ α}_{t}) I)

Then, by Tweedie’s Formula, we have:

E [μ_{x_{t}} | x_{t}] = x_{t} + (1 - {¯ α}_{t}) \nabla_{x_{t}} log p (x_{t})

(131)

where we write $\nabla_{x_{t}} log p (x_{t})$ as $\nabla log p (x_{t})$ for notational simplicity. According to Tweedie’s Formula, the best estimate for the true mean that $x_{t}$ is generated from, $μ_{x_{t}} = \sqrt{{¯ α}_{t}} x_{0}$ , is defined as:

	$\sqrt{{¯ α}_{t}} x_{0} = x_{t} + (1 - {¯ α}_{t}) \nabla log p (x_{t})$		(132)
	$∴ x_{0} = \frac{x_{t} + (1 - {¯ α}_{t}) \nabla log p (x_{t})}{\sqrt{{¯ α}_{t}}}$		(133)
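Equation 133 can be verified in a case where everything is available in closed form. The sketch below assumes, purely for illustration, a scalar Gaussian data distribution $x_{0} \sim N (m, s^{2})$ ; then $p (x_{t})$ is also Gaussian, its score is known exactly, and the Tweedie estimate can be compared with the posterior mean $E [x_{0} | x_{t}]$ obtained from standard Gaussian conditioning.

```python
import math

def tweedie_x0_estimate(x_t, alpha_bar_t, score):
    """Eq. 133: estimate of x_0 from x_t and the score of p(x_t)."""
    return (x_t + (1 - alpha_bar_t) * score) / math.sqrt(alpha_bar_t)

def gaussian_marginal_score(x_t, alpha_bar_t, m, s2):
    """Score of p(x_t) when x_0 ~ N(m, s2), so that
    p(x_t) = N(sqrt(alpha_bar_t) * m, alpha_bar_t * s2 + 1 - alpha_bar_t)."""
    var = alpha_bar_t * s2 + 1 - alpha_bar_t
    return -(x_t - math.sqrt(alpha_bar_t) * m) / var

def gaussian_posterior_mean(x_t, alpha_bar_t, m, s2):
    """E[x_0 | x_t] from standard Gaussian conditioning."""
    var = alpha_bar_t * s2 + 1 - alpha_bar_t
    return m + s2 * math.sqrt(alpha_bar_t) / var * (x_t - math.sqrt(alpha_bar_t) * m)

x_t, alpha_bar_t, m, s2 = 0.4, 0.5, 2.0, 0.25   # illustrative values
score = gaussian_marginal_score(x_t, alpha_bar_t, m, s2)
print(tweedie_x0_estimate(x_t, alpha_bar_t, score))
print(gaussian_posterior_mean(x_t, alpha_bar_t, m, s2))
```

In this toy setting the Tweedie estimate is the posterior mean of $x_{0}$ , not $x_{0}$ itself, which is why a learned score yields an estimate of the clean sample and not an exact recovery.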

Then, we can plug Equation 133 into our ground-truth denoising transition mean $μ_{q} (x_{t}, x_{0})$ once again and derive a new form:

$μ_{q} (x_{t}, x_{0})$	$= \frac{\sqrt{α_{t}} (1 - {¯ α}_{t - 1}) x_{t} + \sqrt{{¯ α}_{t - 1}} (1 - α_{t}) x_{0}}{1 - {¯ α}_{t}}$	(134)
	$= \frac{\sqrt{α_{t}} (1 - {¯ α}_{t - 1}) x_{t} + \sqrt{{¯ α}_{t - 1}} (1 - α_{t}) \frac{x_{t} + (1 - {¯ α}_{t}) \nabla log p (x_{t})}{\sqrt{{¯ α}_{t}}}}{1 - {¯ α}_{t}}$	(135)
	$= \frac{\sqrt{α_{t}} (1 - {¯ α}_{t - 1}) x_{t} + (1 - α_{t}) \frac{x_{t} + (1 - {¯ α}_{t}) \nabla log p (x_{t})}{\sqrt{α_{t}}}}{1 - {¯ α}_{t}}$	(136)
	$= \frac{\sqrt{α_{t}} (1 - {¯ α}_{t - 1}) x_{t}}{1 - {¯ α}_{t}} + \frac{(1 - α_{t}) x_{t}}{(1 - {¯ α}_{t}) \sqrt{α_{t}}} + \frac{(1 - α_{t}) (1 - {¯ α}_{t}) \nabla log p (x_{t})}{(1 - {¯ α}_{t}) \sqrt{α_{t}}}$	(137)
	$= (\frac{\sqrt{α_{t}} (1 - {¯ α}_{t - 1})}{1 - {¯ α}_{t}} + \frac{1 - α_{t}}{(1 - {¯ α}_{t}) \sqrt{α_{t}}}) x_{t} + \frac{1 - α_{t}}{\sqrt{α_{t}}} \nabla log p (x_{t})$	(138)
	$= (\frac{α_{t} (1 - {¯ α}_{t - 1})}{(1 - {¯ α}_{t}) \sqrt{α_{t}}} + \frac{1 - α_{t}}{(1 - {¯ α}_{t}) \sqrt{α_{t}}}) x_{t} + \frac{1 - α_{t}}{\sqrt{α_{t}}} \nabla log p (x_{t})$	(139)
	$= \frac{α_{t} - {¯ α}_{t} + 1 - α_{t}}{(1 - {¯ α}_{t}) \sqrt{α_{t}}} x_{t} + \frac{1 - α_{t}}{\sqrt{α_{t}}} \nabla log p (x_{t})$	(140)
	$= \frac{1 - {¯ α}_{t}}{(1 - {¯ α}_{t}) \sqrt{α_{t}}} x_{t} + \frac{1 - α_{t}}{\sqrt{α_{t}}} \nabla log p (x_{t})$	(141)
	$= \frac{1}{\sqrt{α_{t}}} x_{t} + \frac{1 - α_{t}}{\sqrt{α_{t}}} \nabla log p (x_{t})$	(142)

Therefore, we can also set our approximate denoising transition mean $μ_{θ} (x_{t}, t)$ as:

μ_{θ} (x_{t}, t)

= \frac{1}{\sqrt{α_{t}}} x_{t} + \frac{1 - α_{t}}{\sqrt{α_{t}}} s_{θ} (x_{t}, t)

(143)

and the corresponding optimization problem becomes:

	$\arg\min_{θ} D_{KL} (q (x_{t - 1} | x_{t}, x_{0}) ∥ p_{θ} (x_{t - 1} | x_{t}))$
	$= \arg\min_{θ} D_{KL} (N (x_{t - 1}; μ_{q}, Σ_{q} (t)) ∥ N (x_{t - 1}; μ_{θ}, Σ_{q} (t)))$		(144)
	$= \arg\min_{θ} \frac{1}{2 σ_{q}^{2} (t)} [{∥ \frac{1}{\sqrt{α_{t}}} x_{t} + \frac{1 - α_{t}}{\sqrt{α_{t}}} s_{θ} (x_{t}, t) - \frac{1}{\sqrt{α_{t}}} x_{t} - \frac{1 - α_{t}}{\sqrt{α_{t}}} \nabla log p (x_{t}) ∥}_{2}^{2}]$		(145)
	$= \arg\min_{θ} \frac{1}{2 σ_{q}^{2} (t)} [{∥ \frac{1 - α_{t}}{\sqrt{α_{t}}} s_{θ} (x_{t}, t) - \frac{1 - α_{t}}{\sqrt{α_{t}}} \nabla log p (x_{t}) ∥}_{2}^{2}]$		(146)
	$= \arg\min_{θ} \frac{1}{2 σ_{q}^{2} (t)} [{∥ \frac{1 - α_{t}}{\sqrt{α_{t}}} (s_{θ} (x_{t}, t) - \nabla log p (x_{t})) ∥}_{2}^{2}]$		(147)
	$= \arg\min_{θ} \frac{1}{2 σ_{q}^{2} (t)} \frac{(1 - α_{t})^{2}}{α_{t}} [{∥ s_{θ} (x_{t}, t) - \nabla log p (x_{t}) ∥}_{2}^{2}]$		(148)

Here, $s_{θ} (x_{t}, t)$ is a neural network that learns to predict the score function $\nabla_{x_{t}} log p (x_{t})$ , which is the gradient of the log-density $log p (x_{t})$ with respect to $x_{t}$ in data space, for any arbitrary noise level $t$ .

The astute reader will notice that the score function $\nabla log p (x_{t})$ looks remarkably similar in form to the source noise $ϵ_{0}$ . This can be shown explicitly by combining Tweedie’s Formula (Equation 133) with the reparameterization trick (Equation 115):

$x_{0} = \frac{x_{t} + (1 - {¯ α}_{t}) \nabla log p (x_{t})}{\sqrt{{¯ α}_{t}}}$	$= \frac{x_{t} - \sqrt{1 - {¯ α}_{t}} ϵ_{0}}{\sqrt{{¯ α}_{t}}}$	(149)
$∴ (1 - {¯ α}_{t}) \nabla log p (x_{t})$	$= - \sqrt{1 - {¯ α}_{t}} ϵ_{0}$	(150)
$\nabla log p (x_{t})$	$= - \frac{1}{\sqrt{1 - {¯ α}_{t}}} ϵ_{0}$	(151)

As it turns out, the two terms are off by a constant factor that scales with time! The score function measures how to move in data space to maximize the log probability; intuitively, since the source noise is added to a natural image to corrupt it, moving in its opposite direction "denoises" the image and would be the best update to increase the subsequent log probability. Our mathematical proof justifies this intuition; we have explicitly shown that learning to model the score function is equivalent to modeling the negative of the source noise (up to a scaling factor).
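The relation in Equation 151 can be illustrated for a single fixed $x_{0}$ . Note the assumption: the sketch uses the score of the conditional Gaussian $q (x_{t} | x_{0})$ from Equation 70, for which the identity holds exactly, whereas the text works with the marginal score. It also confirms that the noise-based mean (Equation 125) and the score-based mean (Equation 143) coincide when the score is set to $- ϵ_{0} / \sqrt{1 - {¯ α}_{t}}$ .

```python
import math

def conditional_score(x_t, x_0, alpha_bar_t):
    """Gradient w.r.t. x_t of log N(x_t; sqrt(alpha_bar_t) x_0, 1 - alpha_bar_t)."""
    return -(x_t - math.sqrt(alpha_bar_t) * x_0) / (1 - alpha_bar_t)

def mean_from_eps(x_t, eps, alpha_t, alpha_bar_t):
    """Eq. 125 with the true noise in place of the network output."""
    coeff = (1 - alpha_t) / (math.sqrt(1 - alpha_bar_t) * math.sqrt(alpha_t))
    return x_t / math.sqrt(alpha_t) - coeff * eps

def mean_from_score(x_t, score, alpha_t):
    """Eq. 143 with a given score in place of the network output."""
    return x_t / math.sqrt(alpha_t) + (1 - alpha_t) / math.sqrt(alpha_t) * score

x_0, eps_0, alpha_t, alpha_bar_t = 1.5, -0.7, 0.9, 0.54   # illustrative values
x_t = math.sqrt(alpha_bar_t) * x_0 + math.sqrt(1 - alpha_bar_t) * eps_0   # Eq. 69
score = conditional_score(x_t, x_0, alpha_bar_t)
print(score, -eps_0 / math.sqrt(1 - alpha_bar_t))
print(mean_from_eps(x_t, eps_0, alpha_t, alpha_bar_t), mean_from_score(x_t, score, alpha_t))
```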

We have therefore derived three equivalent objectives to optimize a VDM: learning a neural network to predict the original image $x_{0}$ , the source noise $ϵ_{0}$ , or the score of the image at an arbitrary noise level $\nabla log p (x_{t})$ . The VDM can be scalably trained by stochastically sampling timesteps $t$ and minimizing the norm of the prediction with the ground truth target.

Score-based Generative Models

We have shown that a Variational Diffusion Model can be learned simply by optimizing a neural network $s_{θ} (x_{t}, t)$ to predict the score function $\nabla log p (x_{t})$ . However, in our derivation, the score term arrived from an application of Tweedie’s Formula; this doesn’t necessarily provide us with great intuition or insight into what exactly the score function is or why it is worth modeling. Fortunately, we can look to another class of generative models, Score-based Generative Models [16, 20, 17], for exactly this intuition. As it turns out, we can show that the VDM formulation we have previously derived has an equivalent Score-based Generative Modeling formulation, allowing us to flexibly switch between these two interpretations at will.

To begin to understand why optimizing a score function makes sense, we take a detour and revisit energy-based models [10, 19]. Arbitrarily flexible probability distributions can be written in the form:

p_{θ} (x) = \frac{1}{Z_{θ}} e^{- f_{θ} (x)}

(152)

where $f_{θ} (x)$ is an arbitrarily flexible, parameterizable function called the energy function, often modeled by a neural network, and $Z_{θ}$ is a normalizing constant to ensure that $\int p_{θ} (x) d x = 1$ . One way to learn such a distribution is maximum likelihood; however, this requires tractably computing the normalizing constant $Z_{θ} = \int e^{- f_{θ} (x)} d x$ , which may not be possible for complex $f_{θ} (x)$ functions.

Figure 6: Visualization of three random sampling trajectories generated with Langevin dynamics, all starting from the same initialization point, for a Mixture of Gaussians. The left figure plots these sampling trajectories on a three-dimensional contour, while the right figure plots the sampling trajectories against the ground-truth score function. From the same initialization point, we are able to generate samples from different modes due to the stochastic noise term in the Langevin dynamics sampling procedure; without it, sampling from a fixed point would always deterministically follow the score to the same mode every trial.

One way to avoid calculating or modeling the normalization constant is by using a neural network $s_{θ} (x)$ to learn the score function $\nabla log p (x)$ of distribution $p (x)$ instead. This is motivated by the observation that taking the derivative of the log of both sides of Equation 152 yields:

$\nabla_{x} log p_{θ} (x)$	$= \nabla_{x} log (\frac{1}{Z_{θ}} e^{- f_{θ} (x)})$	(153)
	$= \nabla_{x} log \frac{1}{Z_{θ}} + \nabla_{x} log e^{- f_{θ} (x)}$	(154)
	$= - \nabla_{x} f_{θ} (x)$	(155)
	$\approx s_{θ} (x)$	(156)

which can be freely represented as a neural network without involving any normalization constants. The score model can be optimized by minimizing the Fisher Divergence with the ground truth score function:

E_{p (x)} [{∥ s_{θ} (x) - \nabla log p (x) ∥}_{2}^{2}]

(157)
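The observation in Equations 153 to 155, that the normalizing constant drops out of the score, can be seen with a small numerical experiment. The sketch below assumes a hypothetical quadratic energy, for which $Z_{θ}$ happens to be known, and differentiates the normalized and unnormalized log-densities by central finite differences.

```python
import math

MU, SIGMA = 1.0, 2.0   # parameters of a hypothetical quadratic energy

def energy(x):
    """f(x) = (x - mu)^2 / (2 sigma^2); exp(-f) is an unnormalized Gaussian."""
    return (x - MU) ** 2 / (2 * SIGMA ** 2)

def log_unnormalized(x):
    return -energy(x)

def log_normalized(x):
    log_z = 0.5 * math.log(2 * math.pi * SIGMA ** 2)   # log Z for this energy
    return -energy(x) - log_z

def finite_diff(fn, x, h=1e-5):
    """Central finite-difference approximation of the derivative of fn at x."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

x = 0.3
print(finite_diff(log_normalized, x), finite_diff(log_unnormalized, x), -(x - MU) / SIGMA ** 2)
```

All three printed values should agree closely: the score equals $- \nabla_{x} f_{θ} (x)$ regardless of whether $Z_{θ}$ is known.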

What does the score function represent? For every $x$ , taking the gradient of its log likelihood with respect to $x$ essentially describes what direction in data space to move in order to further increase its likelihood. Intuitively, then, the score function defines a vector field over the entire space that data $x$ inhabits, pointing towards the modes. Visually, this is depicted in the right plot of Figure 6. Then, by learning the score function of the true data distribution, we can generate samples by starting at any arbitrary point in the same space and iteratively following the score until a mode is reached. This sampling procedure is known as Langevin dynamics, and is mathematically described as:

x_{i + 1} \leftarrow x_{i} + c \nabla log p (x_{i}) + \sqrt{2 c} ϵ, i = 0, 1, \dots, K

(158)

where $x_{0}$ is randomly sampled from a prior distribution (such as uniform), and $ϵ \sim N (ϵ; 0, I)$ is an extra noise term to ensure that the generated samples do not always collapse onto a mode, but hover around it for diversity. Furthermore, because the learned score function is deterministic, sampling with a noise term involved adds stochasticity to the generative process, allowing us to avoid deterministic trajectories. This is particularly useful when sampling is initialized from a position that lies between multiple modes. A visual depiction of Langevin dynamics sampling and the benefits of the noise term is shown in Figure 6.
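A minimal sketch of the update in Equation 158 for scalar $x$ follows. It assumes the score is available analytically, here for a hypothetical two-component mixture of Gaussians, so that no learned model is needed; the step size, number of steps, and mixture parameters are illustrative choices.

```python
import math
import random

def langevin_sample(score, x_init, step, n_steps, rng):
    """Iterate Eq. 158 for scalar x: x <- x + c * score(x) + sqrt(2c) * eps."""
    x = x_init
    for _ in range(n_steps):
        x = x + step * score(x) + math.sqrt(2 * step) * rng.gauss(0.0, 1.0)
    return x

def mixture_score(x, weights=(0.5, 0.5), means=(-2.0, 2.0), std=0.5):
    """Analytic score of a hypothetical 1-D mixture of Gaussians with shared std."""
    dens = [w * math.exp(-(x - m) ** 2 / (2 * std ** 2)) for w, m in zip(weights, means)]
    total = sum(dens)
    return sum(d * (-(x - m) / std ** 2) for d, m in zip(dens, means)) / total

rng = random.Random(0)
# Every chain starts from the same point, midway between the two modes.
samples = [langevin_sample(mixture_score, 0.0, 0.01, 300, rng) for _ in range(200)]
left = sum(1 for s in samples if s < 0)
print(left, len(samples) - left)
```

Because of the noise term, chains started from the same point end up near different modes; setting the noise to zero would leave every chain at the initial point, where the score of this symmetric mixture vanishes.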

Note that the objective in Equation 157 relies on having access to the ground truth score function, which is unavailable to us for complex distributions such as the one modeling natural images. Fortunately, alternative techniques known as score matching [6, 13, 18, 21] have been derived to minimize this Fisher divergence without knowing the ground truth score, and can be optimized with stochastic gradient descent.

Collectively, learning to represent a distribution as a score function and using it to generate samples through Markov Chain Monte Carlo techniques, such as Langevin dynamics, is known as Score-based Generative Modeling [16, 20, 17].

There are three main problems with vanilla score matching, as detailed by Song and Ermon [16]. Firstly, the score function is ill-defined when $x$ lies on a low-dimensional manifold in a high-dimensional space. This can be seen mathematically; all points not on the low-dimensional manifold would have probability zero, the log of which is undefined. This is particularly inconvenient when trying to learn a generative model over natural images, which are known to lie on a low-dimensional manifold of the entire ambient space.

Secondly, the estimated score function trained via vanilla score matching will not be accurate in low density regions. This is evident from the objective we minimize in Equation 157. Because it is an expectation over $p (x)$ , and explicitly trained on samples from it, the model will not receive an accurate learning signal for rarely seen or unseen examples. This is problematic, since our sampling strategy involves starting from a random location in the high-dimensional space, which is most likely random noise, and moving according to the learned score function. Since we are following a noisy or inaccurate score estimate, the final generated samples may be suboptimal as well, or require many more iterations to converge on an accurate output.

Lastly, Langevin dynamics sampling may not mix, even if it is performed using the ground truth scores. Suppose that the true data distribution is a mixture of two disjoint distributions:

p (x) = c_{1} p_{1} (x) + c_{2} p_{2} (x)

(159)

Then, when the score is computed, these mixing coefficients are lost, since the log operation splits the coefficient from the distribution and the gradient operation zeros it out. To visualize this, note that the ground truth score function shown in the right Figure 6 is agnostic of the different weights between the three distributions; Langevin dynamics sampling from the depicted initialization point has a roughly equal chance of arriving at each mode, despite the bottom right mode having a higher weight in the actual Mixture of Gaussians.
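The loss of the mixing coefficients can be made concrete. The sketch below assumes a hypothetical one-dimensional mixture of two unit-variance Gaussians placed far apart, so that they are effectively disjoint, and evaluates the exact mixture score near one mode under two very different weightings.

```python
import math

def mixture_score(x, c1, c2, m1=-10.0, m2=10.0):
    """Score of c1 * N(m1, 1) + c2 * N(m2, 1) in one dimension."""
    d1 = c1 * math.exp(-(x - m1) ** 2 / 2)
    d2 = c2 * math.exp(-(x - m2) ** 2 / 2)
    return (d1 * -(x - m1) + d2 * -(x - m2)) / (d1 + d2)

near_left_mode = -9.0
print(mixture_score(near_left_mode, 0.1, 0.9))
print(mixture_score(near_left_mode, 0.9, 0.1))
```

Near the left mode, the score is essentially that of the left component alone, $- (x - m_{1})$ , whichever weight the component carries, so a sampler that only follows the score has no information about how often it should visit each mode.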

It turns out that these three drawbacks can be simultaneously addressed by adding multiple levels of Gaussian noise to the data. Firstly, as the support of a Gaussian noise distribution is the entire space, a perturbed data sample will no longer be confined to a low-dimensional manifold. Secondly, adding large Gaussian noise will increase the area each mode covers in the data distribution, adding more training signal in low density regions. Lastly, adding multiple levels of Gaussian noise with increasing variance will result in intermediate distributions that respect the ground truth mixing coefficients.

Formally, we can choose a positive sequence of noise levels $\{σ_{t}\}_{t = 1}^{T}$ and define a sequence of progressively perturbed data distributions:

p_{σ_{t}} (x_{t}) = \int p (x) N (x_{t}; x, σ_{t}^{2} I) d x

(160)

Then, a neural network $s_{θ} (x, t)$ is learned using score matching to learn the score function for all noise levels simultaneously:

\arg\min_{θ} \sum_{t = 1}^{T} λ (t) E_{p_{σ_{t}} (x_{t})} [{∥ s_{θ} (x, t) - \nabla log p_{σ_{t}} (x_{t}) ∥}_{2}^{2}]

(161)

where $λ (t)$ is a positive weighting function that conditions on noise level $t$ . Note that this objective almost exactly matches the objective derived in Equation 148 to train a Variational Diffusion Model. Furthermore, the authors propose annealed Langevin dynamics sampling as a generative procedure, in which samples are produced by running Langevin dynamics for each $t = T, T - 1, . . ., 2, 1$ in sequence. The initialization is chosen from some fixed prior (such as uniform), and each subsequent sampling step starts from the final samples of the previous simulation. Because the noise levels steadily decrease over timesteps $t$ , and we reduce the step size over time, the samples eventually converge into a true mode. This is directly analogous to the sampling procedure performed in the Markovian HVAE interpretation of a Variational Diffusion Model, where a randomly initialized data vector is iteratively refined over decreasing noise levels.
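Annealed Langevin dynamics can be sketched in one dimension when the perturbed scores are known exactly. The example below assumes Gaussian data $N (m, s^{2})$ , so that by Equation 160 each perturbed distribution is $N (m, s^{2} + σ_{t}^{2})$ . The geometric noise schedule and the step size proportional to $σ_{t}^{2}$ are illustrative assumptions and are not specified in the text above.

```python
import math
import random

def perturbed_score(x, sigma, m=3.0, s=0.5):
    """Score of p_sigma when the data are N(m, s^2): p_sigma = N(m, s^2 + sigma^2)."""
    return -(x - m) / (s ** 2 + sigma ** 2)

def annealed_langevin(score, sigmas, base_step, n_steps, rng):
    """Run Langevin dynamics at each noise level, from the largest sigma to the
    smallest, warm-starting each level from the previous level's final sample."""
    x = rng.uniform(-10.0, 10.0)                       # draw from a fixed uniform prior
    for sigma in sigmas:
        step = base_step * (sigma / sigmas[-1]) ** 2   # assumed step-size schedule
        for _ in range(n_steps):
            x = x + step * score(x, sigma) + math.sqrt(2 * step) * rng.gauss(0.0, 1.0)
    return x

sigmas = [10.0 * (0.01 ** (i / 9)) for i in range(10)]   # geometric, 10.0 down to 0.1
rng = random.Random(0)
samples = [annealed_langevin(perturbed_score, sigmas, 0.005, 50, rng) for _ in range(300)]
print(sum(samples) / len(samples))
```

The sample mean should land close to the data mean $m = 3$ even though each chain starts from a uniformly random point, since the early, high-noise levels pull samples toward the bulk of the distribution before the low-noise levels refine them.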

Therefore, we have established an explicit connection between Variational Diffusion Models and Score-based Generative Models, both in their training objectives and sampling procedures.

One question is how to naturally generalize diffusion models to an infinite number of timesteps. Under the Markovian HVAE view, this can be interpreted as extending the number of hierarchies to infinity $T \to \infty$ . It is clearer to represent this from the equivalent score-based generative model perspective; under an infinite number of noise scales, the perturbation of an image over continuous time can be represented as a stochastic process, and therefore described by a stochastic differential equation (SDE). Sampling is then performed by reversing the SDE, which naturally requires estimating the score function at each continuous-valued noise level [20]. Different parameterizations of the SDE essentially describe different perturbation schemes over time, enabling flexible modeling of the noising procedure [8].

Guidance

So far, we have focused on modeling just the data distribution $p (x)$ . However, we are often also interested in learning conditional distribution $p (x | y)$ , which would enable us to explicitly control the data we generate through conditioning information $y$ . This forms the backbone of image super-resolution models such as Cascaded Diffusion Models [4], as well as state-of-the-art image-text models such as DALL-E 2 [11] and Imagen [12].

A natural way to add conditioning information is simply alongside the timestep information, at each iteration. Recall our joint distribution from Equation 32:

p (x_{0 : T}) = p (x_{T}) \prod_{t = 1}^{T} p_{θ} (x_{t - 1} | x_{t})

Then, to turn this into a conditional diffusion model, we can simply add arbitrary conditioning information $y$ at each transition step as:

$p (x_{0:T} | y) = p (x_{T}) \prod_{t=1}^{T} p_{θ} (x_{t-1} | x_{t}, y)$	(162)

For example, $y$ could be a text encoding in image-text generation, or a low-resolution image to perform super-resolution on. We are thus able to learn the core neural networks of a VDM as before, by predicting $\hat{x}_{θ} (x_{t}, t, y) \approx x_{0}$, $\hat{ϵ}_{θ} (x_{t}, t, y) \approx ϵ_{0}$, or $s_{θ} (x_{t}, t, y) \approx \nabla \log p (x_{t} | y)$ for each desired interpretation and implementation.

A caveat of this vanilla formulation is that a conditional diffusion model trained in this way may learn to ignore or downplay the conditioning information it is given. Guidance is therefore proposed as a way to more explicitly control the amount of weight the model gives to the conditioning information, at the cost of sample diversity. The two most popular forms of guidance are known as Classifier Guidance [20, 1] and Classifier-Free Guidance [5].

Classifier Guidance

Let us begin with the score-based formulation of a diffusion model, where our goal is to learn $\nabla \log p (x_{t} | y)$, the score of the conditional model, at arbitrary noise levels $t$. Recall that $\nabla$ is shorthand for $\nabla_{x_{t}}$ in the interest of brevity. By Bayes' rule, we can derive the following equivalent form:

$\nabla \log p (x_{t} | y)$	$= \nabla \log \left( \frac{p (x_{t}) p (y | x_{t})}{p (y)} \right)$	(163)
	$= \nabla \log p (x_{t}) + \nabla \log p (y | x_{t}) - \nabla \log p (y)$	(164)
	$= \underbrace{\nabla \log p (x_{t})}_{\text{unconditional score}} + \underbrace{\nabla \log p (y | x_{t})}_{\text{adversarial gradient}}$	(165)

where we have leveraged the fact that the gradient of $log p (y)$ with respect to $x_{t}$ is zero.

Our final derived result can be interpreted as learning an unconditional score function combined with the adversarial gradient of a classifier $p (y | x_{t})$ . Therefore, in Classifier Guidance [20, 1], the score of an unconditional diffusion model is learned as previously derived, alongside a classifier that takes in arbitrary noisy $x_{t}$ and attempts to predict conditional information $y$ . Then, during the sampling procedure, the overall conditional score function used for annealed Langevin dynamics is computed as the sum of the unconditional score function and the adversarial gradient of the noisy classifier.

To introduce fine-grained control over how strongly the model considers the conditioning information, Classifier Guidance scales the adversarial gradient of the noisy classifier by a hyperparameter $γ$. The score function learned under Classifier Guidance can then be summarized as:

$\nabla \log p (x_{t} | y) = \nabla \log p (x_{t}) + γ \nabla \log p (y | x_{t})$	(166)

Intuitively, when $γ = 0$ the conditional diffusion model learns to ignore the conditioning information entirely, and when $γ$ is large the conditional diffusion model learns to produce samples that heavily adhere to the conditioning information. This comes at the cost of sample diversity, as the model would only produce data from which the provided conditioning information is easy to recover, even at high noise levels.
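The guided score of Equation 166 can be checked numerically on a toy problem. The sketch below is illustrative only: it assumes two equally likely classes with one-dimensional Gaussian class-conditionals at a single noise level, uses analytic expressions in place of the learned score model and noisy classifier, and all names are hypothetical.

```python
import math

# Assumed toy setup: p(x | y) = N(MEANS[y], VAR) with equal class priors.
MEANS = {0: -2.0, 1: 2.0}
VAR = 1.0


def log_gauss(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


def classifier_log_prob(x, y):
    """log p(y | x), standing in for a classifier trained on noisy inputs."""
    logits = [log_gauss(x, m, VAR) for m in MEANS.values()]
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_gauss(x, MEANS[y], VAR) - log_norm


def unconditional_score(x):
    """Score of the two-component mixture p(x)."""
    return sum(math.exp(classifier_log_prob(x, y)) * (-(x - MEANS[y]) / VAR)
               for y in MEANS)


def classifier_gradient(x, y, h=1e-5):
    """Central finite difference of log p(y | x) with respect to x."""
    return (classifier_log_prob(x + h, y) - classifier_log_prob(x - h, y)) / (2 * h)


def guided_score(x, y, gamma):
    """Unconditional score plus the gamma-scaled classifier gradient (Equation 166)."""
    return unconditional_score(x) + gamma * classifier_gradient(x, y)


x = 0.5
print(guided_score(x, 1, 0.0))  # equals the unconditional score
print(guided_score(x, 1, 1.0))  # recovers the exact conditional score
print(guided_score(x, 1, 3.0))  # pushes harder toward class 1
```

With $γ = 1$ the sum reproduces the conditional score of Equation 165, while larger values push samples further into the region the classifier assigns to $y$.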

One noted drawback of Classifier Guidance is its reliance on a separately learned classifier. Because the classifier must handle arbitrarily noisy inputs, which most existing pretrained classification models are not optimized to do, it must be learned ad hoc alongside the diffusion model.

Classifier-Free Guidance

In Classifier-Free Guidance [5], the authors ditch the training of a separate classifier model in favor of an unconditional diffusion model and a conditional diffusion model. To derive the score function under Classifier-Free Guidance, we can first rearrange Equation 165 to show that:

$\nabla \log p (y | x_{t}) = \nabla \log p (x_{t} | y) - \nabla \log p (x_{t})$	(167)

Then, substituting this into Equation 166, we get:

$\nabla \log p (x_{t} | y)$	$= \nabla \log p (x_{t}) + γ \left( \nabla \log p (x_{t} | y) - \nabla \log p (x_{t}) \right)$	(168)
	$= \nabla \log p (x_{t}) + γ \nabla \log p (x_{t} | y) - γ \nabla \log p (x_{t})$	(169)
	$= \underbrace{γ \nabla \log p (x_{t} | y)}_{\text{conditional score}} + \underbrace{(1 - γ) \nabla \log p (x_{t})}_{\text{unconditional score}}$	(170)

Once again, $γ$ is a term that controls how much our learned conditional model cares about the conditioning information. When $γ = 0$, the learned conditional model completely ignores the conditioner and learns an unconditional diffusion model. When $γ = 1$, the model explicitly learns the vanilla conditional distribution without guidance. When $γ > 1$, the diffusion model not only prioritizes the conditional score function, but also moves in the direction away from the unconditional score function. In other words, it reduces the probability of generating samples that do not use conditioning information, in favor of the samples that explicitly do. This also has the effect of generating samples that accurately match the conditioning information, at the cost of decreased sample diversity.
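The combination in Equation 170 amounts to a weighted sum of two score estimates. A minimal sketch follows, in which the function name and the example vectors are hypothetical and the two inputs stand in for the outputs of the conditional and unconditional models at the same $x_{t}$ and $t$:

```python
def classifier_free_score(cond_score, uncond_score, gamma):
    """Blend two score vectors as gamma * cond + (1 - gamma) * uncond (Equation 170)."""
    return [gamma * c + (1.0 - gamma) * u
            for c, u in zip(cond_score, uncond_score)]


# Made-up score vectors for illustration.
cond = [1.0, -2.0]
uncond = [0.5, 0.0]

print(classifier_free_score(cond, uncond, 0.0))  # the unconditional score
print(classifier_free_score(cond, uncond, 1.0))  # the conditional score
print(classifier_free_score(cond, uncond, 2.0))  # extrapolates away from uncond
```

For $γ > 1$ the weight on the unconditional score becomes negative, which is the "moving away" behavior described above.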

Because learning two separate diffusion models is expensive, we can learn both the conditional and unconditional diffusion models together as a single conditional model; the unconditional diffusion model can be queried by replacing the conditioning information with fixed constant values, such as zeros. During training, this amounts to performing random dropout on the conditioning information. Classifier-Free Guidance is elegant because it gives us greater control over our conditional generation procedure while requiring nothing beyond the training of a single diffusion model.
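The dropout on conditioning information can be sketched as a small preprocessing step applied to each training batch. This is an illustration under assumptions: the function name, the `p_uncond` parameter, and the choice of an all-zeros vector as the fixed constant are hypothetical, and the diffusion model itself is omitted.

```python
import random


def drop_conditioning(batch, p_uncond, rng):
    """Replace each conditioning vector with zeros with probability p_uncond."""
    return [[0.0] * len(y) if rng.random() < p_uncond else list(y)
            for y in batch]


rng = random.Random(0)
batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Entries replaced by zeros train the unconditional model; the rest train
# the conditional model, all within the same network.
print(drop_conditioning(batch, 0.2, rng))
```

At sampling time, the same network would be queried twice per step, once with the real $y$ and once with the zero vector, to obtain the two scores that guidance combines.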

Closing

Allow us to recapitulate our findings over the course of our explorations. First, we derive Variational Diffusion Models as a special case of a Markovian Hierarchical Variational Autoencoder, where three key assumptions enable tractable computation and scalable optimization of the ELBO. We then prove that optimizing a VDM boils down to learning a neural network to predict one of three potential objectives: the original source image from any arbitrary noisification of it, the original source noise from any arbitrarily noisified image, or the score function of a noisified image at any arbitrary noise level. Then, we dive deeper into what it means to learn the score function, and connect it explicitly with the perspective of Score-based Generative Modeling. Lastly, we cover how to learn a conditional distribution using diffusion models.

In summary, diffusion models have shown incredible capabilities as generative models; indeed, they power the current state-of-the-art models on text-conditioned image generation such as Imagen and DALL-E 2. Furthermore, the mathematics that enable these models are exceedingly elegant. However, there still remain a few drawbacks to consider:

- It is unlikely that this is how we, as humans, naturally model and generate data; we do not generate samples as random noise that we iteratively denoise.
- The VDM does not produce interpretable latents. Whereas a VAE would hopefully learn a structured latent space through the optimization of its encoder, in a VDM the encoder at each timestep is already given as a linear Gaussian model and cannot be optimized flexibly. Therefore, the intermediate latents are restricted as just noisy versions of the original input.
- The latents are restricted to the same dimensionality as the original input, further frustrating efforts to learn meaningful, compressed latent structure.
- Sampling is an expensive procedure, as multiple denoising steps must be run under both formulations. Recall that one of the restrictions is that a large enough number of timesteps $T$ is chosen to ensure the final latent is completely Gaussian noise; during sampling we must iterate over all these timesteps to generate a sample.

As a final note, the success of diffusion models highlights the power of Hierarchical VAEs as a generative model. We have shown that when we generalize to infinite latent hierarchies, we are still able to learn powerful models of data, even if the encoder is trivial, the latent dimension is fixed, and Markovian transitions are assumed. This suggests that further performance gains can be achieved in the case of general, deep HVAEs, where complex encoders and semantically meaningful latent spaces can potentially be learned.

Acknowledgments: I would like to acknowledge Josh Dillon, Yang Song, Durk Kingma, Ben Poole, Jonathan Ho, Yiding Jiang, Ting Chen, Jeremy Cohen, and Chen Sun for reviewing drafts of this work and providing many helpful edits and comments. Thanks so much!

References

[1] P. Dhariwal and A. Nichol (2021) Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems 34, pp. 8780–8794.
[2] B. Efron (2011) Tweedie’s formula and selection bias. Journal of the American Statistical Association 106 (496), pp. 1602–1614.
[3] J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, pp. 6840–6851.
[4] J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans (2022) Cascaded diffusion models for high fidelity image generation. Journal of Machine Learning Research 23 (47), pp. 1–33.
[5] J. Ho and T. Salimans (2021) Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications.
[6] A. Hyvärinen and P. Dayan (2005) Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6 (4).
[7] D. P. Kingma and M. Welling (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[8] D. Kingma, T. Salimans, B. Poole, and J. Ho (2021) Variational diffusion models. Advances in Neural Information Processing Systems 34, pp. 21696–21707.
[9] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016) Improved variational inference with inverse autoregressive flow. Advances in Neural Information Processing Systems 29.
[10] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang (2006) A tutorial on energy-based learning. Predicting structured data 1 (0).
[11] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen (2022) Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125.
[12] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, et al. (2022) Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487.
[13] S. Saremi, A. Mehrjou, B. Schölkopf, and A. Hyvärinen (2018) Deep energy estimator networks. arXiv preprint arXiv:1805.08306.
[14] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli (2015) Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265.
[15] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther (2016) Ladder variational autoencoders. Advances in Neural Information Processing Systems 29.
[16] Y. Song and S. Ermon (2019) Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems 32.
[17] Y. Song and S. Ermon (2020) Improved techniques for training score-based generative models. Advances in Neural Information Processing Systems 33, pp. 12438–12448.
[18] Y. Song, S. Garg, J. Shi, and S. Ermon (2020) Sliced score matching: a scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence, pp. 574–584.
[19] Y. Song and D. P. Kingma (2021) How to train your energy-based models. arXiv preprint arXiv:2101.03288.
[20] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole (2020) Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.
[21] P. Vincent (2011) A connection between score matching and denoising autoencoders. Neural Computation 23 (7), pp. 1661–1674.