Understanding DDPM objective with basics (Maximum likelihood, Bayes Nets, Markov Models, KL divergence), Part 2

Luv Verma
8 min read · Apr 12, 2023


Welcome to Part 2 of the exciting blog series, where we’ll dive deep into the world of diffusion models — a cutting-edge class of generative models that are taking the AI landscape by storm!

In Part 2 of this blog series, I continue the journey toward the objective function of the Denoising Diffusion Probabilistic Models (DDPM) paper. Before diving into the derivation, we'll explore essential background concepts: the likelihood and maximum likelihood functions, the negative log-likelihood, KL divergence, Markov Models, and Bayesian Networks.

In continuation from the first blog (link to part 1)…

From the first blog, we know that equation 1 and equation 2 describe the forward and reverse diffusion processes, respectively:
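
Written out in the standard notation used by the DDPM paper, these are:

$$
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)
\tag{1}
$$

$$
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
\tag{2}
$$

where β(t) is the small noise variance added at step t, and μ-theta and Σ-theta come from the learned network.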

Let's work on equation 1. What's the need, you may ask? Well, it represents the transition in the forward diffusion process from time t-1 to time t.

Since there is no training involved in the forward process, can we not jump directly from time 0 to the current time t, instead of going from time t-1 to time t? Yes, we can. Let me show you how it is done in the DDPM paper.

From the Gaussian (Normal) distribution and the reparametrization trick, we know that we can re-write a sample from a normal distribution in terms of its mean and variance (equation 3 below):
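
$$
x = \mu + \sigma\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})
\tag{3}
$$

In words: a sample from N(μ, σ²I) is just the mean plus scaled unit Gaussian noise.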

Using equation 3 and equation 1, we can re-write equation 1 as shown in equation 4 below:
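
$$
x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon_{t-1}, \qquad \epsilon_{t-1} \sim \mathcal{N}(0, \mathbf{I})
\tag{4}
$$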

Time to revisit some of the notation used in the DDPM paper. For convenience, the following notation is used (equations 5). Take it with a pinch of salt for now, without thinking much about it. It will bear fruit later.
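
In the paper's notation:

$$
\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s
\tag{5}
$$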

Substituting the notation of equation 5 into equation 4, we get the following:
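
$$
x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_{t-1}
\tag{6}
$$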

What's the point of equations 5 and 6? Why wasn't equation 4 enough? Well, these equations condense the first term in equation 4, leading to the shorter version in equation 6. But why should I care? Look at x(t-1) sitting in the first term. It smells of recursion.

Yooo, Recursion !!!

It means I can recurse and drive the time index in the first term of equation 6 down to 0 instead of (t-1), which in turn means I can go directly from the initial time step 0 to the current time step t in the forward diffusion process.

Thus, the set of equations below (equation 7) gives us a way of reaching time step t directly from the initial time step 0:
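
The key step (in the usual notation) is that merging two independent zero-mean Gaussians yields another Gaussian whose variance is the sum of the two, which lets us collapse the recursion all the way down to x0:

$$
\begin{aligned}
x_t &= \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_{t-1} \\
    &= \sqrt{\alpha_t \alpha_{t-1}}\,x_{t-2} + \sqrt{1-\alpha_t \alpha_{t-1}}\,\bar{\epsilon}_{t-2} \\
    &\;\;\vdots \\
    &= \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon
\end{aligned}
\tag{7}
$$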

Using the last equation in equation 7, together with equations 3 and 4, we can re-write the forward diffusion process equation as:
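
$$
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\right)
\tag{8}
$$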

Thus, from the sets of equations 7 and 8, we have a simplified formulation of the forward diffusion process, one that lets us transition directly from step 0 to step t (reiterated as equation 9 below):
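
$$
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\right)
\tag{9}
$$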

Thus, I can say that forward diffusion is much simpler now (because of equation 9) than what we started with (equation 1).
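
To make equation 9 concrete, here is a minimal NumPy sketch of the one-shot jump from x0 to xt. The linear beta schedule and the 32×32 stand-in "image" are assumptions for illustration, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear noise schedule over T steps (the DDPM paper uses a
# similar linear schedule from 1e-4 to 0.02).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas            # alpha_t = 1 - beta_t          (equation 5)
alpha_bar = np.cumprod(alphas)  # alpha_bar_t = prod of alphas  (equation 5)

def q_sample(x0, t):
    """Jump straight from x_0 to x_t using equation 9 (no step-by-step loop)."""
    eps = rng.standard_normal(x0.shape)  # eps ~ N(0, I)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.standard_normal((32, 32))  # stand-in for an image
x_noisy = q_sample(x0, t=500)       # one shot from step 0 to step 500 (0-indexed)
```

Without equation 9, producing x_noisy would require looping through 500 applications of equation 1.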

Before diving further, it is time for some basics related to the likelihood function and probability.

Likelihood function:

Suppose we have observations (x0, x1, x2, …, x(T-2), x(T-1), x(T)) and assume that they come from some distribution we do not know, with an unknown parameter Theta. Then the likelihood is defined as in equation 10 below:
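
$$
\mathcal{L}(\theta) = p(x_0, x_1, \ldots, x_T \mid \theta)
\tag{10}
$$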

Now assume that all the observations (x0, x1, x2, …, x(T)) are independent. Then, just as a Bayes Net with no edges factorizes into a product of marginals, we can modify equation 10 into equation 11:
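
$$
\mathcal{L}(\theta) = \prod_{i=0}^{T} p(x_i \mid \theta)
\tag{11}
$$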

Equation 11 says that if all your observations are independent, then the likelihood is just the product of the probabilities of each observation x(i) given the parameter Theta.

Maximum Log-Likelihood:

From equations 10 and 11, I have the likelihood function and its representation as a product of probabilities of the observations given the parameter Theta. The maximum likelihood estimate is defined as the value of Theta which maximizes the (log-)likelihood (equation 12, and Figure 1):
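
$$
\hat{\theta} = \arg\max_{\theta} \log \mathcal{L}(\theta) = \arg\max_{\theta} \sum_{i=0}^{T} \log p(x_i \mid \theta)
\tag{12}
$$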

Figure 1: the maximum likelihood estimate picks the Theta that maximizes the (log-)likelihood.

Negative Log-Likelihood:

In general, an objective function is something we minimize, for example the error between two quantities (we will get to why I am saying this). For now, take it from me that our objective is to minimize an error. So, can we use equations 10, 11, and 12 to reach a very general objective function that we can work with? Yes, we can, and to do so we will invoke something called the Negative Log-Likelihood.

So, from equation 12 we know that the maximum likelihood estimate maximizes the log-likelihood. What if I modify equation 12 into equation 13, as shown below?
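
$$
\hat{\theta} = \arg\min_{\theta} \left[ -\sum_{i=0}^{T} \log p(x_i \mid \theta) \right]
\tag{13}
$$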

I did nothing fancy: I just introduced a negative sign in equation 13, and to keep the objective the same as in equation 12, I minimize instead of maximize.

The bracketed term in equation 13 is generally called the negative log-likelihood. Thus, in equation 14, we can re-write the negative log-likelihood in terms of the likelihood as:
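
$$
\mathrm{NLL}(\theta) = -\log \mathcal{L}(\theta)
\tag{14}
$$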

Now, let me simplify things. Using equations 11, 13, and 14, we get to the following set of equations (equation 15), where NLL is short for negative log-likelihood:
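
$$
\mathrm{NLL}(\theta) = -\sum_{i=0}^{T} \log p(x_i \mid \theta), \qquad
\hat{\theta} = \arg\min_{\theta} \mathrm{NLL}(\theta)
\tag{15}
$$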

In the DDPM paper, the goal of the reverse diffusion process is to recover the original image (x0) from the noisy image (xt). This in turn means that if we minimize the negative log-likelihood of the original image x0 alone, we are done; we do not need a separate negative log-likelihood term for every iteration step t.

Therefore, based on the above logic, we can re-write the objective function as (equation 16):
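
$$
\hat{\theta} = \arg\min_{\theta} \left[ -\log p_\theta(x_0) \right]
\tag{16}
$$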

Rewriting the same objective in the notation of the DDPM paper, we have equation 17:
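
In the paper's notation (the expectation is taken over the data):

$$
\mathbb{E}\left[ -\log p_\theta(x_0) \right]
\tag{17}
$$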

The focus on x0 in the NLL objective function (equation 17) ensures that the denoising network is guided toward the accurate recovery of the original data, resulting in a better generative model.

Simple, isn't it? Now, while we are on the topic of basics, let's discuss KL divergence.

Markov Models: Refer to my other blog.

Bayes Nets: Refer to my other blog.

KL divergence:

In very simple terms, KL divergence is a way to measure how different two probability distributions are from each other. It gives you an idea of how much extra information you would need if you used one distribution (q(x)) to approximate another distribution (p(x)).

For two probability distributions p(x) and q(x), the KL divergence formula is (equation 18):
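
$$
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}
\tag{18}
$$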

This formula sums, over each event x, the product of the probability p(x) and the logarithm of the ratio between p(x) and q(x). If p and q are very similar, the KL divergence will be close to 0. If they are very different, it will be larger. Note that KL divergence can never be negative (this follows from Jensen's inequality, not merely from the presence of the log) and is 0 exactly when p = q.
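
As a quick sanity check, here is a small NumPy sketch of equation 18 for discrete distributions (the distributions p and q below are made up for illustration):

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(p || q) from equation 18."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.4, 0.6])
print(kl_divergence(p, [0.38, 0.62]))  # similar distributions -> close to 0
print(kl_divergence(p, [0.9, 0.1]))    # very different -> much larger
print(kl_divergence(p, p))             # identical -> exactly 0
```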

The search for a simpler objective function: finding meaning in equation 17

Equation 17 as it stands is unknown (we cannot evaluate p-theta(x0) directly), but we can modify it into something workable. Invoking the idea behind the evidence lower bound (expressed in terms of KL divergence, equation 18), we can re-write equation 17 as:
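
Because KL divergence is never negative, adding it to the right-hand side can only increase the value, which is what makes the following an upper bound (written in the standard notation):

$$
-\log p_\theta(x_0) \le -\log p_\theta(x_0) + D_{\mathrm{KL}}\!\left( q(x_{1:T} \mid x_0) \,\|\, p_\theta(x_{1:T} \mid x_0) \right)
\tag{19}
$$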

What the heck is equation 19? You must be thinking that I promised to make it simple. Please be patient, and let me try to explain what is happening above.

Think of a scenario: we have an image that is completely noisy, and we know we have to get back to the original image. We have an objective function for that (equation 17), but it is unknown. What can we do? We have just learned the definition of KL divergence, and we know all about the forward diffusion process. We know the primary aim is to get back to the original image (x0) in the reverse diffusion process. We have a neural network in our kitty, and we know Bayes' Theorem (Wikipedia rules!). Now, if we think about it, we can say: okay, we know the original distribution (from the forward diffusion process), and we can get another distribution using the parametrized network. We can calculate the KL divergence between them and add it to the first term, the one we know nothing about.

Okay, that makes sense, but why? We'll see in the third and final part of this series, where we reach the objective function.

…since this is getting long, the derivation of the objective function is to be continued in Part 3.

(link to part 1)…

(link to part 3)…

If you like it or find it useful, please clap and share.
