NumByNum :: Denoising Diffusion Probabilistic Models (Ho et al., 2020) Reviewed

Aria Lee
20 min read · Nov 2, 2023


This review of “Denoising Diffusion Probabilistic Models (Ho et al., 2020)” begins at Number 1 and concludes at Number 155. I may add to or revise the content in the future. I am creating this post primarily for my personal understanding of the subject, and I humbly acknowledge that errors or inaccuracies may be present. If you happen to identify any such issues, please do not hesitate to bring them to my attention. Thank you, and I hope you enjoy😊!

1. My previous studies on diffusion have become a bit hazy, so I felt the need to revisit the topic.

2. So, today’s topic is the “Denoising Diffusion Probabilistic Models” (Ho et al., 2020) paper.

3. Yep, just like always, I’ll be taking you on a merry-go-round ride. 🎠

4. In 2015, a paper titled “Deep Unsupervised Learning using Nonequilibrium Thermodynamics” (Sohl-Dickstein et al., 2015) was published.

5. You might wonder why a deep learning paper suddenly mentions the word “thermodynamics.”

6. That’s because it drew inspiration from Brownian motion.

7. Let’s imagine dropping a clump of pollen into a pond, for instance. Shortly after it lands, the particles cluster together, but over time they gradually disperse and spread out. This random motion of particles scattering through a liquid or gas is called Brownian motion.

8. Looking at the illustration above, you can see that the particles inside the yellow circle don’t move in a predetermined direction but take random steps, guided by specific probability distributions each time. This gradual, random spreading of particles is called diffusion.

9. Now, if you wanted to revert the fully dispersed, nearly random state of pollen back to its initial round shape, how would you do it?

10. You’d have to retrace the steps of random motion taken so far in reverse, effectively going against the diffusion process. By continually going in reverse, you could eventually return to the initial state.

11. This idea, applied directly to image generation, is the basis of the diffusion model.

12. First, prepare a pristine image. This image represents the initial state of the pollen’s dispersion.

13. Now, add Gaussian noise to the image over 1,000 steps. After 10 steps, you can still recognize the original image, but after around 500 steps, it becomes increasingly challenging to identify the details. Adding noise continuously for 1,000 steps will render the original image completely unrecognizable.

image from here

14. From the leftmost image in the illustration above, we progressively add Gaussian noise at each step, eventually transforming it into a completely noisy Gaussian image.

15. This is the forward diffusion process, where the original image is diffused into a Gaussian image.

16. The process mirrors the gradual dispersion of pollen through random motion.

17. Now, what we want to do is straightforward.

18. Since we’ve defined the forward diffusion process from the cat to noise, we’ll create a reverse process to go from noise to the cat. We’ll insert any noise and extract the cat image.

19. The paper’s idea was to learn a method to remove noise layer by layer, allowing for the generation of a natural original image regardless of the noise.

20. Therefore, papers in the diffusion family typically involve 1) defining the forward process from the original image to complete noise, 2) defining the posterior q(x_{t-1} | x_t, x_0) for removing one step of noise, 3) training the backward model p_theta to approximate the true posterior, and 4) setting the loss function to bring the two close.

21. Let’s first examine how the original diffusion paper by Sohl-Dickstein et al. (2015) set up these four components.

22. First, the forward trajectory:

23. Equation (2) represents the forward movement from the t-1 image to the t image by adding a little more noise. Here, T_pi is the Markov diffusion kernel, and beta is the diffusion rate.
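
Since the equation images aren't reproduced here, equation (2), with the Gaussian kernel from the paper's appendix spelled out, reads roughly:

$$q(x_t \mid x_{t-1}) = T_\pi(x_t \mid x_{t-1}; \beta_t) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$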

24. So, what is the Markov diffusion kernel?

25. To understand this, we need to touch on what a Markov process or Markov chain is.

26. A (discrete-time) Markov process, or Markov chain, refers to a discrete-time stochastic process that satisfies the Markov property.

27. In other words, it’s a process whose state evolves randomly over time (stochastic process), where time advances in discrete steps (discrete-time), and which satisfies the Markov property.

28. Then, what does the Markov property mean?

29. The Markov property states that, given the past states (t=1, t=2, t=3) and the current state (t=4), the future state (t=5) depends solely on the current state and not on all past states.

30. For example, in the figure above, the probability of reaching A is determined only by whether the previous state was E or A, regardless of the path taken to get there (e.g., AEAEA or AAEAA).

31. This property ignores past states and only predicts the future based on the current state, earning the name “memoryless process.” The equation looks like this:
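
Since the equation image isn’t reproduced here, the standard statement of the Markov property is:

$$P(X_{t+1} = x \mid X_t, X_{t-1}, \ldots, X_1) = P(X_{t+1} = x \mid X_t)$$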

32. And this state transition, from one state in the Markov chain to the next, is called a transition, with the probability of this transition being the transition probability. The equation looks like this:
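
Likewise, the transition probability from state i to state j is written:

$$p_{ij} = P(X_{t+1} = j \mid X_t = i)$$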

33. In short, the Markov diffusion kernel is just a way of representing our Markov diffusion process, which undergoes diffusion one step at a time, satisfying the Markov property. For a more detailed definition, you can refer to the figure below and here.

34. Going back to Num 23, we’ve learned how to add noise from t-1 to t using the diffusion rate and Markov diffusion kernel through Equation (2). Now that we’ve constructed this one-step rightward movement, we need to define the entire forward trajectory from the original image to noise.

35. However, we only know how to move one step from step1 to step2, from step2 to step3, and so on.

36. So, we need to connect them all together like a chain.

37. This is where Equation (3) comes in.

38. On the right-hand side, we apply the Markov diffusion kernel q for t=1 to T, creating transitions from t=1 to t=2, t=2 to t=3, and so on up to t=T-1 to t=T, and multiply them all together; by the chain rule of probability and the Markov property, this defines the entire forward process q(x_{0:T}).
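
Written out in the blog’s notation, the chained forward trajectory is:

$$q(x_{0:T}) = q(x_0) \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$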

39. Now that we’ve defined the forward process, as part of the four components we discussed earlier, we need to create the posterior that removes one step of noise.

40. However, there is one problem.

41. The forward process, which adds Gaussian noise, is well-defined as Gaussian. We’d like the reverse process, which removes noise, to also be Gaussian, but it’s not certain if this is allowed. Can we claim that the reverse process is also Gaussian when the forward process is Gaussian?

42. Fortunately, Feller (1949) addressed this issue long ago, proving that, for diffusion with a small step size beta, whether the kernel is Gaussian or binomial, the forward and reverse processes have the same functional form.

43. Therefore, as mentioned in the figure above, since the Markov diffusion kernel we’ve defined follows a Gaussian distribution, when beta is small, we can represent the distribution for the reverse transition from step t to step t-1 as a Gaussian distribution.

44. Consequently, we define the reverse process in the same way as the forward process.

45. You’ll notice that the reverse process has the same functional form and structure as the forward process.

46. Now we need to define the loss function appropriately and train using the log likelihood of the reverse Markov transition.

47. While the actual paper presents very lengthy formulas, we’ll focus on the conclusion here since we’ll delve into how the loss is derived when we look at DDPM. For detailed formulas, refer to Section 2.4 and the Appendix in the paper.

48. For now, let’s just look at the conclusion.

49. As mentioned in Num 46, we’ll find the reverse Markov transition that maximizes the log-likelihood. At this point, K takes on the following form after various derivations:

50. Now, let’s summarize the basic concept of diffusion that we’ve looked at so far.

image from here

51. The initial diffusion presented by Sohl-Dickstein et al. (2015) can be understood as recasting the problem of estimating the backward distributions used to remove noise, via an appropriately manipulated loss function, into the problem of estimating the means and variances of a sequence of Gaussians.

52. Let’s summarize it with the four key components.

53. The Forward process defines a single step of noise propagation from step t-1 to step t using the diffusion rate beta. These steps are then multiplied successively to compute x_T from x_0.

54. Posterior refers to the target distribution for denoising one step. However, if T = 1,000, we can’t tractably estimate 1,000 separate denoising distributions one by one. Instead, we also condition on x_0, so the one-step denoising distribution q(x_{t-1} | x_t, x_0) can be computed in closed form.

55. Now, we train our backward model, p_theta, to approximate this target distribution, the posterior. This p_theta, when given x_t, predicts the means and variances of Gaussian distributions, revealing what Gaussian noise was added and removing it to create x_{t-1}.

56. As both forward and backward are Gaussian in our case, making the backward approximate the posterior involves simply calculating the KL divergence between the two Gaussian distributions.

57. It’s alright if some parts are still unclear; we’ll revisit them later when discussing DDPM.

image from here

58. Now, let’s finally move on to today’s topic, the “Denoising Diffusion Probabilistic Models” paper by Ho et al. (2020).

59. The first author is Jonathan Ho, and if you’ve dabbled in diffusion before, you might find the name familiar.

60. If you look at his publications list, there are many interesting papers to read. We will definitely cover “Classifier-free diffusion guidance” in a separate article later on.

61. Anyway, going back to the topic of diffusion, we examined the 2015 version of diffusion.

62. But there was an issue. While the idea and structure were ingenious, the performance was not up to the mark. To be precise, the quality of the generated images was quite poor.

63. The paper only demonstrated convincing samples at roughly MNIST level; on natural images such as CIFAR-10, the sample quality was far too poor to be usable for image generation.

64. To make matters worse, a formidable competitor entered the scene. 🥊💥

65. This famous competitor was “Generative Adversarial Networks” (Goodfellow et al., 2014). We’ll definitely summarize this paper in a separate article when we get the chance.

66. As a result, with GANs producing higher-quality results around the same time, research momentum shifted in their direction: GAN-related papers kept pouring in, while diffusion-related work relatively receded.

67. Nowadays, diffusion techniques are used in various domains, from images and videos to code generation. For instance, “CodeFusion: A Pre-trained Diffusion Model for Code Generation” (Singh et al., 2023) incorporates diffusion, but back then, diffusion research was far from mainstream.

68. But lo and behold, this DDPM paper swept in like a superhero, rescuing diffusion from the clutches of obscurity!🚀💥

69. “This paper presents progress in diffusion probabilistic models. A diffusion probabilistic model (which we will call a “diffusion model” for brevity) is a parameterized Markov chain trained using variational inference to produce samples matching the data after finite time. Transitions of this chain are learned to reverse a diffusion process, which is a Markov chain that gradually adds noise to the data in the opposite direction of sampling until signal is destroyed.”

70. "Diffusion models are straightforward to define and efficient to train, but to the best of our knowledge, there has been no demonstration that they are capable of generating high quality samples. We show that diffusion models actually are capable of generating high quality samples, sometimes better than the published results on other types of generative models. In addition, we show that a certain parameterization of diffusion models reveals an equivalence with denoising score matching over multiple noise levels during training and with annealed Langevin dynamics during sampling. We obtained our best sample quality results using this parameterization.”

71. So, let’s examine how the original diffusion and DDPM differ.

72. DDPM is also a type of diffusion, so it defines the forward process using a Markov chain and defines the posterior for denoising, then trains the model to approximate the posterior for the reverse process, just like before.

73. However, the crucial difference is the way they predict epsilon to estimate the Gaussian distribution, as opposed to directly predicting the means and variances of Gaussian distributions in the 2015 diffusion.

74. Let’s delve into this modification in more detail.

75. First and foremost, our paper summarizes the 2015 diffusion model like this. 👇📝

76. What it wants to learn is the reverse process p_theta(x_0:T) that restores (complete noise) x_T to (original image) x_0. To do this, we express the step of removing one step of noise, p_theta(x_{t-1} | x_t), using the mean (mu_theta) and variance (sigma_theta) of the normal distribution.
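
In the paper’s notation, this one-step denoising distribution (equation (1) of DDPM) is:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$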

77. Conversely, the forward process that goes from (original image) x_0 to (complete noise) x_T is defined using a Markov chain. If you look at the equation on the right, you can confirm that q(x_t | x_{t-1}) is defined using the diffusion rate beta.
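
That forward kernel (equation (2) of DDPM) is:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$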

78. Intuitively, you can understand this equation as slightly shrinking x_{t-1} (scaling it by the square root of 1 - beta) and adding noise with variance beta to get x_t. The square root of 1 - beta is there to keep the overall variance at 1.

79. To clarify, recall Var(aX) = a²Var(X) and Var(X + Y) = Var(X) + Var(Y) for independent X and Y, which you might have learned in statistics. Scaling by the square root of 1 - beta contributes variance 1 - beta, which, added to the beta from the noise, keeps the total variance at 1.
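
As a one-line check, assuming Var(x_{t-1}) = 1 and independent noise epsilon ~ N(0, I):

$$\mathrm{Var}(x_t) = \mathrm{Var}\!\left(\sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon\right) = (1-\beta_t)\cdot 1 + \beta_t = 1$$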

80. In any case, after defining the forward and reverse processes in this way, the 2015 diffusion optimizes the variational bound of the negative log likelihood expression as shown below.

81. So, as you can see from the above, the conclusion in Sohl-Dickstein et al. (2015) was to estimate the reverse process by predicting the means and variances of Gaussian distributions.

82. Anyway, as you can see from the fact that “theta” is attached to “mu” and “sigma” in Num 75 (indicating that they are trainable), the 2015 diffusion paper estimates the reverse process by predicting the mean and variance of the Gaussian distribution.

83. Now, our paper makes a slight modification to this.

84. First, we don’t want to add noise one step at a time; we want to jump multiple steps at once.

image from here

85. As shown in the figure, instead of going from x_0 to x_1, x_2, and x_3 to get to x_4, we want to jump directly from x_0 to x_4. To do this, we redefine the diffusion kernel.

86. Looking at equation (4), you can see that 1 - beta, which represents how much of the signal survives one step, is written as alpha. Then, alpha-bar_t is defined as the product of the alphas from alpha_1 to alpha_t.

87. Just as we defined the entire forward process by chaining one-step transitions, here too we multiply the alphas together to connect multiple steps at once.

88. This way, as alpha-bar_t approaches 0 (that is, as the diffusion rates beta grow or as t increases), less of the original image x_0 is preserved, and the resulting image x_t becomes closer to pure Gaussian noise.

89. By making this change, we can express the relationship between x_t and x_0 as a linear combination as shown below.
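
Written out (this is the reparameterized form of the paper’s equation (4)):

$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$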

90. Essentially, x_0 is preserved to the extent of the square root of alpha-bar_t, and random Gaussian noise sampled from N(0, I), scaled by the square root of 1 - alpha-bar_t, is added to it to create x_t.
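
As a quick sanity check, here’s a minimal PyTorch sketch of this one-shot forward jump. It uses the linear beta schedule from the paper’s experiments (T = 1000, beta from 1e-4 to 0.02); the variable names are mine, not the authors’.

```python
import torch

# Precompute the noise schedule: beta_t, alpha_t = 1 - beta_t,
# and alpha_bar_t = alpha_1 * ... * alpha_t.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # beta_1 ... beta_T
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)    # alpha_bar_t (cumulative product)

def q_sample(x0: torch.Tensor, t: int, eps: torch.Tensor) -> torch.Tensor:
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    ab = alpha_bars[t]
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps

x0 = torch.randn(1, 3, 32, 32)               # stand-in for a training image
x500 = q_sample(x0, t=499, eps=torch.randn_like(x0))  # jump straight to step 500
```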

91. Once we consider this change, the entire process transforms as follows.

image from here

92. In the middle box in the top row, now we can jump directly from x_0 to x_t instead of adding noise one step at a time. That is, we use x_0 and epsilon instead of conditioning on x_{t-1} to compute x_t.

93. Then, we multiply these steps from 1 to T successively to complete the entire forward diffusion process.

94. In reverse, the process of removing noise also changes.

95. Originally, the first equation in the middle box on the bottom row was used to directly compute mu and sigma. In DDPM, we instead compute x_{t-1} from x_t, t, and the predicted epsilon.

96. In this case, mu_theta is already defined as shown in the equation below.
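
For reference, that definition (equation (11) in the paper) is:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right)$$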

97. Sigma_theta also does not have to be trained; it is fixed in advance by the beta and alpha schedule.
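
Concretely, the paper fixes sigma to one of two untrained values, which it reports work about equally well:

$$\sigma_t^2 = \beta_t \qquad \text{or} \qquad \sigma_t^2 = \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t$$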

98. As a result, all terms in the equation for computing x_{t-1} are composed of pre-determined alpha and beta, and epsilon predicted through the model.

99. In other words, the problem of estimating the mean and variance of a Gaussian distribution, which we had in the original diffusion, has now turned into a problem of predicting just one epsilon.

100. The epsilon_theta, which is responsible for predicting epsilon, is a simple U-net structure that takes x_t and step t as inputs to produce epsilon.

image from here

101. In other words, by training only this U-net, we can obtain epsilon, and with this epsilon alone, we can calculate the mean (mu) and standard deviation (sigma) of the Gaussian distribution. Consequently, we can obtain the probability distribution that goes from step t to step t-1.
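
Putting Nums 95 to 101 together, here is a hedged PyTorch sketch of one reverse step, reusing the schedule tensors from the sketch above; `model` is a placeholder for any epsilon-predicting network, such as the U-net just described.

```python
import torch

@torch.no_grad()
def p_sample(model, x_t: torch.Tensor, t: int) -> torch.Tensor:
    """One reverse step: compute x_{t-1} from x_t and the predicted epsilon."""
    t_batch = torch.full((x_t.shape[0],), t)          # step index per example
    eps_pred = model(x_t, t_batch)                    # epsilon_theta(x_t, t)
    a, ab, b = alphas[t], alpha_bars[t], betas[t]
    # mu_theta = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)
    mean = (x_t - b / (1.0 - ab).sqrt() * eps_pred) / a.sqrt()
    if t == 0:
        return mean                                   # no noise at the final step
    z = torch.randn_like(x_t)
    return mean + b.sqrt() * z                        # the sigma_t^2 = beta_t choice
```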

102. Now, all that’s left is to define the loss function.

103. Originally, in the 2015 diffusion, training was carried out using the following equation.

104. Now, we’re going to transform this equation a bit. You can find a detailed derivation here.

image from here

105. Once you expand the log-likelihood equation according to Bayes’ rule, you get equation (1).

106. Then, as shown in (2), even if you multiply q(x_{1:T} | x_0) to both the denominator and the numerator, the equation still holds.

107. However, the part highlighted in red represents the KL divergence between q(x_{1:T} | x_0) and p_theta(x_{1:T} | x_0), which is always non-negative. Therefore, (3) holds true.

108. I plan to write a separate article about KL divergence when I have the chance, so for now, let’s briefly touch on it.

109. In simple terms, KL divergence is a value that represents the difference between two probability distributions.

image from here

110. If we have probability distributions Q(x) and P(x) as shown in the picture, the difference between these two distributions, known as KL divergence, can be visualized as the area of the green function on the right. For continuous distributions, KL divergence can be expressed in the formula below.
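
Since the formula image isn’t shown here, it reads:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)}\,dx$$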

111. Let’s explore what this formula actually means.

112. Let’s think about the binary cross-entropy formula that we commonly use.

113. When the correct answer is y and the model’s prediction is p(y), the cases can be divided into four: 1) the correct answer is 1, and the prediction is 1, 2) the correct answer is 1, and the prediction is 0, 3) the correct answer is 0, and the prediction is 1, 4) the correct answer is 0, and the prediction is 0.
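
For reference, with label y and prediction p(y), the binary cross-entropy formula is:

$$\mathrm{BCE} = -\left[\,y \log p(y) + (1-y)\log\big(1-p(y)\big)\right]$$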

114. In case 1), the bracket gives 1 * log 1 + 0 * log 0 = 0 (using the convention 0 * log 0 = 0), so the loss is 0, treating the prediction as correct. Similarly, in case 4), 0 * log 0 + 1 * log 1 = 0, so the loss is also 0.

115. On the contrary, in case 2), the bracket gives 1 * log 0 + 0 * log 1, and with the preceding negative sign, the loss blows up toward infinity. Similarly, in case 3), 0 * log 1 + 1 * log 0 likewise results in an infinite loss.

116. This is how binary cross-entropy guides the model towards the correct answer through loss. It’s called cross-entropy because we ‘cross’ multiply q and p.

117. If we extend this to continuous distributions instead of binary, it leads to the formula below.
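
Since the formula image isn’t shown here: for a true distribution P and a predicted distribution Q, the continuous cross-entropy is:

$$H(P, Q) = -\int p(x)\log q(x)\,dx$$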

118. Now, let’s get back to the KL divergence equation.

119. Expanding the logarithm here results in this.

image from here
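
Written out, the expansion is:

$$D_{\mathrm{KL}}(P \,\|\, Q) = -\int p(x)\log q(x)\,dx + \int p(x)\log p(x)\,dx = H(P, Q) - H(P)$$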

120. However, the red term highlighted here where p multiplies log q is essentially the same as the cross-entropy in equation (8). So, it represents the difference between the model’s prediction q and the actual correct answer p.

121. On the other hand, the blue term where p multiplies log p is, up to sign, the definition of entropy, as you can see in the diagram below. It remains a constant determined by p, regardless of the model’s prediction q.

122. In conclusion, when we use cross-entropy loss, we effectively disregard the constant blue part from KL divergence’s formula and only use the red part where p and q are multiplied to optimize the model.

123. Let’s go back to the equation in Num 119.

124. Here, it is important to note that, as we saw in the example of binary cross-entropy in Num 114, cross-entropy yields a smaller loss when the correct answer and the prediction agree. In other words, the closer the distribution of the prediction Q is to that of the correct answer P, the smaller the cross-entropy becomes.

125. Therefore, the cross-entropy H(P, P) of a distribution with itself, which equals the entropy H(P), is never larger than the cross-entropy H(P, Q) with a different distribution (this is Gibbs’ inequality).

126. In conclusion, KL divergence, defined as H(P, Q) - H(P), is always non-negative, since it subtracts the smaller value from the larger.

127. This is why I mentioned in Num 107 that the KL divergence term is always non-negative, so equation (3) holds.

128. So, up to this point, we have examined the reason equation (3) holds using the definition of KL divergence, and now we will continue with the original derivation of the equation.

image from here

129. In the diagram, (4) is essentially the same as (3) in a different notation and is therefore equivalent.

130. Also, (5) naturally holds as we have already defined the forward process in this way.

131. (6) is simply a separation of the formula in (5) using the properties of logarithms.

132. Let’s try transforming the equation a bit further.

image from here

133. (7) is a direct copy of (6), so let’s move on. (8) separates the part for t=1 to transform the equation.

134. However, the blue term highlighted in (8) can be rewritten as follows.

135. According to the Markov property, x_t is derived solely from x_{t-1}, whether x_0 is present or not, making the first equality hold. Expanding this using Bayes’ rule leads to the last equation.
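
Written out, with Bayes’ rule applied to the middle expression:

$$q(x_t \mid x_{t-1}) = q(x_t \mid x_{t-1}, x_0) = \frac{q(x_{t-1} \mid x_t, x_0)\,q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)}$$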

136. Using this, we can replace the blue term in (8) with the blue term in (9).

137. Expanding the term under the summation according to the properties of logarithms results in (10). However, the third term in (10) is a telescoping sum of log-ratios of q’s; when expanded, adjacent numerators and denominators cancel, ultimately simplifying to (11).

138. Conveniently, the last term in (11) has q(x_1 | x_0) in its denominator. That blue denominator can therefore be combined with the leading p_theta term; the red terms cancel, leaving only the numerator.

139. This results in equation (12).

140. Here, if you look closely, both the first and second terms in (12) can be expressed as KL divergence.

141. This intricate process led us to the loss term in equation (5).
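
Spelled out, equation (5) of the DDPM paper decomposes the bound into named terms (including an L_0 reconstruction term for the final decoding step, which the discussion below sets aside):

$$L = \mathbb{E}_q\!\left[\underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{L_T} + \sum_{t>1} \underbrace{D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}} \underbrace{-\log p_\theta(x_0 \mid x_1)}_{L_0}\right]$$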

142. From this, we obtain the loss terms above. The first, L_T, concerns the forward process from x_0 to x_T and is not affected by our training, since we predetermined the diffusion rate beta as a constant.

143. The second term, L_{t-1}, is precisely the term that makes the backward process approximate the posterior. In other words, this term helps the distribution we want to train to resemble the true noise-removing target distribution.

144. In the original 2015 version of diffusion, this L_{t-1} term was calculated as follows.

145. Since both posterior and p_theta are Gaussian, we can calculate KL divergence between the two as simply as in equation (8).
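
For reference, equation (8) of the paper writes this Gaussian-to-Gaussian KL as a simple weighted mean-squared error (C is a constant that doesn’t depend on theta):

$$L_{t-1} = \mathbb{E}_q\!\left[\frac{1}{2\sigma_t^2}\left\|\tilde\mu_t(x_t, x_0) - \mu_\theta(x_t, t)\right\|^2\right] + C$$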

146. However, we decided to predict epsilon instead of mu and sigma.

147. So, we can simplify this equation a bit further.

148. In other words, we’ve simplified the previous L_{t-1} term, which used to predict mu, to predict epsilon.
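
The resulting simplified objective (equation (14) in the paper, which also drops the weighting factor in front of the squared error) is:

$$L_{\mathrm{simple}}(\theta) = \mathbb{E}_{t,\,x_0,\,\epsilon}\!\left[\left\|\epsilon - \epsilon_\theta\!\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\big)\right\|^2\right]$$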

149. Now, let’s recap the four components of DDPM and wrap up.

150. Similar to the 2015 version of diffusion, DDPM defines the forward process through a Markov chain. It also uses the diffusion rate beta to add noise, but the difference is that it introduces a shortcut that creates x_t directly from x_0 using alpha-bar.

151. Then, it defines a posterior that performs one-step denoising given x_t and x_0.

152. The backward process is defined, and instead of directly predicting mu and sigma, we express them as a function of epsilon, which is then estimated by a neural network.

153. Consequently, the loss function has been simplified into a regression on epsilon.

154. In summary, the algorithm can be described as follows.
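
Since the algorithm figure isn’t reproduced here, below is a hedged PyTorch sketch of the paper’s Algorithm 1 (training) and Algorithm 2 (sampling), reusing the schedule tensors and p_sample from the sketches above; `unet` is a placeholder for any epsilon-predicting network, and this is not the authors’ official code.

```python
import torch
import torch.nn.functional as F

def train_step(unet, optimizer, x0: torch.Tensor) -> float:
    """Algorithm 1 (training): sample t and eps, then regress eps with an MSE."""
    t = torch.randint(0, T, (x0.shape[0],))           # uniform step per example
    eps = torch.randn_like(x0)                        # eps ~ N(0, I)
    ab = alpha_bars[t].view(-1, 1, 1, 1)              # batched alpha_bar_t
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps    # q_sample, batched
    loss = F.mse_loss(unet(x_t, t), eps)              # L_simple
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def sample(unet, shape=(1, 3, 32, 32)) -> torch.Tensor:
    """Algorithm 2 (sampling): start from pure noise and denoise T times."""
    x = torch.randn(shape)                            # x_T ~ N(0, I)
    for t in reversed(range(T)):
        x = p_sample(unet, x, t)                      # one reverse step
    return x
```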

155. DDPM significantly improved the performance of diffusion, but it still has certain limitations. To overcome these limitations, the paper “Denoising Diffusion Implicit Models (Song et al., 2021)” proposed some solutions, which we’ll explore in a separate article soon.

This concludes the review, with the current content covering up to number 155. I may add or edit information in the future to provide updates. This post is primarily for my personal understanding of the topic, and I kindly acknowledge that there may be errors or inaccuracies present. If you come across any, please feel free to bring them to my attention.

Thank you for taking the time to read this, and congratulations🎉🎉!
