NumByNum :: Gaussian Dreamer — Fast Generation from Text to 3D Gaussian Splatting with Point Cloud Priors (Yi et al., 2023) Reviewed

Aria Lee
26 min read · Oct 17, 2023

If you’re wondering why these posts keep popping up so frequently, it’s because when there’s work to be done, procrastination becomes the world’s most captivating pastime, paws down! 🐾😸📚

This review of “Gaussian Dreamer — Fast Generation from Text to 3D Gaussian Splatting with Point Cloud Priors (Yi et al., 2023)” begins at Number 1 and concludes at Number 207. For those of you who’ve already savored the delights of our previous 3D Gaussian Splatting article (Numbers 107 to 185), fear not! You’re in for a sweet rerun! 🎥🍿✨

I may make additions or revisions to the content in the future for updates. I am creating this post primarily for my personal understanding of the subject, and I humbly acknowledge that errors or inaccuracies may be present. If you happen to identify any such issues, please do not hesitate to bring them to my attention. Thank you, and hope you enjoy😊!

1. Suddenly, I saw this paper popping up here and there, and it seemed to be about generating 3D objects from text using 3D Gaussian Splatting and Diffusion, so I decided to take a closer look.

2. It’s none other than “GaussianDreamer: Fast Generation from Text to 3D Gaussian Splatting with Point Cloud Priors” (Yi et al., 2023).

3. The first author is Taoran Yi, who hails from Huazhong University of Science and Technology. It appears that this paper was a collaborative effort between Huazhong University of Science and Technology and the Huawei research team.

4. When looking at other papers the author has participated in, there’s “Fast Dynamic Radiance Fields with Time-Aware Neural Voxels” (Fang et al., 2022).

5. NeRF takes too long to optimize a single scene. In contrast, using an explicit data structure like voxels can significantly speed up the process. So, they propose a radiance field framework called TiNeuVox, which utilizes time-aware voxel features.

6. In the same spirit, our paper can be trained in just 25 minutes on a single GPU and offers real-time rendering. While TiNeuVox used voxels as its explicit data structure, our paper employs 3D Gaussians (as proposed in 3D Gaussian Splatting).

7. Whether it’s reconstruction or generation, I can see a fierce determination to speed up the 3D modeling process. It’s like you can feel the resolute gaze beneath those sunglasses! 😎💨

8. So, let’s dive deeper into our paper.

9. The paper’s main task is text-to-3D generation. Just like text-to-(2D) image generation, the aim is to generate natural-looking 3D assets that match a concise text prompt.

10. This field has been widely researched, leading to various approaches. Among them, the most prominent approaches are VAE, GAN, and diffusion.

11. VAE uses an encoder to transform the input into a latent space and a decoder to generate from that space. We’ll delve into this in more detail in a separate article.

12. GAN features a generator that creates fake samples and a discriminator that determines whether they are real or fake. The generator gets better at creating realistic fake images through competition.

13. It’s akin to a forger making counterfeit money, which the police try to detect. This way, forgers continually improve their skills.

14. Diffusion involves applying Gaussian noise to the original image x_0 over T timesteps (e.g., 100 or 1,000) to create x_T.

15. This is the noising process.

16. The model then learns to remove one layer of Gaussian noise from x_t to x_t-1.

17. This is the denoising process, aimed at eliminating noise.

18. Once a model trained for denoising is ready, introducing random Gaussian noise and repeatedly applying denoising generates a new x_0 image.

19. To truly understand diffusion, it’s advisable to read papers like DDPM or DDIM, which might warrant separate articles. So, this article will provide a level of detail sufficient to comprehend our paper, and any gaps can be filled later.

20. With multiple methods for solving 2D image generation, and significant performance improvements, it’s only natural to desire extending this to 3D objects.

21. Previously, if you generated a 2D image of a “red sedan with a wild horse logo racing down the road,” now you want to create a 3D object of that sedan and observe it from all angles.

22. However, 3D models in the field, including generation, constantly face a major issue.

23. Unlike 2D, 3D data is scarce.

24. While there are plenty of photos for 2D, obtaining enough 3D data for learning is challenging, both due to the limited existing data and the difficulty in creating new datasets.

25. Therefore, 3D generation models can be classified into two categories.

26. The first group insists that even with limited data, both the learning and data should be in 3D. Notable examples include Point-E, Shap-E, and 3DGen.

27. Training with 3D data offers the advantage of good 3D consistency, but expanding the dataset is challenging, the quality of the 3D assets that can be generated is limited, and expanding the generation domain is also tough.

28. The second group questions, “Why not make use of what’s been thoroughly researched in 2D?”

29. Since 3D is essentially one more dimension than 2D, and there are already many high-performing 2D generation models, they suggest starting with 2D generation and lifting the final 2D image to 3D.

30. The leader of this group is DreamFusion.

31. We will delve into DreamFusion in a separate article. For now, just understand it as a model that uses 2D diffusion models to facilitate easier and more accurate 3D object creation, instead of solely and directly generating 3D objects.

32. This approach of repurposing 2D models is promising, but it also has its challenges.

33. When you simply lift a 2D image to 3D (especially for complex objects), maintaining geometry consistency is difficult because crucial information about the camera view is missing.

34. Geometry consistency, in this context, means that images of a single 3D object taken from different viewpoints should agree on the object’s shape; when per-pixel depths are inconsistent across views, the set of images cannot be combined into a coherent representation of the object. For a more detailed definition, you can refer to this paper.

35. In this context, our paper maintains the fundamental approach of DreamFusion, utilizing 3D Gaussian Splatting to address this 3D consistency issue.

36. Let’s dive into the details.

37. The overall pipeline of our paper involves: 1) providing an input text prompt, 2) using this text in an existing 3D diffusion model to create a point cloud, 3) rendering this point cloud into a 2D image through 3D Gaussian splatting, 4) inputting this image into an existing 2D diffusion model to calculate loss, and 5) updating the 3D Gaussian.

38. To understand the pipeline, we need to break it into three major parts.

39. First is how the 3D diffusion model works, which converts input text into a point cloud. This corresponds to steps 1 and 2 in the pipeline.

40. Second is understanding what the 3D Gaussian Splatting, which transforms the generated point cloud into a 2D image, is all about. This corresponds to step 3 in the pipeline.

41. Third is how the 2D diffusion model operates, which calculates loss based on the generated 2D image. This corresponds to steps 4 and 5 in the pipeline.

42. While it may sound like a lot, it’s not as complex as it seems.

43. Let’s start by examining the first part, the 3D diffusion model.

44. In our paper, the 3D diffusion model takes text input, such as “fox,” and is used to generate a point cloud representing this “fox.” It’s illustrated well in the image below.

45. However, one might wonder why we are generating point clouds in the first place.

46. If you’ve read my previous post about 3D Gaussian Splatting, you’d understand that this is because the next step, 3D Gaussian Splatting, takes point clouds as input.

47. Anyway, in our paper, we employ Shap-E for this 3D diffusion model that transforms the prompt into a point cloud. Specifically, we use “Shap-E: Generating Conditional 3D Implicit Functions” (Jun et al., 2023) for this purpose.

48. Shap-E consists of two parts: 1) an encoder that transforms explicit 3D assets into the parameters of an implicit function, and 2) a diffusion model trained on the latent representations obtained from that encoder.

49. We’ll focus on the model’s structure, skipping the details for now.

50. The encoder looks like this.

51. Looking at the description of the encoder architecture in the appendix: for each 3D asset, the encoder takes 1) an RGB point cloud with 16,384 points and 2) a multiview input of 20,480 vectors built from 20 views of 256 x 256 images rendered from random camera angles. The multiview input corresponds to the patch embedding in the image above, since these are image-like representations created by rendering.

52. First, a point convolution layer takes this input point cloud and transforms it into a 1K embedding. This embedding is combined with a learned input embedding to form the complete query sequence h, shown below the cross-attention block in the image.

53. Then, cross-attention is computed between the input point cloud and the query embedding h, updating h.

54. In the next block, as mentioned in 51, the multiview patch embedding is processed with h through cross-attention, updating h once more.

55. This h passes through the transformer block to obtain a latent vector, which is normalized to the range between -1 and 1 using the tanh function.

56. Up to this point, we’ve obtained the latent representation for the diffusion, but we haven’t acquired the MLP parameters yet. So, this step is solely about embedding the 3D asset as a latent vector.

57. Now, we need to run diffusion on this calculated latent vector. The Shap-E paper doesn’t explain this part in much detail, but it does mention that it adopts the transformer-based diffusion architecture from the Point-E paper, so you can likely find more details there.

58. To understand how this diffusion process works, you should refer to the “Point-E: A System for Generating 3D Point Clouds from Complex Prompts” (Nichol et al., 2022) paper. I got curious about why we referenced this paper, and it turns out that the first author of Shap-E was also involved in the Point-E paper. It seems we can directly reference and adopt the Point-E model.

59. If you read it, you’ll see that the approach involves using the Gaussian diffusion setup from the famous “Denoising Diffusion Probabilistic Models” (Ho et al., 2020) paper.

60. We won’t delve into the DDPM paper here, but understanding the key concepts relevant to our paper is essential.

61. The idea behind diffusion techniques like DDPM is to 1) add Gaussian noise to the image at time t-1 to create the image at time t, 2) train a model that, given the image at time t, recovers the image at time t-1, and 3) starting from pure Gaussian noise at time T, repeatedly apply this denoising model to obtain a clean (t=0) image.

62. In other words, we train a model to strip away one layer of noise at a time, and by starting from random Gaussian noise and denoising repeatedly, we can obtain a realistic image at t=0.

63. The operation where Gaussian noise is applied to the image at t-1 to produce the image at t is referred to as “noising,” and it can be mathematically expressed as follows:
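For reference, in standard DDPM notation (with β_t denoting the noise-schedule value at step t), this forward step is usually written as:

q(x_t | x_{t-1}) = N( x_t ; √(1 − β_t) · x_{t-1}, β_t · I )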

64. To understand what Beta represents and how this equation is derived, you can check this source.

65. However, if, for instance, t = 1,000, sampling x_t this way would require applying the noising step 1,000 times in sequence, which is highly time-consuming.

66. So, Ho et al., (2020) introduced a shortcut that allows you to jump from the original image x_0 to the fully noisy image x_t, bypassing the need for incremental steps.
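Concretely, defining α_t = 1 − β_t and ᾱ_t as the cumulative product of the α’s up to step t, the shortcut from Ho et al. (2020) is:

x_t = √(ᾱ_t) · x_0 + √(1 − ᾱ_t) · ε, with ε ~ N(0, I)

so any x_t can be sampled directly from x_0 in a single step.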

67. We can explore why this is possible, how this fits into the broader context of diffusion, and its connections to other diffusion papers like Denoising Diffusion Implicit Models (Song et al., 2020) in a separate article. For now, let’s accept this equation and move on.

68. In summary, a diffusion model aims to mimic q(x_t-1|x_t) using a neural network p_theta(x_t-1|x_t). When you train this model, you can denoise a sample from random Gaussian noise x_T step by step to obtain the noiseless sample x_0.
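To make 61–68 a bit more concrete, here is a minimal PyTorch sketch of the two operations, assuming a generic noise-prediction network model(x_t, t) (hypothetical) that has already been trained to predict the added noise:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # noise schedule beta_t
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)      # cumulative product alpha_bar_t

def noising(x0, t):
    """Jump straight from x_0 to x_t using the closed-form shortcut."""
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t]
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps, eps

@torch.no_grad()
def sample(model, shape):
    """Start from pure Gaussian noise x_T and denoise step by step to x_0."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        eps_hat = model(x, t)                  # predicted noise at step t
        a, a_bar = alphas[t], alpha_bars[t]
        mean = (x - (1 - a) / (1 - a_bar).sqrt() * eps_hat) / a.sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise     # simple sigma_t^2 = beta_t choice
    return x
```

The sampling loop above uses the simplest variance choice (σ_t² = β_t); real implementations often use other schedules, but the structure is the same.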

69. Now, let’s see how Point-E applies this general diffusion concept to create point clouds.

70. Point clouds are initially a combination of coordinates (x, y, z) and colors (r, g, b) packed into a K x 6 tensor. K represents the number of points, and these six numbers are all normalized between -1 and 1.

71. This normalization to -1 to 1 might be due to the use of the tanh function in Shap-E.

72. Next, they convert this random noise of shape K x 6 into K x D using a linear layer and apply the denoising technique mentioned in 61 to complete the target point cloud.

73. Conceptually, that’s how it works. Let’s take a look at diagrams to better understand the architecture.

74. Point-E passes an input image through a frozen, pretrained CLIP model and takes the resulting output grid of tokens. It also takes the current diffusion timestep t and the corresponding noisy point cloud x_t. It feeds these tokens into a transformer to predict epsilon and sigma, the quantities needed for the denoising step on each point.

75. The exact definitions of epsilon and sigma can be found in the code on GitHub, specifically in the sampler.py and gaussian_diffusion.py files. After confirming these details, further updates can be made.

76. In a nutshell, the input sequence for Point-E is (K point tokens + 256 CLIP embedding tokens + 1 timestep token) x D, and the output sequence consists of K output tokens. These tokens determine epsilon and sigma for each input point.
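To make the token layout concrete, here is a rough sketch of how such an input sequence could be assembled; the module names and sizes are illustrative, not the actual Point-E code:

```python
import torch
import torch.nn as nn

K, D = 1024, 512
clip_tokens = torch.randn(1, 256, D)            # frozen CLIP image grid, projected to width D
t_embed = torch.randn(1, 1, D)                  # embedding of the diffusion timestep t
x_t = torch.randn(1, K, 6)                      # noisy point cloud: (x, y, z, r, g, b)

point_proj = nn.Linear(6, D)                    # K x 6 -> K x D, as in point 72
point_tokens = point_proj(x_t)

# Concatenate conditioning tokens and point tokens into one sequence.
seq = torch.cat([clip_tokens, t_embed, point_tokens], dim=1)   # (1, 256 + 1 + K, D)

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D, nhead=8, batch_first=True), num_layers=2
)
out = backbone(seq)

# Only the last K output tokens are used; a head maps each one to epsilon and
# sigma for its point (6 channels each in this sketch).
head = nn.Linear(D, 6 * 2)
eps, sigma = head(out[:, -K:]).chunk(2, dim=-1)
print(eps.shape, sigma.shape)                   # (1, K, 6) each
```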

77. Using this approach, you obtain a point cloud consisting of K points.

78. Returning to the previous explanation, Shap-E utilizes the encoder structure to obtain a latent vector h from the point cloud and then updates it with the Point-E diffusion model. To be precise, there may be slight differences in the details, but we’ll verify those as needed in the Shap-E paper and move on for now.

79. Up to this point, we’ve covered the “DiffusionNoise(h)” part of the algorithm.

80. Next, through the “latent projection” block seen in diagram 78, h’ is projected.

81. By projecting h’ in this manner, each output becomes a row in an MLP weight matrix.

82. Obtaining this MLP weight matrix essentially means converting information from the explicit data structure of the point cloud into an implicit function.

83. The output achieved in this step can be used for NeRF or STF rendering.

84. You’ve likely come across information about NeRF in a previous article about 3D Gaussian Splatting. To briefly recap:

85. NeRF combines N 2D images with corresponding camera location data, using them to create training data. It takes 5D coordinates (x, y, z, theta, phi) as input, and for each point, it outputs (color, density) through a fully-connected neural network.
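In symbols, the NeRF network is a single mapping

F_Θ : (x, y, z, θ, φ) → (r, g, b, σ)

where (r, g, b) is the emitted color and σ is the volume density at that point.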

86. STF is a 3D representation that combines a signed distance function with a texture (color) field.

87. So, what’s a signed distance function, you ask? 🤔

88. It’s a function that, as the name suggests, attaches a sign to distances. It is used as another way of representing 3D shapes.

89. A signed distance function attaches a sign to distance, which means it determines whether a point is inside or outside a 3D object. In simple terms, it returns the distance from a point x to the surface of a 3D object, with negative values indicating inside and positive values outside.

90. In other words, the signed distance function gives you the surface of the 3D object as the points where the function equals 0. This is another way to represent a 3D object implicitly.
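As a concrete example, the signed distance function of a sphere with center c and radius r is

f(p) = ||p − c|| − r

which is negative for points inside the sphere, exactly zero on the surface, and positive outside.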

91. To summarize, Shap-E encodes a 3D asset into a latent vector that can be projected into MLP weights (an implicit representation), and trains a diffusion model over these latents; at generation time, a random latent is gradually denoised according to the prompt, and the decoded result gives us the desired 3D asset, from which the point cloud is obtained.

92. And it’s Shap-E that serves as the 3D diffusion model in our paper.

93. Currently, we’re examining the Gaussian Dreamer’s overall pipeline, focusing on three parts: 1) transforming input text into a point cloud using the 3D diffusion model, 2) 3D Gaussian splatting for rendering point clouds into 2D images, and 3) optimizing 3D Gaussian with losses from rendered images using the 2D diffusion model.

94. With that, we’ve roughly gone through the first chunk of the pipeline, which involves using the 3D diffusion model to turn input text into point clouds.

95. Now, we need to pass these point clouds to the next stage, but doing so directly results in subpar initialization quality, which is not a good starting point.

96. So, we make two additional corrections.

97. These are “noisy point growing” and “color perturbation,” which are mentioned in the diagram below.

98. Noisy point growing is basically about expanding the family of points. 🌱😉

99. “First, we compute the bounding box (BBox) of the surface of pt_m and then uniformly grow point clouds pt_r(p_r, c_r) within the BBox. p_r and c_r represent the positions and colors of pt_r. To enable fast searching, we construct a KDTree K_m using the positions p_m. Based on the distances between the positions p_r and the nearest points found in the KDTree K_m, we determine which points to keep. In this process, we select points within a (normalized) distance of 0.01.”

100. Then, this newly generated noisy point cloud should have colors that are similar to those of the nearby original points. So, color perturbation is applied to achieve this.

101. Perturbation means making slight changes. It involves tweaking the colors of nearby original points just a bit (perturbation) to create new colors for the new points.
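Here is a minimal sketch of noisy point growing plus color perturbation as described in 99–101, assuming p_m and c_m are the (N, 3) positions and colors of the point cloud from the 3D diffusion model; the number of grown points and the noise scale are illustrative:

```python
import numpy as np
from scipy.spatial import cKDTree

def grow_and_perturb(p_m, c_m, n_grow=100_000, keep_dist=0.01, color_noise=0.02):
    # 1) Uniformly sample candidate points inside the bounding box of p_m.
    lo, hi = p_m.min(axis=0), p_m.max(axis=0)
    p_r = np.random.uniform(lo, hi, size=(n_grow, 3))

    # 2) Keep only candidates within a small distance of the original points,
    #    using a KD-tree over the original positions for fast nearest-neighbor search.
    tree = cKDTree(p_m)
    dist, idx = tree.query(p_r)
    mask = dist < keep_dist
    p_r, idx = p_r[mask], idx[mask]

    # 3) Color perturbation: copy the nearest original point's color and add a
    #    small random tweak so the new points blend in with their neighborhood.
    c_r = np.clip(c_m[idx] + np.random.normal(0.0, color_noise, size=(mask.sum(), 3)), 0.0, 1.0)

    # The final initialization is the union of the original and grown points.
    return np.concatenate([p_m, p_r]), np.concatenate([c_m, c_r])
```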

102. The process of appropriately adjusting the point cloud obtained from the 3D diffusion model is represented by the algorithm in the image below.

103. This way, we have finally obtained the ultimate point cloud to feed into the model.

104. Now, let’s finally move on to the second component. This is the part where the point cloud generated in this manner is transformed into 2D images through 3D Gaussian splatting.

105. We’ve summarized the details of this in a separate article, so if you’re interested in more in-depth information, you can head there to read up. For those who want to cover everything in this article, here are the key takeaways.

106. If you’ve already delved into the previous article, feel free to take the express route to Number 185. Way to go! 🚀😉

107. The most common way to represent 3D shapes was through explicit representation like meshes, point clouds, or voxels. They had the advantage of allowing fast training and rendering, and it was easier to retrieve features from explicit data structures.

108. On the other hand, implicit representation, as seen in NeRF, involved stochastic sampling, making rendering computationally intensive. However, it excelled in optimization due to its continuity.

109. 3D Gaussian splatting is the fusion of these two advantages.

110. 3D Gaussian splatting represents the scene using Gaussians, much like how meshes use triangles for triangle rasterization.

111. While a triangle is defined simply by its three vertices, we still need to decide how to parameterize a Gaussian.

112. So, they use four parameters: position (x, y, z), covariance (how it’s stretched/scaled: 3x3), color (RGB), and alpha to represent the Gaussian on the right.
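Just to make the parameterization concrete, here is a toy container for those four quantities (the real implementation stores large tensors of these, and the color eventually becomes SH coefficients rather than plain RGB):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian3D:
    position: np.ndarray    # (3,)  mean mu = (x, y, z)
    covariance: np.ndarray  # (3, 3) how the blob is stretched/rotated in space
    color: np.ndarray       # (3,)  RGB (later generalized to SH coefficients)
    alpha: float            # opacity in [0, 1]

g = Gaussian3D(
    position=np.zeros(3),
    covariance=np.diag([0.1, 0.2, 0.05]),  # an axis-aligned ellipsoid
    color=np.array([0.8, 0.1, 0.1]),
    alpha=0.9,
)
```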

113. Just as triangles are assembled to represent the entire shape, Gaussians work the same way.

114. When you combine three of them, it looks like this:

115. Attach about 7 million of these, and it becomes this:

116. We understand what a 3D Gaussian is and how it can be used for rasterization. Now, why is this better than the traditional mesh-based approach?

117. The key difference is that 3D Gaussians are differentiable. Being differentiable means optimization is easier. So, one of the reasons for choosing implicit representation is fulfilled.

118. Conversely, what’s better about it compared to using implicit representation? It’s still an explicit representation like meshes, making it suitable for rendering.

119. In any case, the core objective is to find a method that is continuous and differentiable while also allowing fast rendering. Using 3D Gaussians accomplishes this.

120. 3D Gaussians fundamentally replace triangles in triangle rasterization with Gaussians, making them explicit while preserving the advantages of volumetric rendering.

121. For these reasons, we decided to represent the scene using 3D Gaussians. Now, we need to figure out how to optimize the parameters of these 3D Gaussians and how to achieve real-time rendering of the scene.

122. We’ve transitioned to the problem of how to optimize the 3D Gaussians.

123. Firstly, what we need to optimize are 1) the position (p), 2) the opacity (alpha), 3) the covariance matrix (sigma), and 4) the color information represented by SH coefficients.

124. In a multivariate Gaussian, the covariance matrix essentially determines the spread of the distribution. As you can see in the picture below, as the values in the covariance matrix increase, the ellipse representing the distribution becomes wider. For a more detailed explanation of the parameters of the 3D Gaussian, refer to this link.
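In the 3D Gaussian Splatting paper, the covariance is not optimized as a raw 3x3 matrix, since an arbitrary matrix would not stay a valid covariance; instead it is factored into a rotation R and a scaling S,

Σ = R S Sᵀ Rᵀ

so what is actually stored and optimized are a quaternion (for R) and a 3D scale vector (for S).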

125. Now, what about SH coefficients?

126. To put it somewhat vaguely, they’re something used for storing indirect lighting data. More precisely, they are a collection of functions used for data encoding.

127. There’s an article depicting this, and it uses the scenario of filling colors in a 2D image.

128. However, if we were to represent colors at every vertex like this, we’d need to store an overwhelming number of combinations. It’s not just about having an image in green and yellow; we’d need separate storage for inverted colors and various hues.

129. Instead of this tedious approach, we can define functions using the vertex coordinates. Below, you can see how adjusting coefficients R, G, and B allows us to determine RGB values using these simple equations.

130. With these straightforward equations and adjustments to coefficients, you can represent various colors, as shown below. The point is that you can represent a 2D image using equations rather than storing actual RGB values.

131. So, SH is essentially an attempt to apply this methodology to 3D. Using the coefficients of spherical harmonics to store colors at specific points, you can calculate colors by combining these coefficients with predefined spherical harmonics functions.

132. Here, spherical harmonics refer to a set of functions defined on the surface of a sphere: they take the polar angle (θ) and azimuthal angle (φ) as input and produce an output value (P) at that point on the sphere’s surface.

Images from here

133. The formula for spherical harmonics Y looks like this:

Images from here
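For reference, one common convention for the (complex) spherical harmonics is

Y_l^m(θ, φ) = √( (2l+1)/(4π) · (l−m)!/(l+m)! ) · P_l^m(cos θ) · e^{imφ}

where P_l^m are the associated Legendre polynomials; the image above may use a slightly different normalization or the real-valued variant.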

134. By changing the values of m and l in this formula, you get various spherical harmonics functions:

Images from here

135. Represented graphically, it looks like this:

Images from here

136. In summary, spherical harmonics involve defining multiple spherical harmonics functions by varying the values of m and l. These functions take angles (θ, φ) as input and output the values at specific points on the surface of a sphere.

137. Having such functions defined on the sphere means you can also use them to represent colors as a function of viewing direction.

Images from here

138. Looking at the equations, it’s like applying a weighting factor (k) to color C by multiplying it with the spherical harmonics function Y. This is why SH “coefficients” represent colors.
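In symbols, the color seen from direction (θ, φ) is a weighted sum of the spherical harmonics basis functions, and the weights k are exactly the “SH coefficients” that get stored and optimized:

c(θ, φ) = Σ_l Σ_{m = −l..l} k_l^m · Y_l^m(θ, φ)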

139. Now that we’ve covered how to determine the parameters of a Gaussian, let’s go back to how we optimize these four parameters.

140. First, before we optimize, we need to initialize the parameters. Therefore, we start by generating a point cloud using an algorithm like COLMAP.

141. Anyway, running COLMAP on the bicycle picture we saw earlier would yield something like this:

142. Now, we initialize the parameters from this point cloud:

143. M refers to the points obtained from SfM, which are used as the means of the 3D Gaussians.

144. These points from the point cloud are used as the means (μ) for the 3D Gaussians. In other words, for each point in the point cloud, we create a 3D Gaussian with its mean.

145. S represents the 3x3 covariance matrix of the 3D Gaussian.

146. C represents the color values of the 3D Gaussians. The initial values are obtained by subtracting 0.5 from the RGB values of the points obtained from SfM and dividing by 0.28209 (the zeroth-order SH constant).

147. A represents the alpha (opacity) values of the 3D Gaussians. The initial opacity is 0.1, stored through an inverse sigmoid as ln(0.1/0.9) ≈ −2.20, so that the final value, after applying the sigmoid, lies between 0 and 1.
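As a small sketch of this initialization (following the conventions of the public 3D Gaussian Splatting implementation; the constant is the zeroth-order SH basis value C0 ≈ 0.28209):

```python
import numpy as np

C0 = 0.28209479177387814

def rgb_to_sh_dc(rgb):
    """Convert [0, 1] RGB colors to the DC (l=0) spherical-harmonics coefficient."""
    return (rgb - 0.5) / C0

def inverse_sigmoid(x):
    """Store opacity in logit space so the optimized value, after a sigmoid, stays in (0, 1)."""
    return np.log(x / (1.0 - x))

colors = np.array([[0.6, 0.4, 0.2]])        # example SfM point color
sh_dc = rgb_to_sh_dc(colors)                # initial C
alpha_init = inverse_sigmoid(0.1)           # initial A, about -2.2
print(sh_dc, alpha_init)
```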

148. Now that we’ve initialized M, S, C, and A, we need to move on to the actual optimization by running iterations. The algorithm related to this is shown in the image below.

149. First, it receives the camera pose V and the ground truth image I-hat.

150. Then, it uses the camera pose V to perform rasterization as mentioned earlier. This rasterization step is where the tile-based rasterizer, described in the paper, comes into play.

151. You can see that it takes the width and height of the image, along with M, S, C, A, and V as inputs, and starts with frustum culling based on positions and camera poses.

152. Frustum culling involves keeping only the 3D Gaussians that can be observed from the given camera V.

153. More concretely, a view frustum is constructed from the camera center, and Gaussians that fall outside the frustum (i.e., are not visible) are removed (culled).

154. Then, it uses ScreenspaceGaussian() to project 3D Gaussians into 2D.

155. Now, the reason for calling it a tile-based rasterizer becomes clear: it divides the image into tiles.

156. It starts by taking a w x h-sized image and dividing it into 16x16-pixel tiles using CreateTiles.

157. However, when tiles are divided, there are likely Gaussians that overlap both Tile 1 and Tile 2. To handle these, the DuplicateWithKeys step duplicates Gaussians that overlap as many times as necessary (duplicate) and assigns each of them a 64-bit key consisting of a 32-bit view space depth and a 32-bit tile ID.

158. Gaussians with keys are then sorted using single GPU radix sort, as shown in the SortByKeys step.

159. After SortByKeys, each tile maintains a list of Gaussians sorted by depth.

160. Next, in the IdentifyTileRange step, we identify the start and end of Gaussians with the same tile ID, and create and manage Gaussian lists for each tile based on this information.
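A toy illustration of the key/sort/range idea in 157–160; the bit layout here (tile ID in the high 32 bits, quantized depth in the low 32 bits) follows the description above and is meant to be illustrative rather than the actual CUDA implementation:

```python
import numpy as np

tile_ids = np.array([2, 0, 1, 0, 2], dtype=np.uint64)          # which tile each splat touches
depths   = np.array([0.7, 0.3, 0.9, 0.1, 0.2], dtype=np.float64)

depth_bits = (depths * (2**32 - 1)).astype(np.uint64)          # quantize depth to 32 bits
keys = (tile_ids << np.uint64(32)) | depth_bits                # 64-bit key: tile | depth

order = np.argsort(keys)                                       # "SortByKeys"
sorted_tiles = tile_ids[order]

# "IdentifyTileRange": find where each tile's run of Gaussians starts and ends.
ranges = {}
for i, t in enumerate(sorted_tiles):
    start, _ = ranges.get(int(t), (i, i))
    ranges[int(t)] = (start, i + 1)
print(order, ranges)   # per-tile, depth-sorted lists of Gaussian indices
```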

161. Now, as shown in the image below, each tile t is processed one by one.

162. For each tile t, it receives the tile range R obtained earlier, which is essentially the list of Gaussians.

163. Now, for each pixel i within a tile, it accumulates color and alpha values from front to back.
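This front-to-back accumulation is standard alpha blending: for the depth-ordered Gaussians overlapping a pixel,

C = Σ_i c_i · α_i · Π_{j<i} (1 − α_j)

where α_i is the Gaussian’s opacity multiplied by its 2D Gaussian falloff at that pixel, and blending stops early once the accumulated opacity saturates.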

164. After all this, Rasterize(M, S, C, A, V) yields the rasterized image I.

165. Now, we need to compare this image I with the ground truth image I-hat to compute the loss. The loss is calculated using the formula shown below, with the weight λ set to 0.2 in the paper.
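For reference, the loss from the 3D Gaussian Splatting paper is

L = (1 − λ) · L1 + λ · L_D-SSIM, with λ = 0.2

i.e., a weighted combination of an L1 term and a D-SSIM term between the rendered image I and the ground truth I-hat.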

166. Alright, so now that we have calculated the loss in this manner, all that’s left is to update M, S, C, and A using the Adam optimizer. In the actual experiments described in the paper, they followed these steps: 1) to ensure stable optimization, they initially performed a warm-up with images at 1/4 resolution and upsampled by a factor of 2 after 250 and 500 iterations; 2) when optimizing the spherical harmonics (SH) coefficients, they optimized only the zeroth-order band for the first 1,000 iterations and then introduced one additional SH band every 1,000 iterations until all bands were included. These are just implementation details.

167. So, we’ve completed the optimization process. Well, or so we thought.

168. There’s another ‘if’ statement coming up.

169. This phase is referred to as “adaptive control of Gaussians” in the paper. We must go through this as well.

170. In the image below, we’ve already gone through the projection and the differentiable tile rasterizer stages, comparing the resulting image with the ground truth and performing backpropagation. Now, we need to refine the 3D Gaussians further using the adaptive density control below.

171. Firstly, let’s address the ‘IsRefinementIteration(i)’ function. It checks whether iteration step ‘i’ is a refinement step. While parameters like M, S, C, and A are updated every iteration, the set of 3D Gaussians itself (adding and removing Gaussians) is adjusted only once every 100 iterations, and this check tells us when such an iteration is reached.

172. In the first ‘if’ statement, we remove the Gaussian if its alpha is lower than a specific threshold. The paper used a value of 0.005 for this purpose. This is the pruning phase.

173. Now, we need to perform densification, which falls into two categories: over-reconstruction and under-reconstruction. In both cases, the conclusion is that we need more Gaussians.

174. Over-reconstruction occurs when a single large Gaussian covers the entire space we want to represent.

175. In this case, we need to split this single Gaussian into multiple smaller ones. So, we use ‘SplitGaussian’ to divide one Gaussian into two: their scale is divided by a factor of 1.6 (determined experimentally), and the positions of the two resulting Gaussians are sampled by using the original Gaussian as a probability density function.

176. After applying split processing for over-reconstruction, the overall volume remains the same, but the number of Gaussians increases.

177. In contrast, under-reconstruction happens when there aren’t enough Gaussians to represent the desired space, resulting in blank spaces.

178. Here, we don’t split existing Gaussians. Instead, we need to add entirely new Gaussians. So, ‘CloneGaussian’ copies an existing 3D Gaussian and positions it along the gradient direction.

179. After applying clone processing for under-reconstruction, both the overall volume and the number of Gaussians increase.

180. Ultimately, during the densification process, the number of Gaussians always increases, sometimes nearly doubling. It’s like watching bacteria multiply over time.

181. To manage this, every 3,000 iterations, the alpha values of the Gaussians are reset to a value close to zero.

182. This way, while updating M, C, S, and A, alpha gradually increases on its own. When it comes time to update the 3D Gaussians after 100 iterations, Gaussians with alphas below the threshold are removed during the pruning phase.
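Putting 171–182 together, here is a compact sketch of the adaptive density control loop, written against a hypothetical gaussians container (the alpha-pruning threshold of 0.005 and the 100/3,000-iteration intervals come from the text above; the gradient threshold, reset value, and method names are assumptions for illustration):

```python
ALPHA_MIN = 0.005
GRAD_THRESHOLD = 0.0002      # assumed view-space positional-gradient threshold
REFINE_EVERY = 100
OPACITY_RESET_EVERY = 3000

def adaptive_density_control(gaussians, iteration):
    if iteration % REFINE_EVERY != 0:            # IsRefinementIteration(i)
        return
    for g in list(gaussians):
        if g.alpha < ALPHA_MIN:                  # pruning: nearly transparent Gaussians
            gaussians.remove(g)
        elif g.view_space_grad > GRAD_THRESHOLD: # densification needed
            if g.is_too_large():                 # over-reconstruction -> split
                gaussians.remove(g)
                gaussians.extend(g.split(scale_factor=1.6))
            else:                                # under-reconstruction -> clone
                gaussians.append(g.clone_along_gradient())
    if iteration % OPACITY_RESET_EVERY == 0:     # keep the Gaussian count in check
        for g in gaussians:
            g.alpha = 0.01                       # reset opacity close to zero
```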

183. That covers the entire process.

184. In summary, we start by 1) obtaining point clouds using Structure-from-Motion (SfM), 2) initializing 3D Gaussian values using these point clouds, 3) projecting them into 2D Gaussians, 4) passing them through a differentiable tile rasterizer to create image predictions, 5) comparing these predictions with the ground truth images, 6) backpropagating the loss to update the values, and 7) periodically using adaptive density control to shrink or expand the 3D Gaussians themselves for optimization.

185. This is 3D Gaussian splatting. Now, let’s return to our paper.

186. Up until now, we’ve converted the input text prompt into a point cloud using a 3D diffusion model, and then we’ve used 3D Gaussian splatting to render this point cloud into a 2D image.

187. We have completed up to the ‘splatting image’ part in the image below.

188. In the original 3D Gaussian splatting, we would calculate the loss between the rendered image and the ground truth image to update the 3D Gaussian.

189. However, in our paper, there is no ground truth 2D image to compare against.

190. So, we feed the Gaussian-splatted (rendered) image, together with the same text prompt, into a 2D diffusion model; its denoising signal tells us how to make the image more natural according to the prompt.

191. Just as the 3D diffusion model started from random noise and refined it into a point cloud, here we start from images rendered from the 3D Gaussians and gradually improve them using the 2D diffusion model.

192. This update is performed by calculating the SDS loss. It looks like this.
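For reference, the SDS gradient from DreamFusion, which is also what GaussianDreamer optimizes, is usually written as

∇_θ L_SDS = E_{t, ε} [ w(t) · ( ε̂_φ(x_t; y, t) − ε ) · ∂x/∂θ ]

where x is the image rendered from the 3D Gaussians with parameters θ, y is the text prompt, ε is the injected noise, and ε̂_φ is the noise predicted by the 2D diffusion model.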

193. “DreamFusion: Text-to-3D using 2D Diffusion” (Poole et al., 2022) showed that this score distillation sampling (SDS) loss is more stable to optimize than the original diffusion loss.

194. For reference, the original diffusion loss looks like this.
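In the same notation, the original diffusion (noise-prediction) training loss is

L_Diff = E_{t, ε} [ w(t) · || ε̂_φ(x_t; y, t) − ε ||² ]

and SDS is essentially obtained by taking its gradient with respect to θ while dropping the diffusion model’s U-Net Jacobian term.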

195. While it would be great to dive deeper into why this loss looks the way it does, I plan to explore it in a separate article when I take a closer look at the DreamFusion paper.

196. Anyway, this summarizes the entire process.

197. When the text prompt is initially received, it’s passed through the 3D diffusion model (corresponding to Shap-E in our paper) to create a point cloud that matches the prompt.

198. However, just using it as is won’t do, so we expand the points with noisy point growing and adjust their colors to match their surroundings with color perturbation.

199. We then initialize the 3D Gaussians with the resulting point cloud.

200. So far, this is the content within the left dashed box, taking about 7 seconds in total.

201. Once the 3D Gaussians are initialized, we can render them into a 2D image using the 3D Gaussian splatting technique. This is the “splatting image” part in the figure.

202. The generated image is further denoised as the 2D diffusion model progresses through timesteps.

203. This allows us to update the 3D Gaussian to create more appropriate and natural images according to the prompt.

204. This marks the content of the right dashed box, where the training process is completed.

205. Finally, by utilizing 3D Gaussian splatting, we can render the final 3D result from the optimized Gaussians. The whole generation process takes about 25 minutes.

206. When comparing our model’s performance to others, it not only takes less time but also produces significantly better quality.

207. The paper mentions ablation studies for 1) 3D Gaussian initialization, 2) noisy point growing, and 3) color perturbation, so if you’re interested, you can check those out. Ultimately, our model choice performed the best, so you can move on if you prefer.

This concludes the review, with the current content covering up to Number 207. I may add or edit information in the future to provide updates. This post is primarily for my personal understanding of the topic, and I kindly acknowledge that there may be errors or inaccuracies present. If you come across any, please feel free to bring them to my attention. Thank you for taking the time to read this, and congratulations🎉🎉!
