NumByNum :: 3D Gaussian Splatting for Real-Time Radiance Field Rendering (Kerbl et al., 2023) Reviewed

Aria Lee
29 min read · Oct 10, 2023

This review of “3D Gaussian Splatting for Real-Time Radiance Field Rendering (Kerbl et al., 2023)” begins at Number 1 and concludes at Number 201. I may make additions or revisions to the content in the future for updates. I am creating this post primarily for my personal understanding of the subject, and I humbly acknowledge that errors or inaccuracies may be present. If you happen to identify any such issues, please do not hesitate to bring them to my attention. Thank you, and hope you enjoy😊!

1. I noticed a paper with 5,000 stars constantly appearing in the top tab on Papers with Code, so I got curious and decided to look it up.

2. It turns out it’s the paper titled “3D Gaussian Splatting for Real-Time Radiance Field Rendering” by Kerbl et al., published in 2023.

3. Funny enough, I initially thought the author’s name was Kerbi, so I clicked on it to check. Now, looking again, I realize it’s not “i” but “l.”

4. Kerbl… how adorable.

5. The author hails from Inria (Institut National de Recherche en Informatique et en Automatique) and Universite Cote d’Azur, located in Nice, France.

6. And it seems the author has a thing for Gaussians…

7. Alright, enough fooling around; let’s dive into the actual paper.

8. In the field of 3D Computer Vision (3DCV), there’s a relatively recent approach called implicit representation, exemplified by NeRF (Neural Radiance Fields). I’ve covered this before when InstantNeRF was introduced.

9. Implicit representation, in essence, means representing 3D shapes not through conventional primitives like meshes, point clouds, or voxels, but rather as a function defined by a set of parameters.

10. For instance, in NeRF, if you feed specific coordinates into a neural network, it spits out the color and density of that point. Even though it doesn’t explicitly store each point, the idea is that the network’s weights encode enough information to reconstruct the 3D scene indirectly.

11. Naturally, since it’s represented as a continuous scene, it’s advantageous for optimization and has a small memory footprint. However, it’s computationally intensive and takes a long time to train. Check out the materials from the previous publication for reference. (UPDATE: You can now access the presentation slides here!)

12. Since I’ve forgotten most of it, let’s briefly revisit NeRF. NeRF involves 1) constructing training data from N 2D images and their corresponding camera poses, 2) using a fully-connected neural network that takes 5D coordinates (x, y, z, theta, phi) as input and outputs (color, density) for each point, 3) sampling 5D coordinates along the camera rays of the desired viewpoint using hierarchical sampling, passing these samples through the NN to obtain color and density values, and generating a new 2D image with volume rendering, and 4) optimizing the NN by calculating the loss against the ground truth.
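
As a rough illustration of the volume-rendering step in 3), here is a minimal NumPy sketch of how per-sample colors and densities along one ray are alpha-composited into a pixel color (function and variable names are mine, not NeRF’s code):

```python
import numpy as np

def volume_render(colors, densities, deltas):
    """Composite per-sample (color, density) values along one ray, NeRF-style.

    colors:    (N, 3) RGB predicted at each sample
    densities: (N,)   volume density sigma at each sample
    deltas:    (N,)   distance between consecutive samples
    """
    alphas = 1.0 - np.exp(-densities * deltas)                      # opacity of each segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))  # transmittance T_i
    weights = trans * alphas                                        # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)                  # final pixel color

# toy usage: 64 samples along a single ray
pixel_rgb = volume_render(np.random.rand(64, 3), np.random.rand(64), np.full(64, 0.05))
```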

13. I’ve written extensively about this in the telegram NeRF channel, so if there’s something you don’t understand, please read that quickly. Honestly, I don’t remember any of it.

14. Anyway, since NeRF gained so much popularity, there has been an influx of NeRF variants over the past few years.

15. As I’ve summarized before, the main improvements have targeted 1) faster inference and 2) faster training, with various research topics branching out into 3) deformable 4) video 5) generalization 6) pose estimation 7) lighting 8) compositionality 9) editing 10) multi-scale, and more.

16. The models compared in the teaser figure of the Kerbl paper are InstantNGP, Plenoxels, and Mip-NeRF360.

17. InstantNGP is a notable paper aiming for 2) faster training.

18. Calculating MLP output values for every given 3D coordinate individually is computationally intensive and slow to train. To address this, they encode coordinates using a hash map and linear interpolation.

19. The process involves 1) selecting the surrounding grid corners (4 in the 2D illustration, 8 in 3D) for a given input coordinate x at each grid level, 2) creating hash keys from the selected corners, 3) reading feature vectors from a hash table, 4) weighting the corner features based on the distance between x and each corner, 5) concatenating the feature vectors from all grid levels with auxiliary values to obtain the final feature vector, and 6) using this, instead of the raw coordinates, as the input to the MLP.
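
Here is a minimal sketch of that lookup for a single 2D grid level, assuming the spatial-hash-with-primes scheme from the Instant-NGP paper; the table size, feature width, and function names are illustrative:

```python
import numpy as np

PRIMES = (1, 2654435761)  # per-dimension primes used by Instant-NGP's spatial hash (2D case)

def hash_corner(ix, iy, table_size):
    """Hash an integer grid corner to an index into the feature table."""
    return ((ix * PRIMES[0]) ^ (iy * PRIMES[1])) % table_size

def encode_level(x, y, table, grid_res):
    """Bilinearly interpolate hashed corner features at one 2D point in [0, 1]^2."""
    gx, gy = x * grid_res, y * grid_res
    x0, y0 = int(gx), int(gy)
    tx, ty = gx - x0, gy - y0
    feat = 0.0
    for dx, wx in ((0, 1 - tx), (1, tx)):
        for dy, wy in ((0, 1 - ty), (1, ty)):
            feat = feat + wx * wy * table[hash_corner(x0 + dx, y0 + dy, len(table))]
    return feat  # per-level feature; all levels are concatenated before the small MLP

table = np.random.randn(2**14, 2)              # one hash-table level with 2-dim features
print(encode_level(0.3, 0.7, table, grid_res=16))
```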

20. I did a full presentation on this, so I’ll just skip the details here.

21. Plenoxels also falls into the 2) faster training category.

22. What’s important to note in this paper is that, unlike other NeRF-related papers, they didn’t use neural networks. They argue that NeRF’s success is not due to neural networks but rather the differentiable volume renderer.

23. The pipeline is quite simple, really.

24. First, similar to the voxel grid-based representation used in InstantNGP, they represent the entire scene with a sparse voxel grid. Each corner of this voxel grid stores a plenoptic representation.

25. It looks something like the image below:

26. Plenoptic representation refers to spherical harmonics coefficients and opacity at each position.

27. I’ll skip the details on spherical harmonics coefficients for now; we’ll get to that later. In simple terms, spherical harmonics are a set of basis functions on the sphere, used in graphics to compute and represent, in real time, light arriving from various directions. In other words, they’ve been used to represent (view-dependent) colors.

28. If you want to dive deeper, jump ahead to number 73 for a moment.

29. Anyway, by trilinearly interpolating the stored plenoptic information for each voxel, you can calculate view-dependent colors and use that for rendering, which is the main point of the paper.

30. So, what’s the efficiency point of Plenoxels?

31. It’s interpolation.

32. By interpolating the information stored in the sparse voxel grid, you can calculate plenoptic representations at any point efficiently. This means you can obtain information from arbitrary points, making it continuous.
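
To make the interpolation concrete, here is a minimal NumPy sketch of trilinear interpolation over a dense voxel grid of per-corner values (grid shape and feature layout are illustrative; Plenoxels actually uses a sparse grid):

```python
import numpy as np

def trilerp(grid, x, y, z):
    """Trilinearly interpolate per-corner values (e.g. SH coefficients + opacity)
    at a continuous point (x, y, z) given in grid units."""
    x0, y0, z0 = int(x), int(y), int(z)
    tx, ty, tz = x - x0, y - y0, z - z0
    out = 0.0
    for dx, wx in ((0, 1 - tx), (1, tx)):
        for dy, wy in ((0, 1 - ty), (1, ty)):
            for dz, wz in ((0, 1 - tz), (1, tz)):
                out = out + wx * wy * wz * grid[x0 + dx, y0 + dy, z0 + dz]
    return out

# toy grid: 32^3 voxels, each corner storing 28 values (9 SH coefficients x 3 channels + opacity)
grid = np.random.randn(33, 33, 33, 28)
print(trilerp(grid, 10.4, 3.7, 21.2).shape)  # (28,)
```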

33. You can find more details and figures explaining this here.

34. Finally, Mip-NeRF360 can be classified as belonging to the 10) multi-scale category.

35. If you read the NeRF paper, you’ll notice that the distance between the camera and objects is generally constant.

36. This implies that for NeRF to produce accurate outputs, the camera and objects need to be within a specific distance range.

37. When the camera positions and distances of the views to be synthesized are similar to those of the training views, and the content stays within a limited region, they call it a bounded scene. On the contrary, when you need to create new images regardless of direction or distance, it’s called an unbounded scene.

38. NeRF performs well for bounded scenes but isn’t well-suited for unbounded scenes since it was trained on datasets with similar camera settings.

39. So there are limitations that need to be addressed; in other words, the things that have held back unbounded-scene synthesis until now need improvement.

40. Firstly, it’s related to parameterization.

41. When objects are close to the camera, they require more detailed representation. Conversely, when objects are far away, they don’t need as much detail. When you have a consistent distance, this isn’t an issue, but when there’s distance variation, it’s appropriate to allocate more parameter capacity to objects closer to the camera and less to those farther away.

42. In terms of efficiency, for unbounded scenes, you need larger MLPs because they cover larger areas and require more detail.

43. Now, to address these issues, when transitioning from NeRF to Mip-NeRF and Mip-NeRF 360, there are two main changes: 1) the encoding method and 2) the sampling method.

44. In NeRF, they used positional encoding to increase the dimensionality of coordinate values. Mip-NeRF and Mip-NeRF 360 use cone tracing techniques to overcome the limitation of NeRF being trained on fixed-scale photos taken from a similar distance.

45. I’m not entirely sure about the details, but they’re essentially casting cones and encoding conical frustums instead of individual points.

46. Regarding sampling, while NeRF (and Mip-NeRF) performed uniform coarse sampling followed by fine sampling based on a probability distribution, Mip-NeRF 360 shifts to sampling uniformly in a normalized, contracted space (s-space) defined from the camera distance, with resampling driven by the alpha-compositing weights. If you’re curious, read the paper, because I didn’t look into it closely.
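
For reference, the scene contraction that s-space relies on is simple to write down; here is a minimal sketch of the Mip-NeRF 360 contraction function (the surrounding sampling machinery is omitted):

```python
import numpy as np

def contract(x):
    """Mip-NeRF 360 scene contraction: points inside the unit ball are unchanged,
    points outside are squashed so the whole unbounded scene fits inside radius 2."""
    norm = np.linalg.norm(x)
    if norm <= 1.0:
        return x
    return (2.0 - 1.0 / norm) * (x / norm)

print(contract(np.array([0.3, 0.2, 0.1])))    # unchanged
print(contract(np.array([100.0, 0.0, 0.0])))  # ~[1.99, 0, 0]
```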

47. Anyway, let’s get back to our paper discussion.

48. Ultimately, in the main figure of our paper where these three models are compared, the authors are trying to convey that we achieve 1) rendering quality similar to or better than the state-of-the-art Mip-NeRF360 at high resolution (1920x1080), 2) training times competitive with the fastest prior method, InstantNGP, and 3) real-time rendering at over 100 FPS.

49. By the way, you’ll come across the terms “rasterization” and “rendering” frequently. Rasterization refers to the process of mapping primitives in a scene onto pixels on the screen. Rendering, on the other hand, encompasses the entire process of turning 3D content into a 2D image on the screen. To be precise, rasterization is one step within the broader rendering process.

50. So, what is 3D Gaussian splatting, and why did it emerge?

51. Going back to point 9 for a moment, the most common way to represent 3D shapes was through explicit representation like meshes, point clouds, or voxels. They had the advantage of allowing fast training and rendering, and it was easier to retrieve features from explicit data structures.

52. On the other hand, implicit representation, as seen in NeRF, involved stochastic sampling, making rendering computationally intensive. However, it excelled in optimization due to its continuity.

53. 3D Gaussian splatting is the fusion of these two advantages.

54. 3D Gaussian splatting represents the scene using Gaussians, much like how meshes use triangles for triangle rasterization.

55. While a triangle is defined by just three vertex coordinates, a Gaussian needs its own set of parameters.

56. So, they use four kinds of parameters to represent each Gaussian: position (x, y, z), a 3x3 covariance (how it’s stretched and rotated), color (RGB), and alpha (opacity).
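
As a rough picture of what gets stored per splat, here is an illustrative container (names and shapes are mine, not the paper’s code; color is written as SH coefficients since that is how the method ultimately stores it, see 70 and 73):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian3D:
    """One splat's learnable parameters (illustrative layout)."""
    position: np.ndarray   # (3,)    mean mu = (x, y, z)
    scale: np.ndarray      # (3,)    per-axis scale; combined with rotation into the 3x3 covariance
    rotation: np.ndarray   # (4,)    unit quaternion encoding the orientation
    sh_coeffs: np.ndarray  # (16, 3) spherical-harmonics coefficients for view-dependent color
    opacity: float         # alpha in (0, 1)
```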

57. Just as triangles are assembled to represent the entire shape, Gaussians work the same way.

58. When you combine three of them, it looks like this:

59. Attach about 7 million of these, and it becomes this:

60. We understand what a 3D Gaussian is and how it can be used for rasterization. Now, why is this better than the traditional mesh-based approach?

61. The key difference is that 3D Gaussians are differentiable. Being differentiable means optimization is easier. So, one of the reasons for choosing implicit representation is fulfilled.

62. Conversely, what’s better about it compared to using implicit representation? It’s still an explicit representation like meshes, making it suitable for rendering.

63. “Explicit representations like meshes are efficient for rendering in a differentiable manner but struggle with handling topology changes. Implicit representations like signed-distance functions, on the other hand, handle topology changes better but are more challenging to use for physics-based differentiable rendering.” Refer to this for more details.

64. In any case, the core objective is to find a method that is continuous and differentiable while also allowing fast rendering. Using 3D Gaussians accomplishes this.

65. 3D Gaussians fundamentally replace triangles in triangle rasterization with Gaussians, making them explicit while preserving the advantages of volumetric rendering.

66. “Our goal is to optimize a scene representation that allows high-quality novel view synthesis, starting from a sparse set of (SfM) points without normals. To do this, we need a primitive that inherits the properties of differentiable volumetric representations while being unstructured and explicit to allow very fast rendering. We choose 3D Gaussians, which are differentiable and can be easily projected to 2D splats, allowing fast alpha-blending for rendering.”

67. “Our choice of a 3D Gaussian primitive preserves properties of volumetric rendering for optimization while directly allowing fast splat-based rasterization. Our work demonstrates that — contrary to widely accepted opinion — a continuous representation is not strictly necessary to allow fast and high-quality radiance field training.”

68. For these reasons, we decided to represent the scene using 3D Gaussians. Now, we need to figure out how to optimize the parameters of these 3D Gaussians and how to achieve real-time rendering of the scene.

69. We’ve transitioned to the problem of how to optimize the 3D Gaussians.

70. Firstly, what we need to optimize are 1) the position (p), 2) transparency (alpha), 3) the covariance matrix (sigma), and 4) the color information represented by SH coefficients.

71. In a multivariate Gaussian, the covariance matrix essentially determines the spread of the distribution. As you can see in the picture below, as the values in the covariance matrix increase, the ellipse representing the distribution becomes wider. For a more detailed explanation of the parameters of the 3D Gaussian, refer to this link.

72. And, of course, the covariance matrix must be symmetric and positive definite. It’s an obvious requirement, but you can check the proof here.

73. Now, what about SH coefficients?

74. To put it somewhat vaguely, they’re something used for storing indirect lighting data. More precisely, they are a collection of functions used for data encoding.

75. There’s an article depicting this, and it uses the scenario of filling colors in a 2D image.

76. However, if we were to represent colors at every vertex like this, we’d need to store an overwhelming number of combinations. It’s not just about having an image in green and yellow; we’d need separate storage for inverted colors and various hues.

77. Instead of this tedious approach, we can define functions using the vertex coordinates. Below, you can see how adjusting coefficients R, G, and B allows us to determine RGB values using these simple equations.

78. With these straightforward equations and adjustments to coefficients, you can represent various colors, as shown below. The point is that you can represent a 2D image using equations rather than storing actual RGB values.

79. So, SH is essentially an attempt to apply this methodology to 3D. Using the coefficients of spherical harmonics to store colors at specific points, you can calculate colors by combining these coefficients with predefined spherical harmonics functions.

80. Here, spherical harmonics refer to a set of functions that take the polar angle (θ) and azimuthal angle (φ) as inputs and produce an output value at the corresponding point on a sphere’s surface.

81. The formula for a spherical harmonic Y_l^m(θ, φ) is N_l^m · P_l^m(cos θ) · e^(imφ), where P_l^m is the associated Legendre polynomial and N_l^m is a normalization constant.

82. By changing the values of m and l in this formula, you get various spherical harmonics functions:

83. Represented graphically, it looks like this:

84. In summary, spherical harmonics involve defining multiple spherical harmonics functions by varying the values of m and l. These functions take angles (θ, φ) as input and output the values at specific points on the surface of a sphere.

85. Once you have this basis, it means you can also represent colors on the sphere using it.

86. Looking at the equations, the color C is written as a weighted sum of the spherical harmonics functions Y, with weights k. That’s why it is the SH “coefficients” that represent colors.
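
Here is a minimal sketch of that evaluation for degree-1 SH, assuming the common real-SH basis constants and sign convention (and the usual trick of offsetting by 0.5 so coefficients can be stored around zero, as in 103); treat the exact layout as an assumption:

```python
import numpy as np

C0, C1 = 0.28209479, 0.48860251   # real SH basis constants for degrees 0 and 1

def sh_to_rgb(sh, view_dir):
    """Evaluate view-dependent color from degree-1 SH coefficients.

    sh:       (4, 3) coefficients (1 band-0 term + 3 band-1 terms, per RGB channel)
    view_dir: (3,)   unit vector from the camera toward the Gaussian
    """
    x, y, z = view_dir
    basis = np.array([C0, -C1 * y, C1 * z, -C1 * x])   # one common real-SH sign convention
    return np.clip(basis @ sh + 0.5, 0.0, 1.0)         # offset by 0.5, clamp to valid RGB

rgb = sh_to_rgb(np.random.randn(4, 3) * 0.1, np.array([0.0, 0.0, 1.0]))
```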

87. Now that we’ve covered how to determine the parameters of a Gaussian, let’s go back to how we optimize these four parameters.

88. First, before we optimize, we need to initialize the parameters. Therefore, we start by generating a point cloud using an algorithm like COLMAP.

89. COLMAP is a widely used Structure-from-Motion (SfM) pipeline (it also does multi-view stereo). SfM reconstructs the 3D structure of a scene and the camera poses from multi-view images of the same object taken from different angles.

90. While it’s conceptually similar to SLAM (Simultaneous Localization and Mapping), SLAM aims for real-time performance, so it’s lighter and computationally less intensive than SfM, which focuses on accuracy at the expense of computation.

91. The details of COLMAP are well explained in the paper and in this article. If you’re interested, take a look before coming back.

92. Anyway, running COLMAP on the bicycle picture we saw earlier would yield something like this:

93. Now, we initialize the parameters from this point cloud:

94. M refers to the points obtained from SfM, which are used as the means of the 3D Gaussians.

95. In other words, for each point in the point cloud, we create a 3D Gaussian centered at that point, using it as the mean (μ).

96. S represents the 3x3 covariance matrix of the 3D Gaussian.

97. In point 70, the covariance matrix sigma was mentioned. This 3D Gaussian covariance matrix sigma is decomposed as Σ = R S Sᵀ Rᵀ, where S is the scale matrix and R is the rotation matrix.

98. The scale matrix holds scaling information along the x, y, and z axes. Its initial values come from the square root of the (mean squared) distance to each point’s nearest neighbors in the point cloud; these values are log-transformed and replicated along the three axes to form the diagonal of the 3x3 scale matrix.

99. The rotation is represented with a quaternion. The 4-component quaternion is converted into a 3x3 rotation matrix R when the covariance is needed. The initial value for each point is the identity rotation: a 4-vector with the first (real) component set to 1 and the others to 0.

100. By the way, I’ve had some previous exposure to quaternions when I was curious about SLAM. I found an article and learned about it briefly. In a nutshell, quaternions provide a way to represent 3D rotations using a 4-dimensional vector (1 real part + 3 imaginary parts) instead of the conventional Euler angles.

101. I won’t go into the details again, but this video provides a very clear visualization, so be sure to check it out.

102. Anyway, regarding the covariance, the reason for composing it as Σ = R S Sᵀ Rᵀ (multiplying the obtained R and S matrices by their respective transposes) is as mentioned in 72: built this way, the covariance matrix is guaranteed to be symmetric and positive semi-definite.
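
A minimal sketch of that construction, assuming the standard quaternion-to-rotation-matrix conversion (function names are mine):

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a unit quaternion (w, x, y, z) into a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def covariance(scale, quat):
    """Sigma = R S S^T R^T, symmetric positive semi-definite by construction."""
    R = quat_to_rotmat(quat)
    S = np.diag(scale)
    return R @ S @ S.T @ R.T

sigma = covariance(np.array([0.5, 0.1, 0.1]), np.array([1.0, 0.0, 0.0, 0.0]))
print(np.allclose(sigma, sigma.T))  # True: symmetric
```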

103. C represents the color values of the 3D Gaussians, stored as the SH coefficients from 73. The initial values are obtained by subtracting 0.5 from the RGB values of the SfM points and dividing by 0.28209, the zeroth-order SH basis constant (1/(2√π)).

104. A represents the alpha (opacity) values of the 3D Gaussians. The stored value is initialized to the inverse sigmoid of 0.1, i.e., log(0.1/0.9) (≈ -2.2 with the natural log), and the actual opacity, obtained by passing it through a sigmoid, therefore starts at 0.1 and always stays between 0 and 1.
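
Putting 93–104 together, here is a rough sketch of the initialization from an SfM point cloud. The structure follows my reading of the public reference code; the exact constants and nearest-neighbor rule should be treated as assumptions:

```python
import numpy as np

C0 = 0.28209479  # zeroth-order SH constant, 1 / (2 * sqrt(pi))

def init_from_sfm(points, colors):
    """points: (N, 3) SfM positions; colors: (N, 3) RGB in [0, 1]."""
    means = points.copy()                                          # M: Gaussian means

    # S: isotropic initial scale from the mean distance to the 3 nearest neighbors, kept in log space
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn_dist = np.sort(d, axis=1)[:, :3].mean(axis=1)
    log_scales = np.tile(np.log(nn_dist)[:, None], (1, 3))

    rotations = np.tile([1.0, 0.0, 0.0, 0.0], (len(points), 1))    # identity quaternions
    sh_dc = (colors - 0.5) / C0                                    # C: zero-order SH coefficient from RGB
    opacities = np.full((len(points), 1), np.log(0.1 / 0.9))       # A: inverse sigmoid of 0.1
    return means, log_scales, rotations, sh_dc, opacities
```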

105. Now that we’ve initialized M, S, C, and A, we need to move on to the actual optimization by running iterations. The algorithm related to this is shown in the image below.

106. First, it receives the camera pose V and the ground truth image I-hat.

107. Then, it uses the camera pose V to perform rasterization as mentioned earlier. This rasterization step is where the tile-based rasterizer, described in the paper, comes into play.

108. You can see that it takes the width and height of the image, along with M, S, C, A, and V as inputs, and starts with frustum culling based on positions and camera poses.

109. Frustum culling involves keeping only the 3D Gaussians that can be observed from the given camera V.

110. Concretely, a viewing frustum is constructed from the camera center, and Gaussians that fall outside it, and hence are not visible, are removed (culled).

111. “Our method starts by splitting the screen into 16×16 tiles, and then proceeds to cull 3D Gaussians against the view frustum and each tile. Specifically, we only keep Gaussians with a 99% confidence interval intersecting the view frustum. Additionally, we use a guard band to trivially reject Gaussians at extreme positions (i.e., those with means close to the near plane and far outside the frustum), since computing their 2D covariance would be unstable.”

112. Then, it uses ScreenspaceGaussian() to project 3D Gaussians into 2D.

113. Then how do you project a 3D Gaussian into 2D? The formula for transforming the covariance matrix from the world coordinate system (the 3D Gaussian) to the image coordinate system (the 2D Gaussian), given on page 4 of the paper, is Σ' = J W Σ Wᵀ Jᵀ.

114. The reason the formula looks like this can be summarized as follows.

115. First, J is the Jacobian of the affine approximation of the projective transformation that maps camera coordinates to image coordinates. You can find how it looks here.

116. W is the viewing transformation matrix that converts the world coordinate system into the camera coordinate system. This is related to camera calibration, which I studied when I was looking into NeRF. I vaguely remember studying this with the help of this video, this article and this article. It’s not a very long read, so you might want to revisit it.

117. In short, the matrix that transforms world coordinates into camera coordinates (extrinsic parameters) and then into image coordinates (intrinsic parameters) is composed of two main parts.

118. Extrinsic parameters and intrinsic parameters.

119. Extrinsic parameters refer to the 3D translation and 3D rotation of the camera in space, which determine where the camera is located and where it’s pointing in 3D space. The matrix size is 4x4.

120. Intrinsic parameters refer to the transformation of camera coordinates into image coordinates, which depend on factors such as the camera lens and sensor position. It determines how much the image panel moves (2D translation), how much it scales (2D scaling), and how much it tilts (2D shear). The matrix size is 3x3, and it is related to camera focal length and the principal point’s position.

121. Anyway, by using these two matrices, we obtain the matrix W, which transforms world coordinates into camera coordinates, and then J, which further transforms the camera coordinates into image coordinates. This sequence allows us to obtain the covariance of 3D Gaussians in camera coordinates, which is then transformed into image coordinates.

122. To summarize, the formula in 113 takes the covariance matrix sigma in the world coordinate system, transforms it into camera coordinates using W and Wᵀ, and then into image coordinates using J and Jᵀ. After this process, 3D Gaussians in the world coordinate system are projected into 2D Gaussians in the image coordinate system.
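
Here is a minimal sketch of that projection. I use a compact 2x3 Jacobian of the perspective projection evaluated at the Gaussian’s mean, which directly yields the 2x2 screen-space covariance (the EWA/paper formulation keeps a 3x3 J and takes the upper-left 2x2 block); function and parameter names are mine:

```python
import numpy as np

def project_covariance(sigma_world, W, focal_x, focal_y, mean_cam):
    """Sigma' = J W Sigma W^T J^T.

    sigma_world: (3, 3) covariance in world coordinates
    W:           (3, 3) rotation part of the world-to-camera (viewing) transform
    mean_cam:    (3,)   Gaussian mean in camera coordinates (x, y, z), z > 0
    """
    x, y, z = mean_cam
    # Jacobian of the perspective projection (fx*x/z, fy*y/z), evaluated at the mean
    J = np.array([
        [focal_x / z, 0.0,         -focal_x * x / z**2],
        [0.0,         focal_y / z, -focal_y * y / z**2],
    ])
    cov_cam = W @ sigma_world @ W.T
    return J @ cov_cam @ J.T     # (2, 2) covariance of the projected 2D splat

cov2d = project_covariance(np.diag([0.04, 0.01, 0.01]), np.eye(3), 1000.0, 1000.0,
                           np.array([0.2, -0.1, 3.0]))
```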

123. Now, the reason for calling it a tile-based rasterizer becomes clear: it divides the image into tiles.

124. It starts by taking a w x h-sized image and dividing it into 16x16-pixel tiles using CreateTiles.

125. However, when tiles are divided, there are likely Gaussians that overlap both Tile 1 and Tile 2. To handle these, the DuplicateWithKeys step duplicates Gaussians that overlap as many times as necessary (duplicate) and assigns each of them a 64-bit key consisting of a 32-bit view space depth and a 32-bit tile ID.

126. Gaussians with keys are then sorted with a single GPU radix sort, as shown in the SortByKeys step.
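
One plausible way to pack the key from 125, following the paper’s description (tile ID and view-space depth combined into 64 bits so that a single radix sort orders Gaussians by tile first and by depth within each tile); the exact bit layout here is my assumption:

```python
import struct

def make_key(tile_id, view_depth):
    """Pack tile ID into the upper 32 bits and the float depth's bit pattern into the
    lower 32 bits. Positive IEEE-754 floats compare correctly as unsigned integers,
    so sorting the packed keys sorts by tile, then by depth."""
    depth_bits = struct.unpack("<I", struct.pack("<f", view_depth))[0]
    return (tile_id << 32) | depth_bits

keys = sorted(make_key(t, d) for t, d in [(3, 2.5), (1, 0.7), (1, 4.2), (3, 1.1)])
# sorted order: tile 1 (depths 0.7, 4.2), then tile 3 (depths 1.1, 2.5)
```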

127. Notably, there’s no additional per-pixel ordering of points because blending is performed based on the initial sorting. This choice greatly enhances training and rendering performance without producing visible artifacts in converged scenes, especially as splats approach the size of individual pixels.

128. “Note that there is no additional per-pixel ordering of points, and blending is performed based on this initial sorting. As a consequence, our 𝛼-blending can be approximate in some configurations. However, these approximations become negligible as splats approach the size of individual pixels. We found that this choice greatly enhances training and rendering performance without producing visible artifacts in converged scenes.”

129. After SortByKeys, each tile maintains a list of Gaussians sorted by depth.

130. Next, in the IdentifyTileRange step, we identify the start and end of Gaussians with the same tile ID, and create and manage Gaussian lists for each tile based on this information.

131. Now, as shown in the image below, each tile t is processed one by one.

132. For each tile t, it receives the tile range R obtained earlier, which is essentially the list of Gaussians.

133. Now, for each pixel i within a tile, it accumulates color and alpha values from front to back, maximizing parallelism for data loading/sharing and processing. When a pixel reaches a target saturation of alpha (i.e., alpha goes to 1), the corresponding thread stops. At regular intervals, threads within a tile are queried, and processing of the entire tile terminates when all pixels have saturated.

134. “For rasterization, we launch one thread block for each tile. Each block first collaboratively loads packets of Gaussians into shared memory and then, for a given pixel, accumulates color and 𝛼 values by traversing the lists front-to-back, maximizing the gain in parallelism both for data loading/sharing and processing. When we reach a target saturation of 𝛼 in a pixel, the corresponding thread stops. At regular intervals, threads in a tile are queried and the processing of the entire tile terminates when all pixels have saturated (i.e., 𝛼 goes to 1).”
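
A minimal sketch of the per-pixel front-to-back compositing described in 133–134, for a single pixel and its depth-sorted list of splats (the early-stop threshold and names are illustrative):

```python
import numpy as np

def blend_pixel(splat_colors, splat_alphas, stop_T=1e-4):
    """Front-to-back alpha compositing over the depth-sorted splats covering one pixel.

    splat_colors: (K, 3) colors of the splats, ordered near to far
    splat_alphas: (K,)   per-pixel alpha of each splat (opacity times 2D Gaussian falloff)
    """
    color = np.zeros(3)
    T = 1.0                      # remaining transmittance
    for c, a in zip(splat_colors, splat_alphas):
        color += T * a * c
        T *= (1.0 - a)
        if T < stop_T:           # pixel has saturated (accumulated alpha ~ 1): stop early
            break
    return color

pixel = blend_pixel(np.random.rand(5, 3), np.array([0.8, 0.5, 0.3, 0.2, 0.1]))
```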

135. To perform the backward pass of blending in the same per-tile order, we reuse the Gaussian lists that were sorted earlier, this time traversing them back-to-front.

136. “During the backward pass, we must therefore recover the full sequence of blended points per-pixel in the forward pass. One solution would be to store arbitrarily long lists of blended points per-pixel in global memory. To avoid the implied dynamic memory management overhead, we instead choose to traverse the per-tile lists again; we can reuse the sorted array of Gaussians and tile ranges from the forward pass. To facilitate gradient computation, we now traverse them back-to-front.”

137. After all this, Rasterize(M, S, C, A, V) yields the rasterized image I.

138. Now, we need to compare this image I with the ground truth image I-hat to compute the loss. The loss is L = (1 − λ)·L1 + λ·L_D-SSIM, with λ set to 0.2 in the paper.
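
A minimal sketch of that loss, treating D-SSIM as (1 − SSIM)/2, which is my reading of the usual convention; pass in any SSIM implementation (in practice a differentiable one):

```python
import numpy as np

def gaussian_splatting_loss(pred, gt, ssim_fn, lam=0.2):
    """L = (1 - lambda) * L1 + lambda * D-SSIM, with lambda = 0.2 as in the paper.

    pred, gt: (H, W, 3) images in [0, 1]
    ssim_fn:  any SSIM implementation returning a scalar in [-1, 1]
    """
    l1 = np.abs(pred - gt).mean()
    d_ssim = (1.0 - ssim_fn(pred, gt)) / 2.0
    return (1.0 - lam) * l1 + lam * d_ssim
```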

139. The L1 loss is clear, and the D-SSIM loss is included to incorporate structural similarity: it is a structural dissimilarity term derived from SSIM (commonly computed as (1 − SSIM)/2). There seems to be an implementation on GitHub.

140. For more details on SSIM, you can briefly skip to 166 and come back later.

141. Alright, so now that we have calculated the loss in this manner, all that’s left is to update M, S, C, and A using the Adam optimizer. In the actual experiments described in the paper, they followed these steps: 1) To keep optimization stable, they warm up with images at 1/4 resolution and double the resolution after 250 and again after 500 iterations. 2) When optimizing the spherical harmonics (SH) coefficients, they start with only the degree-0 band and add one more band every 1,000 iterations until all four bands (degree 0 through 3) are active. These are just implementation details.

142. So, we’ve completed the optimization process. Well, or so we thought.

143. There’s another ‘if’ statement coming up.

144. This phase is referred to as “adaptive control of Gaussians” in the paper. We must go through this as well.

145. In the image below, we’ve already gone through the projection and the differentiable tile rasterizer stages, comparing the resulting image with the ground truth and performing backpropagation. Now, we need to refine the 3D Gaussians further using the adaptive density control below.

146. Firstly, let’s address the ‘IsRefinementIteration(i)’ function. It checks whether iteration step ‘i’ requires refinement. While parameters like M, S, C, and A are updated every iteration, the set of Gaussians itself (adding and removing Gaussians) is adjusted only once every 100 iterations. This check tells us when such an iteration is reached.

147. In the first ‘if’ statement, we remove the Gaussian if its alpha is lower than a specific threshold. The paper used a value of 0.005 for this purpose. This is the pruning phase.

148. Now, we need to perform densification, which falls into two categories: over-reconstruction and under-reconstruction. In both cases, the conclusion is that we need more Gaussians.

149. Over-reconstruction occurs when a single large Gaussian covers the entire space we want to represent.

150. In this case, we need to split this single Gaussian into multiple smaller ones. So, we use ‘SplitGaussian’ to divide one Gaussian into two. The scales are divided by an experimentally determined factor of 1.6, and the positions of the two resulting Gaussians are sampled using the initial Gaussian’s probability density.

151. After applying split processing for over-reconstruction, the overall volume remains the same, but the number of Gaussians increases.
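
A minimal sketch of that split rule, assuming the factor 1.6 from the paper and sampling the two new centers from the original Gaussian’s own density (parameter names are mine):

```python
import numpy as np

def split_gaussian(mean, scale, cov, phi=1.6, rng=np.random.default_rng()):
    """Replace one over-large Gaussian by two smaller ones: new means are drawn from
    the original Gaussian treated as a PDF, and the scales shrink by a factor of phi."""
    new_means = rng.multivariate_normal(mean, cov, size=2)   # sample 2 positions from the original PDF
    new_scale = scale / phi                                  # both children are 1.6x smaller
    return [(m, new_scale) for m in new_means]

children = split_gaussian(np.zeros(3), np.array([0.4, 0.1, 0.1]),
                          np.diag([0.16, 0.01, 0.01]))
```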

152. In contrast, under-reconstruction happens when there aren’t enough Gaussians to represent the desired space, resulting in blank spaces.

153. Here, we don’t split existing Gaussians. Instead, we need to add entirely new Gaussians. So, ‘CloneGaussian’ copies an existing 3D Gaussian and positions it along the gradient direction.

154. After applying clone processing for under-reconstruction, both the overall volume and the number of Gaussians increase.

155. Ultimately, during the densification process, the number of Gaussians always increases, sometimes nearly doubling. It’s like watching bacteria multiply over time.

156. To manage this, every 3,000 iterations, the alpha values of the Gaussians are reset to values close to zero.

157. This way, while updating M, C, S, and A, alpha gradually increases again for the Gaussians that are actually needed. When it comes time to refine the 3D Gaussians at the next 100-iteration step, the ones whose alphas stayed below the threshold are removed during the pruning phase.

158. That covers the entire process.

159. In summary, we start by 1) obtaining point clouds using Structure-from-Motion (SfM), 2) initializing 3D Gaussian values using these point clouds, 3) projecting them into 2D Gaussians, 4) passing them through a differentiable tile rasterizer to create image predictions, 5) comparing these predictions with the ground truth images, 6) backpropagating the loss to update the values, and 7) periodically using adaptive density control to shrink or expand the 3D Gaussians themselves for optimization.

160. So, how well does this all perform in practice?

161. The Mip-NeRF360 dataset comprises high-resolution photos captured outdoors and indoors, featuring scenes like bicycles, gardens, and kitchens. It includes nine scenes, five outdoor and four indoor, often with a complex central object and a detailed background. The images have a resolution of 4946x3286. Notably, the paper’s key figures, such as those depicting bicycles and potted plants, are sourced from this dataset.

162. The Tanks&Temples dataset, despite what the name might suggest, is a benchmark of real captured scenes; the paper uses its Truck and Train scenes. You can find detailed information about its composition in the paper.

163. The Deep Blending dataset comes from the paper of the same name and contains approximately 2,630 images across 19 scenes; the experiments here use two of its scenes.

164. Additionally, there’s a brief summary of other datasets frequently used in NeRF-related papers, which you can explore further if needed.

165. Now, let’s talk about the evaluation metrics: SSIM, PSNR, and LPIPS.

166. SSIM stands for Structural Similarity Index Measure. It assesses the similarity between two images in terms of luminance, contrast, and structure.

167. To break it down further, when we compare an original image (X) with a target image (Y), SSIM combines three terms: 1) a luminance term that compares the mean pixel values, 2) a contrast term that compares the standard deviations, and 3) a structure term that compares how the deviations of pixel values from their means are correlated between the two images.

168. However, SSIM has limitations, especially when dealing with blurred images. Even with significant blurring, SSIM can still yield high scores. In any case, higher SSIM scores indicate better similarity.

169. PSNR stands for Peak Signal-to-Noise Ratio, used to evaluate the amount of image quality loss. Higher PSNR values indicate better image quality, just like SSIM.

170. Looking at the equation, PSNR = 10·log10(MAX²/MSE), so as the MSE, i.e., the pixel-wise error between the original image and the comparison image, decreases, the PSNR value increases.
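
In code form (a trivial sketch, assuming images normalized to [0, 1]):

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """PSNR = 10 * log10(MAX^2 / MSE); higher is better."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

clean = np.random.rand(64, 64, 3)
noisy = np.clip(clean + np.random.normal(0, 0.01, clean.shape), 0, 1)
print(psnr(noisy, clean))  # roughly 40 dB for 1% Gaussian noise
```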

171. LPIPS, or Learned Perceptual Image Patch Similarity, calculates similarity between two images by analyzing feature values extracted from VGG networks. See here.

172. In summary, lower LPIPS scores mean that the network’s output is perceptually closer to the ground truth.

173. Now, let’s revisit the charts. First, we have the performance on the Mip-NeRF360 dataset.

174. We are consistently within the top three, and are the best except for the PSNR test.

175. Next, we have the results for the Tanks&Temples dataset.

176. Similarly, we are consistently within the top three, except for ‘Ours-7K,’ which didn’t make it to the top three in PSNR. As mentioned earlier, it’s intriguing that we perform well in SSIM but not in PSNR. I might be updating on this.

177. Lastly, here are the results for the Deep Blending dataset.

178. As you can see, we have achieved consistently strong and impressive results.

179. Finally, there’s an ablation study focusing on key choices.

180. ‘Limited-BW’ refers to cases where gradient computation is limited to a certain number of Gaussians (e.g., N=10).

181. If we restrict the number of points that receive gradients, the impact on visual quality is quite significant. On the left, you can see the results when we limit the number of Gaussians receiving gradients to 10. In this particular test, we chose N=10, which is twice the default value in Pulsar. However, this led to unstable optimization due to significant approximations in gradient computations.

182. Now, let’s talk about “Random Init.” This approach involves initializing the system not with point clouds obtained from Structure-from-Motion (SfM) but by uniformly sampling points.

183. “We’ve also assessed the importance of initializing the 3D Gaussians from SfM point clouds. For this ablation study, we uniformly sampled a cube with dimensions three times the size of the input camera’s bounding box. Surprisingly, our method still performs quite well, even without the SfM points. The degradation primarily occurs in the background.”

184. “No-Split” corresponds to a scenario where we skip the splitting of large Gaussians during the densification process, primarily in cases of over-reconstruction. Similarly, “No-Clone” refers to not using the cloning process, especially when dealing with under-reconstruction.

185. “When we disable each of these methods separately while keeping the rest of the method unchanged, we can observe interesting results. Splitting large Gaussians is crucial for achieving good background reconstruction, as seen in Figure 8. On the other hand, cloning small Gaussians instead of splitting them allows for better and faster convergence, especially when dealing with thin structures in the scene.”

186. “No-SH” indicates a scenario where spherical harmonics (SH) are not used in the process.

187. “Isotropic” signifies that the 3D Gaussians are designed to be isotropic rather than anisotropic.

188. If you’re unsure about what these terms mean, let’s take a moment to define them. See here.

189. So, in simple terms, a standard 2D Gaussian distribution typically looks like this. See here for the figure.

190. When we make the covariance matrix diagonal, the Gaussian distribution transforms as shown here.

191. And when we make a 2D Gaussian isotropic, it changes like this.

192. The reason one might limit the covariance matrix to be diagonal, or make it isotropic, i.e., proportional to the identity matrix, is that as data dimensions increase, the number of parameters grows significantly, making computation harder. However, constraining the model like this reduces its representational power, especially for complex scenes, which is why the paper keeps the full anisotropic covariance.
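
A toy contrast to make the terms concrete (purely illustrative values; the paper’s isotropic ablation replaces the three scales with a single learned radius):

```python
import numpy as np

scale = np.array([0.5, 0.1, 0.02])       # anisotropic: a thin, elongated splat
aniso_cov = np.diag(scale ** 2)          # (rotation omitted for brevity)

radius = float(scale.mean())             # "Isotropic" ablation: one scalar radius per Gaussian
iso_cov = (radius ** 2) * np.eye(3)      # proportional to the identity -> a sphere

print(np.diag(aniso_cov), np.diag(iso_cov))
```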

193. “An important algorithmic choice in our method is the optimization of the full covariance matrix for the 3D Gaussians. To demonstrate the effect of this choice, we perform an ablation where we remove anisotropy by optimizing a single scalar value that controls the radius of the 3D Gaussian on all three axes. The results of this optimization are presented visually in Fig. 10. We observe that the anisotropy significantly improves the quality of the 3D Gaussian’s ability to align with surfaces, which in turn allows for much higher rendering quality while maintaining the same number of points. In comparison to previous explicit scene representations, the anisotropic Gaussians used in our optimization are capable of modelling complex shapes with a lower number of parameters.”

194. Now that we’ve covered this section, let’s briefly touch upon the improvements suggested for 3D Gaussian Splatting in the paper.

195. Firstly, our method, like others, struggles in regions where the scene is not well-observed.

196. “Our method is not without limitations. In regions where the scene is not well observed we have artifacts; in such regions, other methods also struggle (e.g., Mip-NeRF360 in Fig. 11).”

197. Anisotropic Gaussians were utilized, but it was also mentioned that splotchy Gaussians can appear. “Splotchy” simply means having irregular patches or spots. As with previous techniques, splotches may appear in areas where the training was not well-conducted.

198. “Even though the anisotropic Gaussians have many advantages as described above, our method can create elongated artifacts or “splotchy” Gaussians; again, previous methods also struggle in these cases.”

199. “We also occasionally have popping artifacts when our optimization creates large Gaussians; this tends to happen in regions with view-dependent appearance. One reason for these popping artifacts is the trivial rejection of Gaussians via a guard band in the rasterizer. A more principled culling approach would alleviate these artifacts. Another factor is our simple visibility algorithm, which can lead to Gaussians suddenly switching depth/blending order. This could be addressed by antialiasing, which we leave as future work. Also, we currently do not apply any regularization to our optimization; doing so would help with both the unseen region and popping artifacts.”

200. Just to clarify, the “guard band” mentioned here was used in the rasterizer step to reject Gaussians that were positioned extremely far from the scene.

201. “Popping” artifacts refer to undesirable visual effects that occur when there’s an abrupt and noticeable transition between different levels of detail in a 3D object. So, it seems like simply using a guard band to remove a portion of the Gaussian can lead to unnatural transitions.

This concludes the review, with the current content covering up to number 201. I may add or edit information in the future to provide updates. This post is primarily for my personal understanding of the topic, and I kindly acknowledge that there may be errors or inaccuracies present. If you come across any, please feel free to bring them to my attention.

Thank you for taking the time to read this, and congratulations🎉🎉!
