NumByNum :: What You See is What You GAN — Rendering Every Pixel for High Fidelity Geometry in 3D GANs (Trevithick et al., 2024) Reviewed

Aria Lee
25 min read · Jan 5, 2024


This review of “What You See is What You GAN — Rendering Every Pixel for High Fidelity Geometry in 3D GANs (Trevithick et al., 2024)” begins at Number 1 and concludes at Number 158. I may make additions or revisions to the content in the future for updates. I am creating this post primarily for my personal understanding of the subject, and I humbly acknowledge that errors or inaccuracies may be present. If you happen to identify any such issues, please do not hesitate to bring them to my attention. Thank you, and hope you enjoy😊!

1. I’ve got a fresh paper that relates to the 3D models I looked into before and is also tied to the GAN I was planning to sort out separately, so I thought I’d tackle it all at once.

2. Right away, it’s “What You See is What You GAN: Rendering Every Pixel for High-Fidelity Geometry in 3D GANs” (Trevithick et al., 2024). It’s hot off the press, hitting arXiv on January 5th, so it’s genuinely sizzling. This is a brand-new paper.🔥

3. Firstly, we’ll dive into 1) what problems this paper is trying to solve. In that process, we’ll review how the previous 3D GAN-related studies have been conducted and what limitations they’ve faced.

4. Next, we’ll explore 2) StyleGAN and StyleGAN2, which relate to the ‘generation’ part of 3D generation, and 3) VolSDF, which relates to the ‘3D’ part. We’ll briefly summarize NeRF as well. After grasping these foundational studies, we’ll see 4) how the paper’s model is structured.

5. Honestly, the paper itself is articulated very neatly, making it approachable for those interested in this field. If you’re already familiar with the papers mentioned in Num 4, it might be worth reading this paper yourself. The explanations are friendly, making it not so hard to follow.

6. Admittedly, even with a meticulously organized paper, I’ll probably still take detours and wander off the beaten path! 😒🔍

7. Well then, let’s get started.

image from here

8. First off, we need to properly re-read our paper’s title.

image from here

9. The subtitle is “Rendering Every Pixel for High-Fidelity Geometry in 3D GANs.”

10. We touched on the basic concept of GANs in our previous Stable Diffusion paper review. There, we briefly surveyed the image synthesis models that preceded Stable Diffusion, categorizing them into 1) GAN-based and 2) likelihood-based approaches, and we referenced Generative Adversarial Networks (Goodfellow et al., 2014): a generator creates fake samples while a discriminator distinguishes real from fake, and through this competition the generator learns to produce increasingly realistic fake images.

11. We also drew an analogy: a forger prints counterfeit money while the police try to detect it, and through this back-and-forth the forger learns to produce ever more convincing counterfeits, just like the generator in a GAN.

12. 3D(-aware) GANs learn 3D geometry from collections of 2D images and use it to generate multi-view-consistent 2D images. In other words, they recover a 3D representation from the given 2D images, which enables synthesis of novel views, and they aim to do this with a GAN rather than with other generative frameworks.

13. Typically, they represent the 3D scene with neural fields and feature grids, and render it into 2D images with NeRF-style neural volume rendering.

14. We’ve previously glanced at these concepts bit by bit, but since it’s possible that they might have slipped from memory, let’s briefly revisit the concepts.

15. Firstly, let’s properly define what a neural field is.

16. Among the papers I’ve been thinking about separately to write about, there’s the “Neural Fields in Visual Computing and Beyond” (Xie et al., 2022) paper. There’s a lot to grasp, particularly related to 3DCV, so I’ve been considering summarizing it sometime. Coincidentally, there’s a section explaining what a neural field is, which I’ve brought here. Let’s take a look at the content below.

image from here

17. To understand the neural field, first, we need to grasp what a “field” means. It might be familiar, but in simple terms, a field refers to a physical quantity that varies at each point in space.

18. For instance, if there’s a function representing the distribution of temperature in space, assigning a single number (representing temperature) to each point makes it a field. Hence, Definition 1 above expresses a field as a value given by a function over time and space.

image from here

19. Let’s look at the figure above. For instance, an image is a vector field: given 2D (x, y) coordinates, it returns an RGB value. If you’ve studied SDFs, you’d know that an SDF gives the distance from a point to the surface of a 3D object; given a point in n dimensions, the signed distance to the surface is a single number, making the SDF a scalar field.

20. Therefore, a neural field refers to a field whose values are determined by a neural network.

image from here

21. Now, the title of the NeRF paper, “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis,” becomes clearer. When inputting position coordinates and viewing directions, it returns RGB values and density for that coordinate. NeRF accomplishes this using MLP.

22. In summary, a neural field simply means assigning values for each point in space, where the values are determined by a neural network. This might be a bit more precisely defined, but let’s stick to this level of understanding for now and move on.
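
To make the idea concrete, here is a minimal sketch of a neural field in PyTorch: an MLP that maps 3D coordinates to an RGB value and a density. The layer sizes and activations are arbitrary choices of mine for illustration, not the architecture of NeRF or of the reviewed paper.

```python
import torch
import torch.nn as nn

class TinyNeuralField(nn.Module):
    """A toy neural field: 3D coordinate -> (RGB, density)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),              # 3 channels for RGB, 1 for density
        )

    def forward(self, xyz):                    # xyz: (N, 3) points in space
        out = self.mlp(xyz)
        rgb = torch.sigmoid(out[:, :3])        # colors squashed into [0, 1]
        density = torch.relu(out[:, 3:])       # densities kept non-negative
        return rgb, density

# The "field" can be queried at any continuous point in space:
field = TinyNeuralField()
rgb, density = field(torch.rand(1024, 3))      # 1024 random 3D query points
```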

23. Moving forward, a feature grid literally means directly storing features in a grid. Here’s a picture illustrating what a feature grid is, which intuitively demonstrates it. Let’s take a look below.

image from here

24. Although this figure is from another paper, we’ll focus solely on what a feature grid is.

25. When we typically feed a 2D image into a CNN, we obtain a feature map. The colorful sections labeled as 2D feature maps in the image above represent that.

26. A 3D feature grid is simply elevating this concept by one dimension. In other words, if a 2D feature map transfers features from each point in an image to a 2D grid, a 3D feature grid conveys features from each point in a 3D object to a 3D cube.
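
As a small illustration of querying such a grid at continuous coordinates, here is a sketch using trilinear interpolation via `torch.nn.functional.grid_sample`. The grid size and channel count are placeholders of mine, not values from any of the papers discussed.

```python
import torch
import torch.nn.functional as F

# A hypothetical 3D feature grid: 32 feature channels on a 64^3 voxel grid.
feature_grid = torch.randn(1, 32, 64, 64, 64)            # (batch, C, D, H, W)

def query_feature_grid(grid, points):
    """Trilinearly interpolate grid features at continuous points in [-1, 1]^3.

    points: (N, 3) coordinates; returns an (N, C) tensor of features.
    """
    n = points.shape[0]
    coords = points.view(1, n, 1, 1, 3)                   # pseudo-volume of query locations
    sampled = F.grid_sample(grid, coords, mode='bilinear', align_corners=True)
    return sampled.view(grid.shape[1], n).t()             # (1, C, n, 1, 1) -> (n, C)

feats = query_feature_grid(feature_grid, torch.rand(100, 3) * 2 - 1)
```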

27. Referring back to Num 13, we described the usual approach of general 3D GAN models as using neural fields and feature grids to represent 3D shapes and NeRF to render them into 2D images. Now, we’re examining the definitions of neural fields and feature grids.

28. Now, let’s revisit what NeRF is. We’ve mentioned it before in the 3D Gaussian Splatting paper review.

29. NeRF involves 1) building training data from N 2D images together with their camera poses, 2) training a single fully-connected network that takes a 5D input, a 3D position (x, y, z) plus a viewing direction (theta, phi), and outputs the color and density at that point, 3) casting rays from the desired viewpoint, hierarchically sampling points along each ray, feeding those 5D inputs into the network to obtain colors and densities, and compositing them with volume rendering to produce a new 2D image from that angle, and 4) optimizing the network by computing a loss against the ground-truth images.

image from here

30. We’ll probably cover NeRF separately in another article, but for now, let’s understand it at this level since the journey ahead is quite long.

31. Now, let’s return to Nums 12 and 13. We briefly touched on how 3D(-aware) GANs create 3D geometry using multiple 2D images and generate multi-view-consistent 2D images, utilizing a GAN-based approach in the process. Additionally, we mentioned that previous 3D GAN models usually represent 3D shapes using neural fields and feature grids and use NeRF for 3D-to-2D rendering.

32. However, these papers have encountered a common challenge.

33. Specifically, neural volume rendering consumes excessive memory and computation.

34. As we mentioned before, calculating the output of an MLP for every given 3D coordinate involves significant computational overhead, leading to slower training speeds. We discussed this when we talked about NeRF’s emergence and the subsequent influx of improvements, such as InstantNGP and Plenoxel.

35. Similar issues persist here. Because NeRF-style rendering is so computationally demanding and slow, subsequent research has either 1) rendered at low resolution and relied on a 2D CNN as post-processing to reach the target resolution, or 2) tried to keep rendering at high resolution while improving computational efficiency.

36. The limitation of the former approach is its tendency to generate view-inconsistent results. Supplementing high-frequency details after rendering the created 3D geometry into images may compromise overall 3D consistency and quality.

37. Therefore, focusing on 3D consistency, the latter approach attempts 1) to increase rendering speed or 2) to apply patch-based training.

38. Let’s briefly explore attempts to boost rendering speed.

39. One of the distinctive characteristics of a 3D scene is that its information is relatively sparse. If we enclose an object in a 512 x 512 x 512 voxel grid, some cells contain information, but a great many remain completely empty, and the proportion of empty space only grows with resolution. Exploiting this sparsity of 3D scenes is one way to speed up rendering.

40. Naturally, this approach may be more efficient. However, the drawback lies in its performance being acceptable only under specific conditions, limiting the diversity of generated scenes and viewing angles.

41. The second approach involves patch-based training to reduce the area to be considered at once. Referencing a figure from the paper “Mimic3D: Thriving 3D-Aware GANs via 3D-to-2D Imitation” (Chen et al., 2023), you can observe the division of images into patches.

image from here

42. This method intuitively confines the receptive field to local patches, potentially making it challenging to create accurate representations.

43. However, our paper differs in that it 1) remains highly generalizable without limiting scenes and 2) does not break down the entire image into pieces but examines it all at once. Thus, our geometry and rendering results align well.

44. At this point, we’ve explored branches of research aiming to improve neural volume rendering in 3D GANs. Yet, naturally, improving 3D GANs isn’t only about rendering. There are also research directions that change how the 3D object is represented, replacing the conventional radiance field with an implicit surface representation such as an SDF.

image from here

45. Further research has aimed to improve the computational efficiency and resolution of implicit surface representations. One such study is VolSDF, which renders an SDF-based implicit surface volumetrically, and another is Adaptive Shells, which builds on implicit surfaces to make rendering more efficient.

46. Our paper uses both of these, so for now, understanding this flow of ideas should suffice. We’ll delve deeper into how they work shortly.

47. Having reached this point, we’ve briefly touched on the attempts made in the field of 3D GANs before our paper. Now, let’s see what direction our paper has chosen.

48. Ultimately, our aim is to perform 3D generation, reconstructing 3D objects from 2D images. Specifically, in our proposed model, the ‘generation’ aspect relates to StyleGAN and StyleGAN2, while the ‘3D’ aspect connects to the previously mentioned VolSDF. Therefore, to better comprehend our paper, we’ll briefly examine these papers first.

49. Especially, StyleGAN stands out as a crucial paper, worthy enough to merit its separate discussion.

50. So, let’s start with StyleGAN. It’s the paper titled “A Style-Based Generator Architecture for Generative Adversarial Networks” (Karras et al., 2019).

image from here

51. Let’s revisit what was mentioned in Num 10. We previously discussed that Generative Adversarial Networks (GAN) involve the competition between a generator, creating fake samples, and a discriminator, distinguishing between real and fake, allowing the generator to produce more realistic fake images over time.

52. However, let’s assume we want to train a GAN with high-resolution images of 1024 x 1024. Intuitively, training the discriminator would be too easy because the quality between real images and model-generated images would be vastly different.

53. Therefore, training starts at 4 x 4 to learn coarse, global structure, and the resolution is progressively increased up to 1024 x 1024, which stabilizes training for generating high-resolution images.

54. This model is known as PGGAN, proposed in the “Progressive Growing of GANs for Improved Quality, Stability, and Variation” (Karras et al., 2018) paper by the same author. Since the resolution grows step by step, the name “progressive” fits.

55. However, PGGAN has one problem.

56. It’s the issue of entanglement.🤔

image from here

57. Let’s take a look at the image above, taken from a (Coursera) lecture slide. It perfectly illustrates what entanglement is.

58. Our aim is to solely introduce the feature of a ‘beard’ into the image of the woman at the top left, resulting in an image of a woman with a beard on the top right. Can we discover a new feature vector z’ by adjusting the existing feature vector z to specifically incorporate ‘only a beard’ into the woman’s image?

59. The answer is no.😞 It’s impossible in the conventional GAN because the ‘beard’ feature is entangled with the ‘man’ feature. Let’s explore why this phenomenon occurs while looking at the image below.

image from here

60. The leftmost shape (a) represents the feature distribution of the actual training set. Let’s assume that moving towards the right on the x-axis represents the ‘woman’ feature, and moving upwards on the y-axis represents the ‘mustache’ feature. Since ‘women with mustaches’ are less common in the training data, the top left is empty.

61. However, a typical GAN samples z from a fixed (e.g., Gaussian) latent space Z, so the mapping from Z to the feature distribution in (a) has to warp to avoid that empty region, as shown in (b). Naturally, when the axes are twisted like this, the factors of variation become entangled.

62. StyleGAN, on the other hand, uses a mapping network that maps to a latent space W, similar in shape to the training set distribution, avoiding entanglement, as in (c).

63. While the previous GAN generator could only take one z as input and couldn’t selectively adjust specific features, StyleGAN allows for changing only the desired features by modifying the W values in the disentangled latent space. That’s why it’s called StyleGAN, enabling the modification of specific styles.

image from here

64. This is the most fundamental idea behind StyleGAN.

65. Now, let’s examine how the actual model architecture looks.

image from here

66. The style-based generator (b) on the right (in the dashed box) represents the structure of StyleGAN. Upon closer inspection, it’s divided mainly into 1) a mapping network f containing many Fully Connected (FC) layers and 2) a synthesis network g containing Adaptive Instance Normalization (AdaIN).

67. The mapping network f is a network that maps a latent vector z sampled from the entangled (Gaussian) latent space Z to w in the disentangled latent space W. In simple terms, as discussed earlier, it transforms entangled features (z) into disentangled features (w).

68. The structure is simple. If the network receives a 512-dimensional z, it undergoes 8 MLP layers to be straightforwardly mapped to a 512-dimensional w. This corresponds to the left part in the figure.
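
A rough sketch of that mapping network is below. The pixel normalization of z and the leaky ReLU follow my reading of the StyleGAN paper, but details such as the equalized learning rate are omitted, so treat it as an illustration rather than the official implementation.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """StyleGAN-style mapping network f: z (512-d) -> w (512-d) via 8 FC layers."""
    def __init__(self, dim=512, num_layers=8):
        super().__init__()
        layers = []
        for _ in range(num_layers):
            layers += [nn.Linear(dim, dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        # Normalize z to the unit hypersphere (StyleGAN's "pixel norm") before mapping.
        z = z / (z.pow(2).mean(dim=1, keepdim=True) + 1e-8).sqrt()
        return self.net(z)

w = MappingNetwork()(torch.randn(4, 512))   # 4 latent codes -> 4 intermediate latents
```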

image from here

69. Now that we have the latent vector w, it’s time to generate (synthesize) images using this w. That’s the role of the synthesis network g. Let’s look at the image below.

image from here

70. As previously mentioned, StyleGAN is based on a Progressive Growing GAN (PGGAN) that sequentially increases resolution from 4x4 to 1024x1024. Observing the image, we can see that the synthesis network of StyleGAN consists of a total of 9 (gray) style blocks, from 4x4 to 1024x1024.

71. Latent vector w passes through A and enters each style block. A is called the learned affine transform. It’s a trainable parameter, and we’ll delve into its role shortly.

72. Now, let’s delve deeper into the gray style blocks. After upsampling, each block applies a convolution, then AdaIN, then another convolution, followed by another AdaIN.

73. AdaIN is a layer frequently used in the style transfer field.

image from here

74. Let’s briefly look at the image above. The fundamental assumption of style transfer is the ability to separate 1) the actual content and 2) the style from an image. For instance, in the given example, the building represents the content, and the painting style of Van Gogh represents the style. Therefore, it’s possible to generate an image retaining the building’s content while incorporating Van Gogh’s style into it.🎨

75. AdaIN utilizes the mean and variance of the feature space to extract only the content of the image and applies the style to perform style transfer. Take a look at the equation below.

image from here

76. The detailed theoretical aspects will be covered later when discussing “Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization” (Huang et al., 2017) paper. For now, accepting it intuitively should suffice.

77. AdaIN assumes it can determine the style somewhat based on the mean and standard deviation of the feature space. Hence, it represents the style using the mean and standard deviation of the feature space.

image from here

78. In the equation, x represents our image, μ(x) is the mean of its features, and σ(x) is their standard deviation. Normalizing x, i.e., subtracting μ(x) and dividing by σ(x), strips away the style and leaves only the content of our image.

79. Conversely, multiplying it by σ(y) and adding μ(y) infuses the style of image y into the content-only image x.

80. Here, σ(y) and μ(y) come from the learned affine transformation A mentioned earlier (Num 71): A maps the w obtained from the mapping network f to a style vector y_i = [σ(y_i), μ(y_i)].
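
Putting Nums 77 to 80 together, a minimal AdaIN sketch might look like the following; the tensor shapes are assumptions for illustration, and in StyleGAN the style statistics would come from the affine transform A applied to w.

```python
import torch

def adain(x, style_mean, style_std, eps=1e-5):
    """Adaptive Instance Normalization: re-style content x with the given statistics.

    x: (N, C, H, W) content feature map.
    style_mean, style_std: (N, C) per-channel style statistics.
    """
    mu = x.mean(dim=(2, 3), keepdim=True)          # per-sample, per-channel mean
    sigma = x.std(dim=(2, 3), keepdim=True) + eps  # per-sample, per-channel std
    normalized = (x - mu) / sigma                  # strip the style, keep the content
    return normalized * style_std[:, :, None, None] + style_mean[:, :, None, None]

x = torch.randn(2, 64, 32, 32)                     # a dummy content feature map
out = adain(x, style_mean=torch.zeros(2, 64), style_std=torch.ones(2, 64))
```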

image from here

81. Returning to the previous image, we transformed w through A into style vector y. This style vector y takes on the role of style in style transfer.

82. But what about the content? Who manages that?🤔

83. The output after the convolution layer, on the far right, plays the role of the content x_i. In the image, you can see x_i as the normalized feature in the light blue box. This content is scaled and shifted by the style vector to produce the output of the style block, and the blocks repeat nine times until the resolution reaches 1024x1024 in the StyleGAN architecture.

84. However, StyleGAN encounters several issues. Later on, we’ll delve into them, but the most notable problem is the unintended noise resembling water droplets present in the generated images.

85. To address these issues, the same author proposed an improved version called StyleGAN2 in the paper “Analyzing and Improving the Image Quality of StyleGAN” (Karras et al., 2019).

image from here

86. This paper aims to eliminate strange artifacts in StyleGAN-generated images and upgrade the model. Therefore, to begin, let’s examine the issues arising in the original StyleGAN. Let’s see below:

image from here

87. The red-circled area denotes the water droplet-like artifacts mentioned in the paper, appearing consistently in all images generated by StyleGAN.

88. However, peculiarly, by merely removing the normalization step in StyleGAN and regenerating the images, these artifacts don’t appear at all. Here, the normalization step refers to the part where, as seen earlier, the AdaIN layers subtract the mean and divide by the standard deviation of the feature map.

89. Therefore, StyleGAN2 opted to use expected statistics instead of explicitly utilizing the feature map’s statistics for normalization. Let’s look below:

image from here

90. On the left side, we see the original StyleGAN structure. The part we should pay attention to is (c). In the blue box, you’ll notice it only divides by the standard deviation, unlike the AdaIN used before, which also subtracted the mean.

91. Additionally, the original model (which we skipped over) involved a noise broadcast operation, adding finely detailed and stochastic variations (e.g., pores, freckles, etc.) sampled from a Gaussian to the feature map. However, in the revised model, that B is now positioned outside the style block. Similarly, the bias, b, has also moved out of the block.

image from here

92. But that’s not all. Now, from (c) to (d), we’ll make a few more modifications.

93. The key change here is entirely removing instance normalization and replacing it with demodulation. In other words, we’ll not only stop at not subtracting the mean but also skip the step of dividing by the standard deviation.

94. Instead, after obtaining the scaling factor σ(y_i) from A for each input feature map i, we multiply the convolution weights by it (modulation), and then divide the weights belonging to each output feature map j by their L2 norm (demodulation), which plays the same statistical role as dividing that output feature map by its expected standard deviation. Let’s refer to the following simple equations:

image from here

95. This modification eliminates direct feature map normalization, instead normalizing the convolution weights. By making this change, StyleGAN’s consistent issue, the water-droplet artifacts, completely disappear.
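
A compact sketch of this modulation and demodulation step, written for a single sample and ignoring implementation details such as grouped convolutions, could look like this (shapes are illustrative only).

```python
import torch

def modulate_demodulate(weight, style, eps=1e-8):
    """StyleGAN2-style weight (de)modulation.

    weight: (out_ch, in_ch, k, k) convolution weight.
    style:  (in_ch,) scaling factors produced by the affine transform A.
    """
    w = weight * style[None, :, None, None]                  # modulate input channels
    demod = torch.rsqrt(w.pow(2).sum(dim=(1, 2, 3)) + eps)   # inverse L2 norm per output channel
    return w * demod[:, None, None, None]                    # demodulate

weight = torch.randn(128, 64, 3, 3)
style = torch.rand(64) + 0.5
w_demod = modulate_demodulate(weight, style)                 # use w_demod in the convolution
```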

96. This is a highly condensed overview, and there are many more points to explore in the actual paper. But for our purposes today, this should be sufficient for reading our paper. Let’s conclude here.

97. Now, let’s briefly revisit Num 48. Ultimately, our aim is to perform 3D generation, reconstructing 3D objects from 2D images. Our paper involves the ‘generation’ aspect related to StyleGAN and StyleGAN2 and the ‘3D’ part connected to VolSDF. Hence, we’re skimming through papers related to StyleGAN for context.

98. Once we’ve covered the StyleGAN-related papers, we’ll be ready to focus solely on VolSDF, and reviewing VolSDF will wrap up our preparation. Let’s take a quick look at “Volume Rendering of Neural Implicit Surfaces” (Yariv et al., 2021), better known as VolSDF. Almost there!

image from here

99. Firstly, let’s understand what SDF means since the model’s name is VolSDF.

image from here

100. In essence, it’s simply a field where the signed distance is given as a value.

101. Suppose we have a 3D object. If we assign to each point in space the shortest distance from that point to the object’s surface, with a negative sign for points inside the surface and a positive sign for points outside (the convention VolSDF uses; some works flip the signs), we can assign a unique value to each point. This is the signed distance.

102. By knowing this, we can determine where the surface of the 3D object is and naturally obtain a representation of the 3D object. In this way, we obtained a signed distance field that represents a specific 3D object.
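
As a tiny worked example, here is the signed distance field of a sphere, using the negative-inside, positive-outside convention mentioned above.

```python
import numpy as np

def sphere_sdf(points, center=np.zeros(3), radius=1.0):
    """Signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(points - center, axis=-1) - radius

pts = np.array([[0.0, 0.0, 0.0],    # center  -> -1.0 (inside)
                [1.0, 0.0, 0.0],    # surface ->  0.0
                [2.0, 0.0, 0.0]])   # outside -> +1.0
print(sphere_sdf(pts))
```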

103. Returning to VolSDF, its principle is quite straightforward.

image from here

104. The third image, with contour-like lines, visualizes the signed distance function we just discussed. This SDF is then transformed into the volumetric density required for rendering through the process described in the image below, which corresponds to the second (bull) image in the figure at Num 103.

image from here

105. The reason this transformation is valid is explained in Section 3.1 of the paper, “Density as transformed SDF.” To avoid making this article too lengthy for today, let’s move forward for now; if you’re curious, it might be worth reading the explanation in the paper later. Let’s just accept that we’ve obtained a density.
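
For reference, the transformation in question turns the signed distance d(x) into a density σ(x) = α · Ψ_β(−d(x)), where Ψ_β is the CDF of a zero-mean Laplace distribution with scale β and α defaults to 1/β. A small sketch of that formula:

```python
import torch

def sdf_to_density(sdf, beta, alpha=None):
    """VolSDF-style density from a signed distance (negative inside, positive outside).

    Small beta concentrates the density tightly around the zero level set (the surface).
    """
    if alpha is None:
        alpha = 1.0 / beta
    s = -sdf
    # CDF of a zero-mean Laplace distribution with scale beta, evaluated at s.
    psi = torch.where(
        s <= 0,
        0.5 * torch.exp(s / beta),
        1.0 - 0.5 * torch.exp(-s / beta),
    )
    return alpha * psi

density = sdf_to_density(torch.tensor([-0.1, 0.0, 0.1]), beta=0.05)
```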

image from here

106. Now that we’ve obtained volumetric density and a signed distance function from the input images on the far left, all that’s left is to render the completed 3D representation to get the image on the far right. The paper provides detailed explanations, and there are considerable overlaps with NeRF, so reading section 3.2, “Volume rendering of σ” might be worthwhile.

107. Before delving into our paper’s pipeline, we briefly explored some background models that would be good to know. While much of it has been condensed for brevity’s sake, we can dive deeper into these aspects later. For now, let’s buckle up and dash through our paper.🚀

image from here

108. Let’s start by looking at the image below. This depicts our paper’s entire pipeline.

image from here

109. The leftmost part, labeled as triplane T, can be understood by referring to the “Efficient Geometry-aware 3D Generative Adversarial Networks” (Chan et al., 2022) paper, which our paper references.

image from here

110. Looking at the model architecture of this paper, it aligns explicit features to three axis-aligned orthogonal feature planes. Then, given any coordinates (x, y, z), it projects those coordinates onto each plane to obtain the feature vector (F_xy, F_xz, F_yz), which is then aggregated and passed to the next renderer.
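
A hedged sketch of that projection-and-aggregation step is below. Summing the three plane features and using bilinear `grid_sample` are simplifications of mine; in EG3D the aggregated feature is further passed through a small decoder.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, points):
    """Query a triplane at continuous 3D points.

    planes: (3, C, H, W) feature planes ordered as (xy, xz, yz).
    points: (N, 3) coordinates assumed to lie in [-1, 1]^3.
    Returns (N, C) aggregated features.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    coords = torch.stack([
        torch.stack([x, y], dim=-1),    # projection onto the xy plane
        torch.stack([x, z], dim=-1),    # projection onto the xz plane
        torch.stack([y, z], dim=-1),    # projection onto the yz plane
    ])                                   # (3, N, 2)
    grid = coords.unsqueeze(2)           # (3, N, 1, 2) for grid_sample
    feats = F.grid_sample(planes, grid, mode='bilinear', align_corners=True)
    return feats.squeeze(-1).sum(dim=0).t()    # (3, C, N, 1) -> (N, C)

planes = torch.randn(3, 32, 512, 512)          # a dummy 32-channel triplane at 512 x 512
f = sample_triplane(planes, torch.rand(100, 3) * 2 - 1)
```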

image from here

111. In the above image, (a) NeRF represents the scene entirely implicitly, resulting in slow computation, while (b) explicit voxel-based methods struggle to scale to high resolution. Therefore, (c) the triplane hybrid is a suitable compromise between the two.

image from here

112. Similarly, we obtain three axis-aligned 2D feature grids (f_xy, f_xz, f_yz) using this approach. Here, the previously discussed StyleGAN2 plays a crucial role. It’s responsible for obtaining the 2D triplane representation (f_xy, f_xz, f_yz) by introducing noise z.

113. To be precise, a 512-dimensional z passes through a mapping network to acquire w = f(z). This w then passes through a synthesis network to obtain T(w) of size 3 x 32 x 512 x 512, which essentially becomes three 512 x 512 resolution feature grids with a feature size of 32. This creates the Triplane.

114. So, we name this initially obtained triplane T’.

115. Now, we’ll refine this T’ to create a more expressive triplane feature, denoted as T. If we label the features of T’ as f’_ij, then each f’_ij goes through a synthesis block to create f_ij = SynthesisBlock(f’_ij) separately for each orthogonal plane. This not only makes the features more expressive but also helps in eliminating entanglement among the planes by applying separate plane-specific kernels.

116. Now, let’s apply the concepts learned from VolSDF.

117. Given the triplane T and a 3D position x, we feed the positional encoding of x together with the triplane feature sampled at x into an MLP.

image from here

118. This MLP takes both values and returns SDF value (s), variance (β), and geometry feature (f_geo). Here, β corresponds to the β mentioned briefly when discussing VolSDF…

image from here

119. It’s the same β that appears in Equation (3) as the scale of the Laplace CDF. Understanding it as the ‘tightness’ of the surface representation is sufficient.

120. With the equation that converts SDF to density and the MLP outputs (s, β, f_geo), we can assign a density to each coordinate x: plugging s and β into Equation (3) gives the density σ at each point.

121. However, each point’s radiance c is not determined by its density alone; as in NeRF, it also depends on the geometry feature and the viewing direction.

122. So, finally, we have an additional MLP that takes both the viewing direction and f_geo to return the final radiance c.
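
To tie the last few steps together, here is a hedged sketch of the two small MLPs; every layer width, the positional-encoding size, and the activations are guesses of mine for illustration, not the paper’s configuration. The density at each point would then come from plugging s and β into the Laplace-CDF transform sketched earlier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometryMLP(nn.Module):
    """(triplane feature, positional encoding of x) -> (SDF s, variance beta, f_geo)."""
    def __init__(self, feat_dim=32, pe_dim=24, geo_dim=32, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + pe_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + 1 + geo_dim),
        )

    def forward(self, feat, pe):
        out = self.net(torch.cat([feat, pe], dim=-1))
        s = out[:, :1]                          # signed distance
        beta = F.softplus(out[:, 1:2])          # keep the scale positive
        f_geo = out[:, 2:]                      # geometry feature
        return s, beta, f_geo

class ColorMLP(nn.Module):
    """(viewing direction, f_geo) -> radiance c."""
    def __init__(self, geo_dim=32, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + geo_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, view_dir, f_geo):
        return self.net(torch.cat([view_dir, f_geo], dim=-1))

geo, col = GeometryMLP(), ColorMLP()
s, beta, f_geo = geo(torch.randn(10, 32), torch.randn(10, 24))
c = col(torch.randn(10, 3), f_geo)
```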

image from here

123. Let’s briefly recall the explanation for NeRF in Num 29.

image from here

124. Now, we’ve managed to obtain the radiance c and opacity σ in a different way than NeRF for volumetric rendering. To clarify, we’ve completed all the preparations for rendering.

125. The actual formula used for volumetric rendering is shown below.

image from here

126. It seems familiar, as we’ve encountered a similar formula in NeRF.

image from here

127. Looking at our rendering equation, it appears quite similar.

image from here

128. Here, δ represents the interval between samples along the ray. So, the expression for T_i in the second line implies that 1) the less dense the points in front of it are (i.e., the smaller their σ), and 2) the smaller the intervals δ they cover, the larger the accumulated transmittance T_i of that point.

129. Simply put, the larger the transmittance of the i-th point, the less obstructed it is by objects in front, making it more visible in the rendered image.

130. That’s why I labeled this term in Num 126 as ‘contribution weight.’ The larger the T_i value, the more visible the i-th point will be, exerting greater influence in the rendered image.

131. Now let’s complete the first line. It multiplies T_i by (1 − exp(−σ_i δ_i)).

132. In fact, this is reasonable. T_i signifies how unobstructed the point is, while (1 − exp(−σ_i δ_i)) conveys how much the point itself absorbs and emits light: the denser the point, the larger this term. Combining the two expresses how strongly the point shows up in the rendered pixel. Hence, I’ve labeled this term in Num 126 as ‘volume possession’, representing how much of the pixel the point claims.

133. To summarize, a point’s influence on the rendered pixel is determined by a combination of 1) how little the points in front of it block the ray and 2) how much the point itself contributes.
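
In code form, the discrete rendering equation above might be sketched as follows; the sample count and values are placeholders.

```python
import torch

def volume_render(colors, sigmas, deltas):
    """Discrete volume rendering: C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i.

    colors: (S, 3) radiance per sample; sigmas: (S,) densities; deltas: (S,) intervals.
    """
    alphas = 1.0 - torch.exp(-sigmas * deltas)           # local contribution of each sample
    # T_i = exp(-sum_{j<i} sigma_j * delta_j): transmittance up to (not including) sample i
    trans = torch.exp(-torch.cumsum(sigmas * deltas, dim=0))
    trans = torch.cat([torch.ones(1), trans[:-1]])       # shift so that T_1 = 1
    weights = trans * alphas                             # per-sample weight in the final pixel
    return (weights[:, None] * colors).sum(dim=0), weights

colors = torch.rand(192, 3)
sigmas = torch.rand(192) * 5.0
deltas = torch.full((192,), 1.0 / 192)
pixel, weights = volume_render(colors, sigmas, deltas)
```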

image from here

134. Currently, we’re addressing the section labeled ‘uniform sampling’ under the arrow. While we’ve acquired the various features needed for rendering from the triplane T, the sample intervals (δ) are fixed: whether or not there’s an object at a given depth, points are sampled uniformly along the ray.

135. However, upon reflection, it seems more rational to sample densely where objects actually exist and avoid wasting samples on empty space. So, instead of this brute-force approach, the paper employs adaptive sampling.

136. Initially, after obtaining the triplane T, we take 192 coarse sample points per ray and render a 128 x 128 image I_128, along with a weight tensor P_128 of shape 192 x 128 x 128.

137. Subsequently, we pass P_128 through a CNN to obtain P-hat_512.

138. Consequently, by upsampling the low-resolution weight P_128, we acquire more precise weights. This means that unlike the initial sampling when we had no knowledge of the actual object, we now possess a proposal regarding where the object might be more likely to exist.

139. Naturally, we can then concentrate our samples on the regions where the actual object is expected to be.

image from here

140. This corresponds to the adaptive sampling part indicated by the second arrow in the diagram.

141. With the proposal network in place, for each discrete predicted PDF p-hat we determine the smallest set of bins whose total probability mass exceeds the threshold of 0.98. In other words, we keep the fewest sample bins that together account for at least 98% of the predicted probability of where the surface lies.
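
One plausible way to implement that selection (my reading of the rule, so treat the greedy strategy below as an assumption) is to keep the highest-probability bins until their cumulative mass crosses the threshold.

```python
import torch

def select_bins(p_hat, threshold=0.98):
    """Return a mask over the smallest set of bins whose total mass exceeds `threshold`.

    p_hat: (S,) discrete PDF along one ray (assumed to sum to 1).
    """
    probs, idx = torch.sort(p_hat, descending=True)
    cum = torch.cumsum(probs, dim=0)
    k = int((cum < threshold).sum().item()) + 1   # number of top bins needed
    mask = torch.zeros_like(p_hat, dtype=torch.bool)
    mask[idx[:k]] = True
    return mask

p = torch.softmax(torch.randn(192), dim=0)
mask = select_bins(p)      # fine samples would then be placed only in the selected bins
```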

image from here

142. To maintain rendering efficiency, we employ regularization in addition to determining the set of sample points with a high probability of object presence. However, for detailed insights, referring to section 4.5 “Regularization for High-Resolution Training” in the paper is advisable, as I won’t delve into it extensively here.

143. Now, we just have one more question to resolve.

144. While we understand that the Proposal network enables more intelligent sampling through adaptive sampling, how exactly is this proposal network trained?🐾

image from here

145. The details for this are presented in section 4.3 of the paper. Initially, we sample 192 coarse samples from a given camera but in high-resolution, using a 64 x 64 patch. Thus, the ground truth P-bar_patch becomes 192 x 64 x 64.

146. We perform the following steps on it:

image from here

147. Ultimately, to train the Proposal network, we compute the cross-entropy loss between the predicted p-hat_patch by our proposal network and the ground truth p-bar_patch.
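
In code, that loss might look like the following sketch; the tensor shapes follow the 192 x 64 x 64 patch described above, and normalizing both tensors into per-ray distributions is my own assumption for the example.

```python
import torch

def sampler_loss(p_hat, p_bar, eps=1e-8):
    """Cross-entropy between the proposal network's prediction and the ground-truth weights.

    p_hat, p_bar: (S, H, W) discrete distributions over S samples for each ray of a patch.
    """
    p_hat = p_hat / (p_hat.sum(dim=0, keepdim=True) + eps)   # normalize per ray
    p_bar = p_bar / (p_bar.sum(dim=0, keepdim=True) + eps)
    return -(p_bar * torch.log(p_hat + eps)).sum(dim=0).mean()

loss = sampler_loss(torch.rand(192, 64, 64), torch.rand(192, 64, 64))
```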

image from here

148. You can summarize the differences in the sampling methods we’ve discussed so far at a glance in the diagram below.

image from here

149. Now, let’s summarize the final training pipeline and conclude the structure.

image from here

150. The first term, L_adv, is a standard loss term used in typical GANs.

151. The second term, L_sampler, as discussed in Num 147, is the loss of the proposal network.

image from here

152. The third term, L_surface, is the loss arising from the regularization part we skipped earlier. Let’s just confirm its form in the equation below and move on.

image from here

153. Lastly, L_dec is a loss added to address the issue that, once the proposal network is introduced, samples concentrate only around the object’s surface, so gradients flow only through points near the surface. Therefore, we leverage the I_128 obtained before the proposal network comes into play to create a 192 x 128 x 128 tensor S_128, and this is taken into account as well.

image from here

154. Up to this point, we’ve briefly examined the overall structure proposed in our paper. In the experiments presented, there are results compared with the state-of-the-art EG3D.

image from here

155. This is included for reference as a qualitative comparison, but the actual paper contains quantitative comparisons and several ablation studies, which might be worth checking out if you’re interested.

156. Finally, let’s briefly review the discussion section to wrap things up. Despite observing some unwanted artifacts in our paper’s results and the challenge of representing transparent objects, there are noteworthy aspects: 1) adeptly designed sampling strategies enabling full-resolution rendering during training, 2) operating with around 20 samples, unlike conventional 3D GAN models that generally require 96 samples per ray, and 3) exhibiting better performance than previous 3D GAN models.

157. After reading through it all, the subtitle of this paper, “Rendering Every Pixel for High-Fidelity Geometry in 3D GANs,” becomes a bit clearer.🌟 While the initial focus of this text was more on “3D GANs,” the emphasis now shifts to “Rendering Every Pixel.” By making the sampler learning-based, the number of samples needed per ray is cut to roughly a fifth, allowing “every pixel” of a high-resolution image to be rendered during both training and inference in far less time.

158. And just like that, we’ve sprinted through the freshly baked paper that hit the shelves just a few hours ago. As there might be parts missed due to quickly skimming through, I’ll bring forth any updates or new revelations that may arise upon a closer second read. Let’s keep those neurons firing! 📚🚀

This concludes the review, with the current content covering up to number 158. I may add or edit information in the future to provide updates. This post is primarily for my personal understanding of the topic, and I kindly acknowledge that there may be errors or inaccuracies present. If you come across any, please feel free to bring them to my attention.

Thank you for taking the time to read this, and congratulations🎉🎉
