NumByNum :: Understanding Neural Fields & Neural Fields in Visual Computing and Beyond (Xie et al., 2022) Reviewed

Aria Lee
22 min read · Jan 16, 2024


This review of “Understanding Neural Fields & Neural Fields in Visual Computing and Beyond (Xie et al., 2022)” begins at Number 1 and concludes at Number 147. I may make additions or revisions to the content in the future for updates. I am creating this post primarily for my personal understanding of the subject, and I humbly acknowledge that errors or inaccuracies may be present. If you happen to identify any such issues, please do not hesitate to bring them to my attention. Thank you, and hope you enjoy😊!

1. NeRF is short for Neural Radiance Fields.
image from here

2. But wait a neural second. What in the world is this neural field?🤔 It feels like we’ve been dancing around the concept without giving it the spotlight it deserves. So, how about we take this moment to systematically organize and understand the concept of neural fields? Let’s get into the rhythm of these ideas.

3. The cornerstone for our exploration lies in the paper “Neural Fields in Visual Computing and Beyond” (Xie et al., 2022), which was mentioned in the previous What You GAN paper review. Although it’s not the latest survey paper, taking a brief look seems valuable since it delves into crucial concepts.

4. Today’s discussion will follow this sequence. First and foremost, we will precisely define what a neural field is. Then, we’ll categorize the topics where research choices diverge into 1) prior learning and conditioning, 2) hybrid representations, 3) forward maps, 4) network architecture, and 5) manipulating neural fields.

5. The first hurdle to tackle is understanding exactly what our main topic, the neural field, is. We’ve already explored this in the What You GAN paper review. Let’s quickly go over it.

image from here

6. To understand the neural field, first, we need to grasp what a “field” means. It might be familiar, but in simple terms, a field refers to a physical quantity that varies at each point in space.

7. For instance, if there’s a function representing the distribution of temperature in space, assigning a single number (representing temperature) to each point makes it a field. Hence, Definition 1 above expresses a field as a value given by a function over time and space.

image from here

8. Let’s look at the figure above. For instance, an image can be seen as a vector field: given 2D (x, y) coordinates, it returns the RGB values at that location. If you’ve studied SDFs, you’d know that a signed distance field represents the distance from a point to the surface of a 3D object; given a point in n dimensions, the signed distance to the surface is a single number, making it a scalar field.

9. Therefore, a neural field refers to a field that is fully or partially parameterized by a neural network.

image from here

10. Now, the title of the NeRF paper, “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis,” becomes clearer. Given position coordinates and a viewing direction as input, it returns the RGB value and density at that coordinate. NeRF accomplishes this with an MLP.
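
To make this concrete, here is a minimal sketch (in PyTorch) of what such a coordinate MLP could look like. The class name, layer sizes, and structure are my own illustrative choices, not the actual NeRF architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch of a NeRF-style neural field: an MLP that maps a 5D input
# (3D position + 2D viewing direction) to an RGB color and a density value.
class TinyRadianceField(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(5, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),          # 3 color channels + 1 density
        )

    def forward(self, coords):             # coords: (N, 5) = (x, y, z, theta, phi)
        out = self.mlp(coords)
        rgb = torch.sigmoid(out[:, :3])    # colors in [0, 1]
        sigma = torch.relu(out[:, 3])      # density is non-negative
        return rgb, sigma

field = TinyRadianceField()
rgb, sigma = field(torch.rand(1024, 5))    # query 1024 coordinate/view samples
```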

11. In summary, a neural field simply assigns a value to each point in space, with the value determined by a neural network. The definition could be made more precise, but let’s stick to this level of understanding for now and move on.

12. So, why has this technique gained so much attention over the past few years?

13. Fundamentally, a neural field is characterized by being continuous and adaptive.

14. Representing a 3D scene with a discrete data structure requires a substantial increase in memory as the resolution rises. With a neural network, however, the memory footprint scales with the number of network parameters rather than with the resolution of the scene.

15. In other words, with sufficient parameters, a neural network can efficiently store a continuous signal of arbitrary dimension and resolution. Additionally, parameterizing the neural field with an MLP allows optimization through analytic differentiability and gradient descent, even for ill-posed problems, given the well-defined gradients of the activation function.
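
As a rough back-of-the-envelope illustration of that scaling argument (my own example numbers, not figures from the paper):

```python
# Dense RGBA voxel grid vs. the weights of a modest coordinate MLP.
voxel_bytes = 512**3 * 4 * 4                      # 512^3 voxels, 4 channels, float32
mlp_params  = 8 * (256 * 256 + 256)               # ~8 hidden layers of width 256
mlp_bytes   = mlp_params * 4                      # float32 weights
print(f"dense grid : {voxel_bytes / 2**30:.1f} GiB")   # ≈ 2.0 GiB
print(f"MLP weights: {mlp_bytes  / 2**20:.1f} MiB")    # ≈ 2.0 MiB
```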

16. Having grasped the concept of a neural field, let’s take a brief journey into its operations before delving into the pivotal research directions tied to each step.

17. The general algorithm for a neural field involves 1) sampling coordinates in space-time; 2) passing these coordinates through a neural network to obtain field quantities, which are samples in the desired reconstruction domain (e.g., a 3D scene); 3) applying a forward map to relate this reconstruction to the sensor domain (e.g., an RGB image); and 4) calculating the reconstruction loss between the reconstructed signal and the sensor measurement to optimize the neural network. (A minimal code sketch of this loop follows the figure below.)

image from here
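
Here is a schematic sketch of that four-step loop. `field`, `forward_map`, and `sensor_measurement` are hypothetical placeholders standing in for whatever representation, renderer, and observation a particular method uses.

```python
import torch

# Schematic of the general neural-field optimization loop described above.
def fit_neural_field(field, forward_map, sensor_measurement, steps=1000):
    opt = torch.optim.Adam(field.parameters(), lr=1e-3)
    for _ in range(steps):
        coords = torch.rand(4096, 3)                  # 1) sample space(-time) coordinates
        quantities = field(coords)                    # 2) query field quantities at those coordinates
        predicted = forward_map(quantities, coords)   # 3) map reconstruction -> sensor domain (e.g. an image)
        loss = (predicted - sensor_measurement).pow(2).mean()  # 4) reconstruction loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return field
```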

18. While maintaining this overall flow, various research explores potential variations.

19. Through 1) prior learning and conditioning, it’s possible to induce better performance in reconstructing incomplete sensor signals.

20. Alternatively, 2) hybrid representations using discrete data structures may enhance memory or computational efficiency.

21. During the reconstruction process, 3) using the forward map, it’s possible to transform our representation to the sensor domain, providing supervision.

22. Improvements in performance and efficiency can also come from 4) the choice of network architecture.

23. Finally, to make the representation editable, it’s possible to 5) manipulate the neural field.

24. Within these five categories, different studies make distinct choices. Hence, let’s systematically explore each of the five items, dissecting the approaches along with their respective advantages and limitations.

25. Let’s start by examining 1) prior learning and conditioning.

26. Consider a task where we aim to predict the surface of a 3D object based on a given point cloud representing it. To achieve this, we require a suitable prior for the 3D surface we intend to reconstruct.

27. The most straightforward approach might involve crafting the prior manually using heuristics. However, to avoid such hand-tuning, we opt to learn the prior from the provided data and embed it into the model’s parameters and architecture.

28. This constitutes the ‘prior’ component in prior learning and conditioning.

29. Nevertheless, if we are to include a prior, there could be instances where we wish to incorporate one tailored to specific conditions. For instance, when working with street scene data, it could be advantageous to include different priors based on whether the observed object is a bicycle, car, motorcycle, or truck.

30. To accomplish this, we can implement ‘conditioning.’

31. We construct a latent variable space Z, and the latent variable z, derived from Z, is designed to encode a specific field. Then, the neural field is conditioned on this latent z.

32. This raises questions about how to train this latent variable z, how to infer z appropriately from our observations, and how to condition the neural field on z.

33. So, we will now explore 1) various methods for encoding the latent variable z and 2) the types of global and local conditioning techniques, along with 3) mapping functions Psi that map the latent variable z to the neural field parameter Theta.

34. Firstly, there are various options for 1) encoding the latent variable z.

35. The simplest option is likely using a feed-forward encoder. Assuming we have an encoder E and an observation O, the latent code z is obtained as z = E(O).

36. Here, E is typically a neural network, and the parameters of the encoder, pre-trained to fit the data, play a role in encoding the prior. Naturally, the decoder becomes a neural field conditioned on the latent code produced by this encoder.

37. Since only one pass through the encoder and one through the decoder are needed during inference, the processing speed is expected to be fast.

38. On the other hand, there is also the option of using an auto-decoder.

39. In this method, only the decoder is directly defined. Instead of a separately defined encoder, encoding is achieved through stochastic optimization. In other words, while observing the dataset, z = argmin_z Loss(z, Theta) is computed to optimize z by minimizing the loss.

40. Specifically, at a certain point in training, the latent code z_i is mapped to the neural field’s parameters Theta. Then, the reconstruction loss of the neural field is calculated, and backpropagation runs all the way back to z_i, allowing the neural field and the per-observation latent codes z_i to be optimized jointly.

41. During test time, when presented with a new, unseen observation O, the parameters of the neural field Theta remain fixed, and only the latent code z-hat that minimizes the reconstruction loss is optimized. In other words, at this point, the auto-decoder calculates argmax_z P(z|O) to find the ‘most plausible latent variable z for the given observation O.’

42. It’s clear that this method will likely take more time compared to the initial approach.

43. However, this method introduces no separate encoder network and avoids making arbitrary assumptions about the structure of the observation O, unlike the first method.

44. For example, if the first method assumes an encoder as a 2D CNN, it starts with the assumption that O is a 2D pixel grid. In contrast, the auto-decoder defines latent variables according to the data without such separate assumptions, making it more robust.
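
Here is a minimal sketch of what auto-decoder inference could look like, assuming a generic `decoder` module; only the latent code receives gradients while the decoder weights Theta stay frozen. The names and sizes are illustrative.

```python
import torch

# Sketch of auto-decoder inference: the decoder (the neural field) is frozen and
# only the latent code z is optimized against the new observation.
def infer_latent(decoder, observation, latent_dim=256, steps=300):
    z = torch.zeros(latent_dim, requires_grad=True)        # start from a neutral code
    opt = torch.optim.Adam([z], lr=1e-2)                    # only z is in the optimizer
    for _ in range(steps):
        recon = decoder(z)                                  # decoder parameters Theta stay fixed
        loss = (recon - observation).pow(2).mean()          # simple MSE reconstruction loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()                                       # ≈ argmin_z Loss(z, Theta)
```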

45. Now, let’s compare the feed-forward encoder method and the auto-decoder method with a diagram.

image from here

46. On the left is the feed-forward encoder method. As seen, a separate encoder E is used, and the loss of the neural field is used to update the encoder itself.

47. In contrast, the auto-decoder method on the right does not use a separate encoder. Instead, the latent variable z is jointly optimized with the neural field.

48. So far, we have covered 1) methods for encoding the latent variable z. Remember, the purpose of encoding this latent z is to condition the neural field so that it carries a prior fitted to specific conditions.

49. Now, it’s time to decide 2) the conditioning technique.

50. We will broadly divide conditioning into global and local conditioning. As the words suggest, global conditioning refers to a method where a single latent code z is shared across all coordinates of the neural field. In this case, that one z has to contain a vast amount of information.

51. However, intuitively, there are problems with this approach.

52. For example, if we receive a photo of a road as data, it would be valid to think of it not as just a road photo but as a diverse collection of various elements such as pedestrians, asphalt, cars, and motorcycles. Placing all this diversity into a single z is challenging.

53. Therefore, we choose to whisper sweet instructions directly into the neural field’s ear — a.k.a. local conditioning.

54. In local conditioning, we need to obtain not just one but a latent z suitable for each position. Thus, z is calculated as a function of coordinate x, represented as z = g(x).

55. In this case, g represents a discrete data structure expressing locality. For instance, imagine dividing 3D space into cubes. These cubes become the discrete data structure g, and each cube contains the latent code corresponding to that cube. Passing a specific point’s coordinates x through this structure would allow us to obtain the corresponding z.

56. By adopting this method, each latent code z encodes information only within a local region rather than the entire scene. There’s no need to store the configuration of the entire road anymore.

57. While this improves out-of-distribution generalization, it makes high-level control challenging since it lacks global scene information.
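
A minimal sketch of local conditioning with a regular grid of latent codes as the structure g; the resolution, latent size, and nearest-cell lookup are illustrative simplifications.

```python
import torch

# A regular voxel grid of latent codes acts as the discrete structure g,
# and z = g(x) is the code of the cell containing the query point x.
class LatentGrid(torch.nn.Module):
    def __init__(self, resolution=16, latent_dim=32):
        super().__init__()
        self.codes = torch.nn.Parameter(
            torch.randn(resolution, resolution, resolution, latent_dim) * 0.01)
        self.resolution = resolution

    def forward(self, x):                                   # x: (N, 3) in [0, 1]^3
        idx = (x * self.resolution).long().clamp(0, self.resolution - 1)
        return self.codes[idx[:, 0], idx[:, 1], idx[:, 2]]  # z = g(x), one code per query point

g = LatentGrid()
z = g(torch.rand(8, 3))                                     # (8, 32): per-point local latents
```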

58. Let’s compare global conditioning and local conditioning.

image from here

59. On the left, global conditioning involves a single z that conditions the entire neural field. On the right, local conditioning utilizes a coordinate-dependent z = g(x) to control the neural field at each point.

60. Now, having explored 1) methods for encoding the latent variable z and 2) the two conditioning types, we need to decide 3) how to map the latent variable z to the neural field parameter Theta.

61. While the specific techniques may vary across approaches, the overarching flow involves passing the latent variable z through the mapping function Psi to obtain neural network parameter Theta = Psi(z). The approach diverges based on how Theta and Psi are defined.

62. The first method is conditioning by concatenation. Here, we simply append the latent variable we obtain to the input x entering the neural field. If the neural field maps 2D inputs to 3D outputs, and the latent variable z is n-dimensional, the conditioned neural field becomes a model mapping from 2+n dimensions to 3 dimensions.

63. In this case, the mapping function Psi reduces to a simple affine map, Psi(z) = b: the latent variable z is turned into a bias b that is added into the first layer of the neural field.

64. The second method is the hypernetwork approach, where we construct the mapping function Psi as a separate neural network. Intuitively, instead of a fixed affine map, Psi is now a learned network that predicts the field’s parameters, making this the more general formulation. Ultimately, the first method is a special case of this approach.

65. The third method involves using feature-wise transformation. Here, we pass z through the neural network Psi to obtain Psi(z) = {per-layer scale gamma, bias beta}. Then, for the i-th layer of the neural field, we calculate using the formula Phi_i = gamma_i(z) * x_i + beta_i(z), incorporating the obtained gamma, beta, and the input x_i of the layer.
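
Here is a rough sketch of the three mechanisms, each reduced to a single layer; the dimensions and module names are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

latent_dim, in_dim, hidden = 64, 3, 128
z = torch.randn(1, latent_dim)            # one latent code
x = torch.rand(1024, in_dim)              # 1024 query coordinates

# 1) Conditioning by concatenation: append z to the input coordinates.
concat_layer = nn.Linear(in_dim + latent_dim, hidden)
h1 = concat_layer(torch.cat([x, z.expand(x.shape[0], -1)], dim=-1))

# 2) Hypernetwork: a network Psi predicts the layer's weights and bias from z.
psi = nn.Linear(latent_dim, hidden * in_dim + hidden)
w, b = psi(z).split([hidden * in_dim, hidden], dim=-1)
h2 = x @ w.reshape(in_dim, hidden) + b

# 3) Feature-wise transformation (FiLM): Psi predicts a per-layer scale and bias.
base_layer = nn.Linear(in_dim, hidden)
film = nn.Linear(latent_dim, 2 * hidden)
gamma, beta = film(z).chunk(2, dim=-1)
h3 = gamma * base_layer(x) + beta         # gamma_i(z) * x_i + beta_i(z)
```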

66. Let’s summarize where we are. We initially defined what a neural field is and then categorized the topics that vary in research into 1) prior learning and conditioning, 2) hybrid representations, 3) forward maps, 4) network architecture, and 5) manipulating neural fields. Currently, we are exploring the first category, 1) prior learning and conditioning.

67. Now, let’s explore 2) hybrid representations by attaching a discrete data structure to the neural field to enhance efficiency. Let’s grasp the concept with the figure below.

image from here

68. Instead of using a purely implicit formulation, we can represent a 3D structure with explicit, discrete data structures. These structures have clear advantages.

69. They are relatively less computationally intensive, use network capacity efficiently (unlike a single MLP, whose capacity must be shared across the whole scene), render quickly (empty space can simply be skipped), and are easier to manipulate and edit.

70. Studies introducing discrete data structures operate by storing neural field parameters in a data structure g and, for a specific coordinate x, querying the parameters Theta by mapping x into g.

71. However, these studies differ in 1) how they map parameter Theta to data structure g and 2) the selection of the data structure g.

72. The former is straightforward, with some using network tiling, and others using embedding. Network tiling involves laying multiple individual neural fields like tiles in the input coordinate space to cover the entire region. In this case, when given coordinate x, we can simply look up the network parameter from the data structure g.

image from here

73. On the other hand, embedding involves storing the latent variable z in the data structure. In this case, the parameters Theta of the neural field become a function of local embedding z = g(x).

image from here
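
A minimal sketch of such an embedding lookup, using a dense voxel grid of latent vectors and trilinear interpolation (a common but not the only choice); the interpolated vector is the local embedding z = g(x) that would then condition a shared neural field.

```python
import torch
import torch.nn.functional as F

# Latent vectors live in a dense voxel grid (the discrete structure g) and are
# trilinearly interpolated at the query coordinates.
latent_dim, res = 16, 32
feature_grid = torch.randn(1, latent_dim, res, res, res) * 0.01

def query_embedding(x):                       # x: (N, 3) in [-1, 1]^3
    grid = x.view(1, -1, 1, 1, 3)             # grid_sample expects a 5D sampling grid
    z = F.grid_sample(feature_grid, grid, align_corners=True)   # trilinear interpolation
    return z.view(latent_dim, -1).t()         # (N, latent_dim) local embeddings z = g(x)

z = query_embedding(torch.rand(8, 3) * 2 - 1)
```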

74. The choice of data structure for g also leads to different directions. We can broadly classify data structures for g into regular grids and irregular grids.

75. Regular grids mean dividing space into a data structure with a consistent interval. Examples include 2D pixels or 3D voxels, which divide space into regular intervals of squares or cubes.

76. Naturally, they are convenient to index and can be processed with standard operations.

77. However, regular grids face memory scaling limits as they need to divide the entire space at regular intervals, regardless of the presence of objects. This leads to the introduction of grid methods focusing more on high-frequency regions, such as adaptive grids, sparse grids, and hierarchical trees.

78. In contrast, irregular grids do not sample the domain in a regular pattern. Point clouds, meshes, and object-centric representations mentioned frequently in the context of explicit 3D representations fall into this category.

image from here

79. Returning to the image briefly, A) represents a neural sparse voxel grid, B) a multi-scale voxel grid, C) an object bounding box, D) a mesh, and E) an atlas. A) through D) cover two regular and two irregular grids, which should be enough to get the idea.

80. Thus far, we’ve unraveled the mysteries of 1) prior learning and conditioning and 2) hybrid representations from our fab five categories.

81. Now, let’s delve into 3) forward maps. Let the neural field adventure continue! 🚀🤖

82. In simple terms, a forward map is a function that maps the reconstruction domain to the sensor domain.

83. It might seem complex, but consider NeRF for simplicity. In NeRF, we receive multiple 2D images and create a 3D reconstruction of the actual 3D object corresponding to those images.

84. Here, the reconstruction domain is where 3D representation is created, and the given 2D images are the sensor domain.

image from here

85. In familiar 3D reconstruction tasks, the sensor domain contains 2D raster camera images, and the reconstruction domain contains a continuous 3D representation such as an SDF or a radiance field.

86. Applying this relationship to various 3D representations results in the figure below.

image from here

87. The paper introduces forward maps mainly in the context of rendering and physics-informed neural networks. We will explore the former here, since it comes up most frequently.

88. When reading papers related to 3D vision, the term “rendering” is almost always mentioned, and the tool that performs this task is referred to as a renderer.

89. Using the terms we’ve learned, a renderer is defined as a forward map that converts the neural field representation of a 3D object into an image.

90. The renderer takes camera intrinsic parameters, extrinsic parameters, and the neural field as input and returns an image corresponding to those camera parameters. Intrinsic parameters describe the camera’s internal imaging properties, such as focal length / field of view and lens distortion, while extrinsic parameters describe the camera’s pose in the world, i.e., its rotation and translation. For detailed explanations, please refer to the 3D Gaussian Splatting paper review.

91. Renderers typically use a raytracer, a tool that takes ray origin and direction and returns information about the neural field. In NeRF, for example, the model receives 5D coordinates (x, y, z, theta, phi) as input and outputs the color and density of the corresponding point.
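
As an illustration of how camera parameters turn into rays, here is a small sketch assuming an ideal pinhole camera with no distortion and a NeRF-style camera convention; the function name and conventions are my own, not an API from any library.

```python
import numpy as np

# Intrinsics map pixels to directions in the camera frame; extrinsics
# (rotation R, translation t) move them into the world frame.
def pixel_rays(H, W, focal, R, t):
    i, j = np.meshgrid(np.arange(W), np.arange(H))            # pixel coordinates
    dirs_cam = np.stack([(i - W / 2) / focal,                  # x
                         -(j - H / 2) / focal,                 # y (image y points down)
                         -np.ones_like(i, dtype=float)], -1)   # z (camera looks down -z)
    dirs_world = dirs_cam @ R.T                                # rotate into world frame
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    origins = np.broadcast_to(t, dirs_world.shape)             # all rays start at the camera center
    return origins, dirs_world

origins, directions = pixel_rays(400, 400, focal=300.0, R=np.eye(3), t=np.zeros(3))
```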

92. The information returned by the raytracer, such as surface normal or intersection depth, can be geometric or other arbitrary features. For illustration, we will briefly explore ray-surface interaction as an example of the former and volume rendering as an example of the latter.

93. Let’s briefly discuss two often-used 3D representations: occupancy maps and signed distance fields, both of which are continuous scalar fields (implicit surface representations) rather than grids. Below is an example of a 3D occupancy map…

image from here

94. And here is an example of a signed distance field.

image from here

95. In a previous review of the “What You GAN” paper, we briefly mentioned and compared these representations, as shown in the figure below.

image from here

96. Recall that, given a 3D object, we can assign to each point in space the shortest distance to the object’s surface. Conventionally, this value is taken as positive outside the object and negative inside (the opposite sign convention also appears), which is what makes the distance ‘signed’.

97. Consequently, armed with this knowledge alone, we can pinpoint the location of a 3D object’s surface (the zero-level set of the field) and effortlessly derive its representation. This, in turn, allows us to acquire a signed distance field that characterizes a particular 3D object.

98. Now, let’s assume our 3D object is given in this representation. How can we obtain geometric information like surface normal or intersection point from this representation?

99. One simple method is utilizing the technique of ray marching.

100. Ray marching involves advancing (marching) rays from the camera through the screen pixels into the scene, and rendering the object surface where a ray hits it.

image from here

101. Let’s observe the process.

image from here

102. In the given scenario, ray origin and direction are set, and two objects, SDF1 and SDF2, are defined. The task is to advance the ray until it meets an object.

image from here

103. Initially, we calculate the signed distance function for each object at the ray origin, determining the shortest distance from the origin to each object’s surface. In the figure, these distances are d1 and d2.

image from here

104. The ray then advances along its direction by the smaller of the two distances; since SDF1 is closer, it moves forward by d1 and reaches P1.

image from here

105. The process is repeated from the new position P1.

image from here

106. Each movement is termed a step, and the distance moved accumulates.

image from here

107. The process continues until the ray touches the object surface, as depicted in the figure.

image from here

108. However, if the ray never reaches an object surface and instead hits the limit on steps or travel distance, the marching ends without finding the object.

109. Performing ray marching provides the distance from the ray origin to the object, allowing us to calculate the spatial position of the object surface using the formula R0 + RD * d.
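
Here is a minimal sphere-tracing sketch of the procedure above, for a single ray against a single analytic SDF (a real scene would take the minimum over all objects’ SDFs at each step); the sphere and numbers are dummies for illustration.

```python
import numpy as np

def sphere_sdf(p, center=np.array([0.0, 0.0, 3.0]), radius=1.0):
    return np.linalg.norm(p - center) - radius

def ray_march(origin, direction, sdf, max_steps=64, eps=1e-4, max_dist=100.0):
    t = 0.0
    for _ in range(max_steps):
        p = origin + t * direction          # current point: R0 + RD * t
        d = sdf(p)                          # shortest distance to the nearest surface
        if d < eps:                         # close enough: we hit the surface
            return p, t
        t += d                              # safe to march forward by exactly d
        if t > max_dist:                    # gave up: the ray left the scene
            break
    return None, None

hit, dist = ray_march(np.zeros(3), np.array([0.0, 0.0, 1.0]), sphere_sdf)
print(hit, dist)                            # ≈ [0, 0, 2], distance ≈ 2
```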

110. However, there is one issue. This surface rendering technique does not work well for inhomogeneous media like clouds or fog, which have no well-defined surface to hit.

111. Since the gradient is defined only on the object’s surface, it struggles with ambiguous surfaces or when modeling thin, high-frequency details like hair.

112. So, let’s give a standing ovation to volume rendering as it takes center stage! 🌟🎥

113. Volume rendering involves shooting rays through the scene, drawing samples along each ray, and approximating the volume rendering integral with a weighted sum over those samples. This concept is straightforward if you think about NeRF rendering.

114. Recall the summary of NeRF from the 3D Gaussian Splatting paper review for a moment.

115. NeRF involves 1) constructing training data with N 2D images and corresponding camera position information, 2) using a fully-connected neural network that takes 5D coordinates (x, y, z, theta, phi) as input and outputs (color, density) for each point, 3) sampling 5D coordinates along the desired viewpoint’s rays using hierarchical sampling, then passing these points through the NN to obtain color and density values, generating a new 2D image with volume rendering, and 4) optimizing the NN by calculating the loss against ground truth.

image from here

116. Here, the part “sampling 5D coordinates along the desired viewpoint’s rays using hierarchical sampling, then passing these points through the NN to obtain color and density values, generating a new 2D image with volume rendering” precisely corresponds to volume rendering. If we refer to the presentation linked in the same article, the relevant section reads as follows:

image from here

117. Looking at (b) in the figure, you can see that bead-like samples are drawn along the rays, and then, for each of these samples, the value given by the formula on the right is accumulated to calculate the color C(r) of the corresponding pixel.

118. An explanation of the formula has been covered in a previous review of the What You GAN paper. In case that article flew under your radar, here’s a sneak peek for my fellow formula enthusiasts!📚

119. The actual formula used for volumetric rendering (in the What You GAN paper) is shown below.

image from here

120. It seems familiar, as we’ve encountered a similar formula in NeRF.

image from here

121. Looking at our rendering equation, it appears quite similar.

image from here

122. Here, δ represents the interval between samples along the ray. So, the expression for T_i in the second line implies that 1) the more transparent the points in front are (i.e., the smaller their density σ), and 2) the closer together the neighboring samples are (i.e., the smaller δ), the larger the accumulated transmittance T_i of that point.

123. Simply put, the larger the transmittance of the i-th point, the less obstructed it is by objects in front, making it more visible in the rendered image.

124. That’s why I labeled this term in Num 126 as ‘contribution weight.’ The larger the T_i value, the more visible the i-th point will be, exerting greater influence in the rendered image.

125. Let’s now explore the rest of the first line. It multiplies T_i by (1 − exp(−σ_i δ_i)).

126. In fact, this is reasonable. T_i signifies how unobstructed (and therefore visible) the point is, while this new term conveys how much the point itself absorbs along its interval: the denser the point (the larger σ_i), the larger this factor. Multiplying the two expresses how much the point actually contributes to the pixel. Hence, I’ve labeled this term in Num 126 as ‘volume possession’, representing how much of the rendered value the point claims for itself.

127. To summarize, a point’s influence in the rendered image is determined by a combination of 1) how little the points in front of it block the ray and 2) how much the point itself absorbs, i.e., how dense it is.

image from here
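
Putting points 122 to 127 together, here is a minimal sketch of the discrete volume-rendering sum for one ray; the sample counts and values are dummies for illustration.

```python
import torch

# Given per-sample densities sigma, colors, and spacings delta along one ray,
# compute transmittance T_i, the weights T_i * (1 - exp(-sigma_i * delta_i)),
# and the composited pixel color C(r).
def composite(sigma, rgb, delta):
    alpha = 1.0 - torch.exp(-sigma * delta)                  # "volume possession" of each sample
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)  # T_i
    weights = trans * alpha                                  # contribution of each sample
    return (weights[:, None] * rgb).sum(dim=0)               # C(r) = sum_i w_i * c_i

sigma = torch.rand(64)           # densities along one ray
rgb   = torch.rand(64, 3)        # colors along the same ray
delta = torch.full((64,), 0.05)  # sample spacing
pixel = composite(sigma, rgb, delta)
```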

128. Let’s revisit our journey. We’ve defined what a neural field is, and the topics that vary across studies are divided into five: 1) prior learning and conditioning, 2) hybrid representations, 3) forward maps, 4) network architecture, and 5) manipulating neural fields. We are currently exploring the third one, 3) forward maps. This is where the familiar rendering process comes in, carried out with either surface rendering or volume rendering.

129. Although there are hybrids and various other forward maps, we’ll breeze past them for now.

130. Now, it’s time to look at 4) network architecture.

131. We are training a relatively simple neural network to learn a very complex real-world signal. Left to itself, the network tends to be biased toward low spatial frequencies (the so-called spectral bias).

132. The most common method to address this is to use positional encoding.

133. Once more, let’s shine the spotlight on NeRF.

image from here

134. The network of NeRF looks like the illustration above. If you look closely, the input coordinate x doesn’t enter the MLP layers directly; instead, it goes through the positional encoding indicated by the red arrow and is transformed into gamma(x). The formula, built from sines and cosines, is shown in the topmost green box.

image from here

135. The effect is that the blurry lion on the left turns into the sharp lion picture on the right. In other words, by passing the input coordinates through sinusoidal functions, we lift them into a higher-dimensional space; their raw dimensionality is too low to capture high-fidelity details, and the encoding lets the network express high-frequency signals.
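
A minimal sketch of this sinusoidal encoding, following the sin/cos formula above with L frequency bands (L = 10 gives the 60-dimensional encoding NeRF uses for positions); the function name is my own.

```python
import math
import torch

def positional_encoding(x, L=10):                            # x: (N, D)
    freqs = (2.0 ** torch.arange(L, dtype=torch.float32)) * math.pi   # 2^0·pi, ..., 2^(L-1)·pi
    angles = x[..., None] * freqs                             # (N, D, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)   # (N, D, 2L)
    return enc.flatten(start_dim=-2)                          # (N, D * 2L)

gamma_x = positional_encoding(torch.rand(4, 3))
print(gamma_x.shape)                                          # torch.Size([4, 60])
```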

136. Besides using positional encoding, another method is choosing the right activation function. One notable example is SIREN, proposed in the paper “Implicit Neural Representations with Periodic Activation Functions” (Sitzmann et al., 2020).

image from here

137. We’ll delve into the specifics later; for now, let’s say it’s a model structure that repeats linear transformation and sine activation for the input x.

image from here

138. As the signal passes through the model layers like this, the activation statistics alternate between the second and third distributions shown, starting from the input distribution at the top. By initializing the weights from an appropriately scaled uniform distribution, SIREN keeps the pre-activations standard-normal distributed and the activations arcsine distributed, and this pattern repeats from layer to layer.
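
A minimal sketch of a SIREN-style layer with the uniform initialization described above; omega_0 = 30 follows the paper’s default, everything else is an illustrative choice.

```python
import math
import torch
import torch.nn as nn

# A linear map followed by sin(omega_0 * z), with the SIREN weight initialization.
class SineLayer(nn.Module):
    def __init__(self, in_dim, out_dim, omega_0=30.0, is_first=False):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_dim, out_dim)
        with torch.no_grad():
            if is_first:
                bound = 1.0 / in_dim                       # first layer: U(-1/n, 1/n)
            else:
                bound = math.sqrt(6.0 / in_dim) / omega_0  # hidden layers: U(-sqrt(6/n)/omega_0, ...)
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

siren = nn.Sequential(SineLayer(3, 256, is_first=True), SineLayer(256, 256), nn.Linear(256, 1))
out = siren(torch.rand(16, 3) * 2 - 1)      # scalar (e.g. SDF-like) output for 16 points
```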

139. If you’re interested in more details, it might be helpful to refer to the explanation in this video. At this point, let’s simply acknowledge that, among the numerous techniques for capturing high-frequency details, one method includes adding positional encoding to coordinates, and the other involves changing the activation function.

140. We’re just a whisker away from wrapping up today’s post.😊 Finally, let’s look at the concepts related to 5) manipulating neural fields.

141. For example, when representing a 3D object using a polygonal mesh, we can perform smoothing. As shown in the figure below, it means post-processing the already created mesh.

image from here

142. However, when it comes to employing a neural field, it’s not immediately evident how we can fine-tune a finished neural field to meet our specific requirements.

143. Fortunately, there are several research avenues delving into this, and we’ll offer a brief introduction without delving too deep.

144. The simplest method is to edit the input coordinates themselves. Researched techniques include using structural priors such as object bounding boxes to transform each object, using a kinematic chain to control an entire object through its joint angles, or warping the observation space into a canonical space. Alternatively, for a more general transformation, we can model the spatial transform itself with a neural field, continuously deforming the field to obtain the desired target geometry. Ongoing research looks at smoothness, sparsity, cycle consistency, auxiliary image-space losses, and more.
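
As a tiny sketch of the coordinate-editing idea, one can place a learned warp in front of a frozen canonical field; both modules below are illustrative stand-ins, not a specific published architecture.

```python
import torch
import torch.nn as nn

# A learned warp maps observation-space coordinates into the canonical space
# of a pre-trained field; editing the warp edits the rendered geometry.
warp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 3))            # predicts an offset
canonical_field = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1)) # frozen field

def warped_field(x):
    x_canonical = x + warp(x)          # deform the query point, not the field itself
    return canonical_field(x_canonical)

value = warped_field(torch.rand(8, 3))
```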

145. Another method is to manipulate not the input coordinates themselves but the latent features or model weights directly. This would be more task-agnostic. Various techniques exist, including latent code interpolation/swapping, latent code/network parameters fine-tuning, and editing via hypernetworks.

146. We’ve taken a swift stroll through the fundamental concepts tied to neural fields so far. This is just a sneak peek into Part 1 of our paper — a mere fraction. Keep in mind, there’s a treasure trove of details left unexplored, not to mention the enticing Part 2, where we unravel the diverse applications of neural fields across various domains. Rather than cramming it all into one sitting, I’d suggest a leisurely read of the paper, letting those references guide your exploration.

147. Armed with this foundation, we’ll continue our journey, delving into the realms of 3D computer vision in future posts. 🚀 Stay tuned for more exciting papers!

This concludes the review, with the current content covering up to number 147. I may add or edit information in the future to provide updates. This post is primarily for my personal understanding of the topic, and I kindly acknowledge that there may be errors or inaccuracies present. If you come across any, please feel free to bring them to my attention. Thank you for taking the time to read this, and congratulations on making it all the way to the end.🎉🎉
