[Paper Review] Neural Radiance Fields (NeRF)

Soyeon Park
Published in KLleon
Apr 1, 2022 · 7 min read

Introduction

NeRF is a deep-learning-based 3D modeling technique that learns from images of an object captured at multiple angles and synthesizes images of the object from new angles. The simplest way to understand NeRF is to recognize that the image of an object you look at appears to be two-dimensional information, but it actually contains three-dimensional depth information.

If you look at the glass bottle in the image above, you can see that the color of the picture inside the glass bottle, the color of the glass bottle itself, and the background color are all reflected in what you observe.

In other words, the image of an object you see is overlapped information from various depths.

In addition, the final image reflects not only the color of the glass bottle but also its transparency. Because the glass bottle is transparent (low density), the picture inside the bottle and the background color also appear in the final image.

The same goes for NeRF: it renders the final image based on the color and density information at each depth along the direction in which the object is viewed. To do this, NeRF takes as input the 3D coordinates of each depth point of an object together with the viewing direction, and outputs color and density information.

In other words, NeRF is a model that takes a five-dimensional input (x, y, z, θ, φ) and produces a four-dimensional output (r, g, b, σ).

Here, (x, y, z) are the coordinates of a point of the object in three-dimensional space, and θ and φ describe the direction from which the observer views the point (x, y, z). (r, g, b) is the color of the point (x, y, z) to be computed, and σ is the density at that point.

Method

Overall Structure
The network NeRF uses has a very simple structure built from fully-connected layers: the position x = (x, y, z) and the viewing direction d = (θ, φ) are the network inputs (the light green part of the figure below), and the color RGB and the density σ are the network outputs (the red part of the figure below).

In addition, the γ applied to the network inputs (the light green part of the figure below) is positional encoding, which is described in detail further below.
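
To make the structure concrete, here is a minimal PyTorch sketch based on the paper's description (8 fully-connected layers of width 256, a skip connection that re-injects γ(x) at the fifth layer, a density head, and a narrower color branch). The class name is my own, and pos_dim=60 / dir_dim=24 come from the positional encoding sizes discussed below; this is an illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    """Sketch of the NeRF network: encoded position and direction -> (RGB, sigma)."""
    def __init__(self, pos_dim=60, dir_dim=24, width=256):
        super().__init__()
        # First 4 layers process the encoded position gamma(x).
        self.block1 = nn.Sequential(
            nn.Linear(pos_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        # gamma(x) is concatenated back in at the 5th layer (skip connection).
        self.block2 = nn.Sequential(
            nn.Linear(width + pos_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(width, 1)   # density depends on position only
        self.feature = nn.Linear(width, width)
        # Color additionally depends on the encoded viewing direction gamma(d).
        self.rgb_head = nn.Sequential(
            nn.Linear(width + dir_dim, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3), nn.Sigmoid(),
        )

    def forward(self, pos_enc, dir_enc):
        h = self.block1(pos_enc)
        h = self.block2(torch.cat([h, pos_enc], dim=-1))
        sigma = torch.relu(self.sigma_head(h))  # density must be non-negative
        rgb = self.rgb_head(torch.cat([self.feature(h), dir_enc], dim=-1))
        return rgb, sigma
```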

Volume Rendering
The method of rendering an image from the NeRF outputs (RGB, σ) described above is called volume rendering. A ray cast from the observer toward a point on the object can be expressed as r(t) = o + td, where o is the observer's position, d is the direction in which the ray travels, and t is the distance the ray has traveled. The RGB value, the color of a point on the object seen along this ray, is then calculated as follows.
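
From the paper, the expected color of the ray is:

$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt, \qquad T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)$$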

In the above formula, T computes the transmittance from the density, and discrete samples between t_n and t_f are used to approximate the integral. The samples are drawn as in the equation below: instead of sampling at fixed intervals, one sample is drawn uniformly at random within each of N evenly spaced bins (stratified sampling).
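
$$t_i \sim \mathcal{U}\!\left[\,t_n + \frac{i-1}{N}(t_f - t_n),\; t_n + \frac{i}{N}(t_f - t_n)\,\right]$$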

Using the sampling above to convert the integral into a Σ summation gives the following.
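
$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i, \qquad T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right), \qquad \delta_i = t_{i+1} - t_i$$

This sum maps directly to a few lines of code. Below is a minimal PyTorch sketch; the function name and the large constant standing in for an infinite final interval are my own implementation conventions, not from the paper.

```python
import torch

def render_ray(rgb, sigma, t_vals):
    """Accumulate color along one ray with the discrete sum above.

    rgb:    (N, 3) colors c_i at the N sample points
    sigma:  (N,)   densities sigma_i
    t_vals: (N,)   sample distances t_i along the ray
    """
    delta = t_vals[1:] - t_vals[:-1]                      # delta_i = t_{i+1} - t_i
    delta = torch.cat([delta, torch.tensor([1e10])])      # treat the last interval as infinite
    alpha = 1.0 - torch.exp(-sigma * delta)               # opacity of each segment
    # T_i = prod_{j<i} (1 - alpha_j) = exp(-sum_{j<i} sigma_j * delta_j)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = trans * alpha                               # w_i, reused later for hierarchical sampling
    color = (weights.unsqueeze(-1) * rgb).sum(dim=0)      # hat{C}(r)
    return color, weights
```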

Optimizing NeRF
In order to improve the performance of NeRF, the paper proposes two methods.

The first method is Positional Encoding, and the second method is Hierarchical Volume Sampling.

  • Positional Encoding

On the Spectral Bias of Neural Networks

According to the above paper, deep network training focuses on low-frequency information. To learn high-frequency information well, the network inputs need to be mapped to a higher-dimensional representation using high-frequency functions.

For example, in a human face image, low-frequency information refers to coarse, low-resolution structure such as the face shape, the positions of the eyes and nose, and the hairstyle, while high-frequency information refers to high-resolution detail such as wrinkles, freckles, and the fine structure of the eyes.

Therefore, NeRF uses positional encoding as a high-frequency function to learn high-frequency information well. If the positional encoding function is γ, it is expressed as follows.
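
$$\gamma(p) = \left(\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p)\right)$$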

Therefore, positional encoding converts one-dimensional information p into 2L-dimensional information. The Transformer also uses positional encoding, but for a different purpose than NeRF: the Transformer uses it to inject time-series (order) information, whereas NeRF uses it to map the input into a higher-dimensional domain.
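
As a concrete illustration, here is a minimal PyTorch sketch of γ (the function name is my own; note the paper interleaves sin/cos per frequency, while this version groups all sines before all cosines per coordinate, which carries the same information):

```python
import math
import torch

def positional_encoding(p, L):
    """gamma(p): maps each input coordinate to 2L sinusoidal features.

    p: tensor of shape (..., dim), e.g. dim=3 for positions.
    Returns a tensor of shape (..., dim * 2L).
    """
    freqs = (2.0 ** torch.arange(L)) * math.pi           # 2^0*pi, ..., 2^(L-1)*pi
    angles = p.unsqueeze(-1) * freqs                     # (..., dim, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (..., dim, 2L)
    return enc.flatten(-2)                               # (..., dim * 2L)
```

NeRF applies this with L = 10 to each coordinate of the position x and L = 4 to each coordinate of the viewing direction d, which is where the input sizes 60 and 24 in the network sketch above come from.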

  • Hierarchical Volume Sampling

You have to learn a lot of information in the space where objects exist, but not much in the empty space where nothing exists. NeRF therefore uses hierarchical volume sampling to make computation more efficient, training two networks: a coarse network and a fine network. To train the coarse network, NeRF extracts N_c samples per ray using the stratified sampling method described above. With these N_c samples, the volume rendering sum is rewritten in terms of per-sample weights w_i, and the normalized weights ŵ_i are treated as a probability distribution function (PDF). Each w_i can be interpreted as a weight describing how much the color of the corresponding point along the ray contributes to the final color, as written out below.
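
From the paper, the coarse color, the weights, and their normalization are:

$$\hat{C}_c(\mathbf{r}) = \sum_{i=1}^{N_c} w_i\,c_i, \qquad w_i = T_i \left(1 - e^{-\sigma_i \delta_i}\right), \qquad \hat{w}_i = \frac{w_i}{\sum_{j=1}^{N_c} w_j}$$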

As shown in Figure (c) on the left, the weight reflecting color is high where the object is located and low where it is not.

This information tells us where to sample more densely when training the fine network.

This is effective because the fine network uses both the N_c samples from coarse network training and N_f additional samples drawn from the color-weight PDF, concentrating sampling and learning where objects are actually located.
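
A minimal sketch of this resampling step in PyTorch, assuming the coarse weights define a piecewise-constant PDF over intervals along the ray (the function and variable names are my own):

```python
import torch

def sample_pdf(bins, weights, n_fine):
    """Draw N_f extra sample locations via inverse transform sampling.

    bins:    (N_c + 1,) edges of the coarse sample intervals along the ray
    weights: (N_c,)     coarse weights w_i, one per interval
    """
    pdf = weights / (weights.sum() + 1e-10)               # normalize: hat{w}_i
    cdf = torch.cat([torch.zeros(1), torch.cumsum(pdf, dim=0)])
    u = torch.rand(n_fine)                                # uniform draws in [0, 1)
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, len(bins) - 1)
    # Linearly interpolate within the interval each u falls into.
    cdf_lo, cdf_hi = cdf[idx - 1], cdf[idx]
    frac = (u - cdf_lo) / (cdf_hi - cdf_lo).clamp(min=1e-10)
    return bins[idx - 1] + frac * (bins[idx] - bins[idx - 1])
```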

  • Implementation details

The data needed to train NeRF are RGB images, the camera pose for each photograph, intrinsic parameters, and scene bounds. (NeRF uses the COLMAP package, which builds 3D data from photos, to obtain this information.)

The loss function used for training is as follows.
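
$$\mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}} \left[ \left\lVert \hat{C}_c(\mathbf{r}) - C(\mathbf{r}) \right\rVert_2^2 + \left\lVert \hat{C}_f(\mathbf{r}) - C(\mathbf{r}) \right\rVert_2^2 \right]$$

Here, R is the set of rays in a batch, C(r) is the ground-truth color, and Ĉ_c(r) and Ĉ_f(r) are the colors rendered by the coarse and fine networks.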

In other words, the loss compares the RGB colors rendered by the coarse and fine networks with the original RGB colors. In the actual experiments, each batch contained 4096 rays, and each ray was sampled with N_c = 64 and N_f = 128 points. Training takes 1–2 days on an NVIDIA V100 GPU.

Results

Dataset

  1. DeepVoxels dataset : consists of four Lambertian objects with simple geometric structures. Each object is rendered at 512×512 pixels from viewpoints on the upper hemisphere.
  2. Complex real-world scenes : consists of eight scenes captured with a cell phone. Each scene contains 20 to 62 images at 1008×756 pixels.

Experimental Results

As the experimental results show, NeRF performs much better than the other techniques. The paper contains more qualitative comparisons, but I won't cover them in this article.

Review

KLleon also utilizes NeRF in various fields such as 3D face generation and body shape generation. However, we found that NeRF takes a long time to render, and while it renders static objects excellently, its performance on moving objects is not yet high. Further studies are underway to solve these problems, and follow-up papers proposing solutions have also been published. In the next post, we will review Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis, which studies few-shot learning for NeRF.
