INSIDE THE LAB: A Deep Dive Into The NeRF Architecture at artlabs
In our previous post about Neural Radiance Fields (NeRF), we introduced our pipeline here at artlabs. Now, it is time for a deep dive into what NeRFs are and how their key components work.
NeRFs are neural networks that imagine a scene through a multitude of cameras. During training, the model learns how each camera's pixels should look, and it can then show us how the scene looks from a given viewpoint, even one it has never seen before.
How do NeRF models do this? They reconstruct the scene by shooting rays from each camera and using differentiable rendering to calculate the color of each pixel. The model then learns to optimize over all rays (and hence the scene) by comparing a rendered image with the original.
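To make the ray-shooting step concrete, here is a minimal NumPy sketch of generating one ray per pixel for a simple pinhole camera. The function name, the OpenGL-style camera convention, and the `cam_to_world` matrix layout are illustrative assumptions, not the exact conventions of our pipeline:

```python
import numpy as np

def get_rays(height, width, focal, cam_to_world):
    """Generate one (origin, direction) ray per pixel of a pinhole camera.

    cam_to_world is a 4x4 camera-to-world matrix; conventions here are
    illustrative (OpenGL-style: camera looks down -z).
    """
    i, j = np.meshgrid(np.arange(width, dtype=np.float32),
                       np.arange(height, dtype=np.float32), indexing="xy")
    # Per-pixel directions in camera space.
    dirs = np.stack([(i - width * 0.5) / focal,
                     -(j - height * 0.5) / focal,
                     -np.ones_like(i)], axis=-1)
    # Rotate directions into world space; all rays share the camera origin.
    rays_d = dirs @ cam_to_world[:3, :3].T
    rays_o = np.broadcast_to(cam_to_world[:3, 3], rays_d.shape)
    return rays_o, rays_d

origins, directions = get_rays(4, 4, focal=2.0, cam_to_world=np.eye(4))
# origins.shape == directions.shape == (4, 4, 3)
```

Each of these rays is then sampled at many points, and it is those sample points that the rest of the pipeline consumes.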
However, the original NeRF struggles on several fronts during this optimization. It is very slow (learning the representation of a single scene can take more than a day); it cannot handle dynamic scenes or changing lighting (the lighting is baked into the scene); it does not generalize to other scenes; and extracting geometry from it is difficult, which makes rendering the scene on a low-performance device nearly impossible. These shortcomings opened up many research questions, and researchers have since introduced many (and we mean MANY) variants that address them. We will not go into the specifics of these variants, as they are beyond the scope of this blog post, but we will mention some of them along the way.
Generally, NeRF models are composed of three key parts. The first is positional encoding, in which the 3D world coordinates of the points sampled along each camera ray are mapped into a higher-dimensional space that better captures the features of the scene. Transformer models are a particular source of inspiration here: the coordinates are represented with sinusoidal functions of increasing frequency. While the lower-frequency sine functions correspond to broader features in the image, the higher-frequency functions capture the finer details of the scene.
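As a rough illustration, this frequency encoding can be sketched in a few lines of NumPy. The exact scaling varies between implementations (the original NeRF paper includes a factor of π in the frequencies); this version is a simplified sketch:

```python
import numpy as np

def positional_encoding(x, num_freqs=10):
    """Map coordinates x to [x, sin(2^k x), cos(2^k x)] for k = 0..num_freqs-1."""
    feats = [x]
    for k in range(num_freqs):
        feats.append(np.sin(2.0 ** k * x))  # low k: broad structure
        feats.append(np.cos(2.0 ** k * x))  # high k: fine detail
    return np.concatenate(feats, axis=-1)

pt = np.array([0.5, -0.2, 1.0])           # one 3D sample point
encoded = positional_encoding(pt, num_freqs=10)
# 3 raw coords + 10 frequencies * (sin + cos) * 3 dims = 63 features
```

Without this lifting step, a plain MLP fed raw coordinates tends to produce blurry reconstructions, because low-dimensional inputs bias it toward low-frequency functions.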
Scene approximation is the second key part. Whether it is done with a multi-layer perceptron or with sparse voxels combined with spherical harmonics (first introduced by Plenoxels), every neural radiance field uses a model to approximate the color and density of the scene in 3D. This is also the most computationally expensive part of a radiance field: during inference, the model is queried many times to output color and density information at various points along each ray.
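To show the shape of this component, here is a toy MLP-based field in NumPy: it maps a batch of encoded sample points to RGB color and density. The two-layer network and random weights are purely illustrative; a real NeRF MLP is deeper (typically 8 layers with a skip connection) and is trained, not randomly initialized:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(in_dim, hidden=64):
    # Toy two-layer perceptron standing in for the full NeRF MLP.
    return {"w1": rng.normal(0, 0.1, (in_dim, hidden)),
            "w2": rng.normal(0, 0.1, (hidden, 4))}

def query_field(params, encoded_points):
    h = np.maximum(encoded_points @ params["w1"], 0.0)  # ReLU hidden layer
    out = h @ params["w2"]
    rgb = 1.0 / (1.0 + np.exp(-out[..., :3]))           # sigmoid keeps color in [0, 1]
    sigma = np.maximum(out[..., 3], 0.0)                # density must be non-negative
    return rgb, sigma

params = init_mlp(63)                      # 63 = encoded-point dimension
encoded = rng.normal(size=(8, 63))         # batch of 8 encoded sample points
rgb, sigma = query_field(params, encoded)
```

The cost noted above comes from scale: rendering a single image means evaluating this function at dozens of samples per ray, for every pixel.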
Finally, we have differentiable rendering — which is, in a sense, the heart of the model. There are quite a number of variations here too. The original NeRF model uses a ray-based integral that combines density and color in a single neat formula, whereas variants like NeuS put the signed distance function in the spotlight and render color using a separate network. Other variants like mip-NeRF replace individual rays altogether, casting a cone from the camera through each pixel to predict its color.
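The original NeRF's ray integral can be sketched numerically as alpha compositing along one ray: each sample's density is turned into an opacity, attenuated by the transmittance accumulated in front of it, and the weighted colors are summed. This is a minimal sketch of that quadrature, not a full renderer:

```python
import numpy as np

def composite(rgb, sigma, t_vals):
    """Numerically integrate color along one ray (NeRF volume rendering).

    rgb: (N, 3) sample colors, sigma: (N,) densities, t_vals: (N,) sample depths.
    """
    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)  # spacing between samples
    alpha = 1.0 - np.exp(-sigma * deltas)               # opacity of each segment
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(axis=0)

rgb = np.array([[1.0, 0.0, 0.0],   # dense red sample nearest the camera
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
sigma = np.array([100.0, 0.0, 0.0])
pixel = composite(rgb, sigma, np.array([0.0, 1.0, 2.0]))
# the dense first sample occludes the rest, so pixel is close to pure red
```

Because every operation here is differentiable, the photometric loss against the ground-truth pixel can be backpropagated all the way into the scene model — which is what makes the whole pipeline trainable end to end.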
At artlabs, we integrate all three key components with an emphasis on high speed and high quality. In light of new developments in the neural graphics space, we look at NeRF models as a whole, and we evaluate each new variant along these three key components to understand it and apply it to our technology.
Author: Ahmet Balcıoğlu, AI Engineer at artlabs