What You Need to Know About Neural Radiance Fields (NeRF)
Capturing views of the world with a camera and using rendering techniques to synthesise novel views of a scene has long been a fascinating way to reproduce visual perception. Thanks to NeRF, neural networks can now produce new views of a 3D scene with a quality that deceives the human eye.
Introduction to NeRF
NeRF was introduced in an ECCV 2020 paper and has since become a popular way to model scenes and render them from novel views, thanks to the high quality of its reconstructions.
In short, NeRF is a generative model conditioned on a set of images together with their precise camera positions and orientations. It learns from this image data without any convolutional or transformer layers, and it represents the scene's 3D shape and appearance as a continuous function.
The key idea of NeRF is to make a neural network learn from a 5D input (a 3D position and a 2D viewing direction) and output a density σ and a colour value c = (R, G, B). NeRF is therefore a function of the kind F: (p, d) -> (c, σ).
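Spelling out the coordinates, the mapping learned by the network can be written as in the original paper:

```latex
F_\Theta \colon (x, y, z, \theta, \phi) \;\longrightarrow\; (R, G, B, \sigma)
```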
NeRF pipeline
It may not yet be obvious how a novel view image is actually obtained. NeRF relies on the concept of a light field: a vector function that describes the amount of light flowing through each point in space, in each direction. In other words, it models the light rays travelling through every coordinate p = (x, y, z) in space, in every direction d, where d is represented either as a pair of angles (θ, ϕ) or as a unit vector.
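In practice, each pixel of the target view defines a camera ray with origin o and unit direction d, and the points fed to the network are sampled along that ray:

```latex
\mathbf{r}(t) = \mathbf{o} + t\,\mathbf{d}, \qquad t \in [t_{\mathrm{near}},\, t_{\mathrm{far}}]
```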
The NeRF model has eight layers, each with a feature dimension of 256.
First, this "vanilla" model applies a positional encoding to its inputs: each coordinate is mapped to a set of sine and cosine functions of increasing frequency, so that nearby points receive distinct, high-frequency representations. This allows the model to render sharper details.
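A minimal sketch of such a frequency encoding, assuming NumPy and the paper's default of 10 frequency bands for positions (the exact ordering of the terms is an implementation choice):

```python
import numpy as np

def positional_encoding(x, num_freqs=10):
    """Map each coordinate to sin/cos features of increasing frequency.

    x: array of shape (..., D), e.g. D = 3 for a point (x, y, z).
    Returns an array of shape (..., D * 2 * num_freqs).
    """
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi        # 2^k * pi, k = 0..L-1
    scaled = x[..., None] * freqs                        # (..., D, num_freqs)
    features = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return features.reshape(*x.shape[:-1], -1)
```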
Second, the MLP (multilayer perceptron) transforms the encoded 3D points and returns an RGBA-like value: a colour plus a density that plays the role of the alpha channel. This alpha value is used to weight each RGB sample and indicates how opaque that region of space is.
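As a rough sketch of that network in PyTorch (the layer count and width follow the description above; the skip connection and view-direction branch of the original architecture are omitted for brevity):

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Simplified NeRF MLP: encoded position -> (colour, density)."""

    def __init__(self, pos_dim=60, hidden=256, depth=8):
        super().__init__()
        layers, in_dim = [], pos_dim
        for _ in range(depth):
            layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
            in_dim = hidden
        self.trunk = nn.Sequential(*layers)
        self.sigma_head = nn.Linear(hidden, 1)   # volume density (the "alpha" part)
        self.rgb_head = nn.Linear(hidden, 3)     # colour

    def forward(self, encoded_pos):
        h = self.trunk(encoded_pos)
        sigma = torch.relu(self.sigma_head(h))   # density must be non-negative
        rgb = torch.sigmoid(self.rgb_head(h))    # colours in [0, 1]
        return rgb, sigma
```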
The model then samples this space with a stratified sampling strategy: each camera ray is split into evenly spaced bins and one depth is drawn at random within each bin, conditioning the network to learn over a continuous space rather than a fixed grid of points.
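A sketch of that sampling step, assuming NumPy and known near/far bounds for the ray:

```python
import numpy as np

def stratified_samples(t_near, t_far, num_samples, rng=None):
    """Split [t_near, t_far] into even bins and draw one random depth per bin."""
    rng = rng or np.random.default_rng()
    edges = np.linspace(t_near, t_far, num_samples + 1)
    lower, upper = edges[:-1], edges[1:]
    return lower + (upper - lower) * rng.random(num_samples)
```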
As a final step, hierarchical volume sampling refines the details so that thin and complicated structures, such as meshes and branches, can be reproduced: the densities from a first, coarse pass define a probability distribution along each ray, and inverse transform sampling is then used to draw a second set of samples concentrated where the visible content is likely to be.
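A simplified sketch of that second sampling pass (full implementations, such as the `sample_pdf` helper in the original code, also interpolate within bins and handle edge cases more carefully):

```python
import numpy as np

def hierarchical_samples(coarse_t, coarse_weights, num_fine, rng=None):
    """Draw extra depths where the coarse pass assigned high weight (inverse transform sampling)."""
    rng = rng or np.random.default_rng()
    pdf = coarse_weights / (coarse_weights.sum() + 1e-8)
    cdf = np.cumsum(pdf)
    u = rng.random(num_fine)                      # uniform samples in [0, 1)
    idx = np.searchsorted(cdf, u)                 # invert the CDF
    return coarse_t[np.clip(idx, 0, len(coarse_t) - 1)]
```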
Once all the samples have been processed by the network, their colours and densities are composited along each ray to produce the final rendered image.
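That compositing step follows the standard volume rendering quadrature; a NumPy sketch for a single ray:

```python
import numpy as np

def composite_ray(rgb, sigma, t_vals):
    """Alpha-composite per-sample colours into a single pixel colour.

    rgb:    (N, 3) colours predicted by the MLP
    sigma:  (N,)   densities predicted by the MLP
    t_vals: (N,)   sample depths along the ray
    """
    deltas = np.append(np.diff(t_vals), 1e10)                       # distance between samples
    alpha = 1.0 - np.exp(-sigma * deltas)                           # opacity of each segment
    transmittance = np.cumprod(np.append(1.0, 1.0 - alpha))[:-1]    # light surviving up to each sample
    weights = alpha * transmittance
    return (weights[:, None] * rgb).sum(axis=0)                     # final RGB for this pixel
```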
Training datasets
View synthesis aims to produce novel views from one or more given source views. Although existing methods achieve promising performance, they typically need paired views with varied poses in order to learn the transformation between them. In this section, we describe some datasets commonly used for scene synthesis.
Local Light Field Fusion (LLFF):
LLFF is an approach for capturing and rendering novel views of real-world scenes. To obtain a good dataset, the captured images should have a maximum disparity between views of no more than about 64 pixels.
BLEnder Forward-Facing (BLEFF):
BLEFF images are constructed from 3D photo-realistic models (synthetic data) made using Blender, which is a free and open-source 3D computer graphics software tool.
This dataset provides high-resolution video sequences as input, which can eventually be decomposed into frames.
DTU Multi-View Stereo (MVS) dataset:
This dataset was captured with a camera mounted on an industrial ABB robot arm, with the whole setup enclosed in a black box.
Defining key metrics
In order to properly evaluate and compare the models we developed, we selected a set of key metrics that are widely used in the image reconstruction field of computer vision.
Peak Signal-to-Noise Ratio (PSNR):
Peak Signal-to-Noise Ratio is a general measurement that assesses signal quality by computing the ratio between the maximum possible power of the signal and the power of the corrupting noise. It is usually expressed on a logarithmic scale (decibels) because of the wide dynamic range of the signals involved.
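In code, PSNR between a rendered image and its ground truth (assuming pixel values in [0, 1]) is simply:

```python
import numpy as np

def psnr(prediction, target, max_val=1.0):
    """Peak Signal-to-Noise Ratio in decibels; higher is better."""
    mse = np.mean((prediction - target) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```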
Structural SIMilarity index (SSIM):
Structural SIMilarity index is a metric designed to measure the perceptual similarity between two images. It combines three factors: luminance l, contrast c, and structure s. These factors are raised to the powers of the weights α, β, and γ respectively (usually all set to 1) and then multiplied to give the final score.
Mathematically, l, c, and s are computed from the means, variances, and covariance of the target Y and the prediction Ŷ.
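With the usual stabilising constants C1, C2, and C3, the three factors and the final index take the standard form:

```latex
l(Y,\hat{Y}) = \frac{2\mu_Y\mu_{\hat{Y}} + C_1}{\mu_Y^2 + \mu_{\hat{Y}}^2 + C_1}, \quad
c(Y,\hat{Y}) = \frac{2\sigma_Y\sigma_{\hat{Y}} + C_2}{\sigma_Y^2 + \sigma_{\hat{Y}}^2 + C_2}, \quad
s(Y,\hat{Y}) = \frac{\sigma_{Y\hat{Y}} + C_3}{\sigma_Y\sigma_{\hat{Y}} + C_3}

\mathrm{SSIM}(Y,\hat{Y}) = l(Y,\hat{Y})^{\alpha} \cdot c(Y,\hat{Y})^{\beta} \cdot s(Y,\hat{Y})^{\gamma}
```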
Learned Perceptual Image Patch Similarity (LPIPS):
Learned Perceptual Image Patch Similarity is used to determine how similar two images look to the human eye. It computes a distance between the deep feature activations of two image patches, each of shape [N, 3, H, W].
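In practice this is usually computed with the `lpips` Python package; a minimal sketch, assuming the inputs are tensors scaled to [-1, 1]:

```python
import lpips
import torch

# Pretrained AlexNet features are a common default backbone for LPIPS.
loss_fn = lpips.LPIPS(net='alex')

img0 = torch.rand(1, 3, 64, 64) * 2 - 1   # dummy patches of shape [N, 3, H, W] in [-1, 1]
img1 = torch.rand(1, 3, 64, 64) * 2 - 1
distance = loss_fn(img0, img1)            # lower means more perceptually similar
print(distance.item())
```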
Conclusion
In this article we had an insight into NeRF: its architecture, the different types of inputs, as well as the metrics used to evaluate it.
In conclusion, most of these image generators require intensive computational resources and accept only a limited range of inputs, which restricts their use in practical scenarios.
It is important to note that even state-of-the-art deep learning models are not perfect and may not always produce accurate results. There may be times when the output of a model is not as good as desired, and this can be due to a variety of factors, such as poor model design or biases in the training data.