5 things you must know about Neural Radiance Fields ⚡️

Thomas Rouch
Check & Visit — Computer Vision
7 min read · Sep 30, 2022


There’s huge hype around NeRF. Let’s explain it briefly and debunk some common misconceptions.

Photo by Clément Falize on Unsplash

Introduction

Neural Radiance Fields, a.k.a. NeRF, has gained a lot of popularity recently in the field of Computer Vision. Its goal is to memorize a 3D scene from a sparse set of input images in order to synthesize novel views. In other words, once trained, you can use the model to render new images, as if you had taken these pictures yourself from new viewpoints.

Have a look at the NeRF presentation video to see it in action!

1. It has nothing to do with Deep Learning

When working with 2D images, Deep Learning is both very popular and powerful. It has outperformed most traditional techniques in classification and detection tasks, and has since spread to 3D Computer Vision applications. That’s why it’s natural to think that a problem as complex as Novel View Synthesis could only be solved by a large Deep Learning model.

On the contrary, the beauty of the original NeRF paper lies in its simplicity: it requires only a single multi-layer perceptron (MLP) to give mind-blowing results.

This compactness results from a smart cooperation between Neural Networks and Light Transport techniques.

When we look at an object, what we actually see is the aggregation of light rays reflected in the direction of the camera. If we could predict the direction-dependent emitted radiance of any point in the scene, we could easily reconstruct novel views by using Rendering techniques. As a consequence, instead of asking the model to directly render a 2D image, we make a fully-connected network predict the color (RGB) and opacity (σ) for any spatial location (x,y,z) and viewing direction (θ, φ). To sum up, we have to fit a single model with 5D input and 4D output.
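To make the 5D-in / 4D-out idea concrete, here is a minimal PyTorch sketch of such a coordinate-based MLP. It’s a simplified stand-in, not the actual NeRF architecture: the real network applies a positional encoding to its inputs and only injects the viewing direction near the end, which is omitted here.

```python
import torch
import torch.nn as nn


class TinyRadianceField(nn.Module):
    """Toy coordinate-based MLP: (x, y, z, θ, φ) -> (R, G, B, σ)."""

    def __init__(self, hidden: int = 256, depth: int = 4):
        super().__init__()
        layers = [nn.Linear(5, hidden), nn.ReLU()]
        for _ in range(depth - 1):
            layers += [nn.Linear(hidden, hidden), nn.ReLU()]
        layers.append(nn.Linear(hidden, 4))  # 3 color channels + 1 opacity
        self.mlp = nn.Sequential(*layers)

    def forward(self, x):
        out = self.mlp(x)                # (N, 4)
        rgb = torch.sigmoid(out[:, :3])  # colors constrained to [0, 1]
        sigma = torch.relu(out[:, 3])    # non-negative opacity
        return rgb, sigma


# One query per (x, y, z, θ, φ) sample; the model is queried many times per image
model = TinyRadianceField()
rgb, sigma = model(torch.rand(1024, 5))  # rgb: (1024, 3), sigma: (1024,)
```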

The opacity (σ) is a non-negative scalar that controls, in a differentiable way, how much radiance is accumulated when passing through a point. To ensure multi-view consistency, we keep it independent of the viewing direction.

Each pixel can then be rendered by merging radiance samples along its corresponding ray, using the accumulated opacity to weight their contributions. This is called Volume Rendering.

In the image below, we can notice two peaks in the opacity: one for the red object in front, and another for the blue object behind it. Obviously, once too much opacity has been accumulated along the ray, the contribution of more distant samples vanishes because we can’t see them anymore. Thus the overall color of this pixel ends up being red, while the blue object remains occluded.

Volume Rendering: RGB and σ samples along the light ray corresponding to a given pixel in the image - Image by the author

2. Volume Rendering is the key

Since the publication of the original NeRF paper in March 2020, many new papers have tackled the Novel View Synthesis task. And yet, most of them have kept the Volume Rendering part, i.e. the step where radiance samples are aggregated along a ray. Indeed, they mainly focus on improving the predictor for the view-dependent color (RGB) and opacity (σ).

Volume Rendering has been known in the Computer Graphics community for a long time, since it was introduced at SIGGRAPH in 1984. NeRF’s ingenuity was to use its discrete approximation to build an end-to-end differentiable model that solves the Novel View Synthesis problem.

A light ray is a half-infinite 1D line in 3D space; once we know its 3D origin o and 3D direction d, it can be parametrized by a positive time t as r(t) = o + t·d.

Parametric light ray with origin o and direction d — Image by the author

Below is the discrete Volume Rendering equation as used in NeRF to predict the color along a ray with N samples (1 ≤ i ≤ N). It’s a weighted sum of the color samples, where the contributions tend toward 0 as opacity accumulates along the ray.
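Written out, with δᵢ = tᵢ₊₁ − tᵢ denoting the distance between consecutive samples, the estimator from the original NeRF paper reads:

```latex
\hat{C}(\mathbf{r}) \;=\; \sum_{i=1}^{N} T_i \,\bigl(1 - e^{-\sigma_i \delta_i}\bigr)\, \mathbf{c}_i,
\qquad
T_i = \exp\!\Bigl(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Bigr),
\qquad
\delta_i = t_{i+1} - t_i
```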

(This can be derived from the continuous Volume Rendering equation by assuming that the color and opacity are constant on each interval [tᵢ, tᵢ₊₁] and integrating in closed form.)
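In code, this weighted sum boils down to a few lines. Here is a minimal NumPy sketch (function and variable names are mine, not tied to any particular implementation):

```python
import numpy as np


def render_ray(rgb, sigma, t):
    """Composite N radiance samples along a single ray.

    rgb:   (N, 3) color samples c_i
    sigma: (N,)   opacity samples σ_i
    t:     (N+1,) sample positions along the ray, t_1 < ... < t_{N+1}
    """
    delta = np.diff(t)                    # δ_i = t_{i+1} - t_i
    alpha = 1.0 - np.exp(-sigma * delta)  # local opacity of each segment
    # T_i: transmittance, i.e. the fraction of light that survives up to sample i
    transmittance = np.concatenate(([1.0], np.exp(-np.cumsum(sigma * delta))[:-1]))
    weights = transmittance * alpha       # contribution of each sample
    return (weights[:, None] * rgb).sum(axis=0)  # final pixel color
```

The weights Tᵢ·(1 − exp(−σᵢδᵢ)) sum to at most 1, which is exactly why samples hidden behind an opaque surface, like the blue object in the figure above, barely contribute to the final color.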

3. Neural Networks aren’t even mandatory

As explained previously, Volume Rendering seems to be the key, not the neural networks. The MLP is just one choice among many others to fit the underlying function that maps position and direction to color and opacity.

Plenoxels (December 2021) frees itself from neural networks by replacing the coordinate-based MLP with a sparse voxel grid of opacity and spherical-harmonics coefficients. Its somewhat provocative title makes it very clear: “Plenoxels: Radiance Fields without Neural Networks”. Each query is then obtained by trilinear interpolation of the neighboring voxels.

Spherical harmonics are the spherical analog of the Fourier decomposition and allow us to decompose any scalar function defined on the 2D unit sphere. We can use them in each voxel to approximate the function that maps a viewing direction to a color (3 lists of coefficients, one per color channel). Glossy surfaces need a lot of coefficients, while matte ones need only a few.
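For illustration, here is a toy NumPy sketch of a Plenoxels-style query using degree-1 spherical harmonics (4 coefficients per color channel). The names and the nearest-voxel lookup are simplifications of mine; the actual method uses degree-2 harmonics (9 coefficients per channel), a sparse grid and trilinear interpolation.

```python
import numpy as np


def sh_basis_deg1(d):
    """Real spherical-harmonics basis up to degree 1 for a unit direction d."""
    x, y, z = d
    return np.array([0.282095,       # Y_0^0  (constant term)
                     0.488603 * y,   # Y_1^-1
                     0.488603 * z,   # Y_1^0
                     0.488603 * x])  # Y_1^1


def plenoxel_query(grid_sh, grid_sigma, xyz, view_dir):
    """Toy Plenoxels-style query on a dense grid (nearest voxel, no interpolation)."""
    i, j, k = np.floor(xyz).astype(int)          # enclosing voxel
    coeffs = grid_sh[i, j, k]                    # (3, 4) SH coefficients per channel
    basis = sh_basis_deg1(view_dir / np.linalg.norm(view_dir))
    rgb = 1.0 / (1.0 + np.exp(-coeffs @ basis))  # squash colors into [0, 1]
    return rgb, grid_sigma[i, j, k]


# Toy usage on a dense 32³ grid (the real model stores a *sparse* grid)
grid_sh = 0.1 * np.random.randn(32, 32, 32, 3, 4)
grid_sigma = np.random.rand(32, 32, 32)
rgb, sigma = plenoxel_query(grid_sh, grid_sigma,
                            xyz=np.array([10.2, 5.7, 20.1]),
                            view_dir=np.array([0.0, 0.0, 1.0]))
```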

TensoRF (March 2022) stands for “Tensorial Radiance Fields”. It leverages tensor decomposition techniques to break the O(n³) complexity of the voxel grid and achieve high compactness. The model does not depend on a specific decoding function and can, for instance, predict the color either from neural features like NeRF or from spherical harmonics like Plenoxels.
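To see where the memory saving comes from, here is a toy CP-style factorization of a dense grid (TensoRF actually learns a more expressive vector-matrix decomposition, but the principle is the same):

```python
import numpy as np

n, r = 128, 8                        # grid resolution and decomposition rank

# Dense voxel grid: n³ values to store
dense_grid = np.random.rand(n, n, n)                    # ~2.1M floats

# CP-style factorization: r rank-1 terms, each the outer product of 3 vectors
vx, vy, vz = (np.random.rand(r, n) for _ in range(3))   # only 3·r·n ≈ 3k floats
reconstructed = np.einsum('rx,ry,rz->xyz', vx, vy, vz)  # back to an n³ grid on demand

# In practice a single cell (i, j, k) is queried directly, without rebuilding the grid:
value = np.einsum('r,r,r->', vx[:, 10], vy[:, 20], vz[:, 30])
```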

Instant-NGP (July 2022), a.k.a. “Instant Neural Graphics Primitives”, plugs a multiresolution hash encoding in front of the fully-connected neural network to dramatically speed up the model. Their tiny-cuda-nn framework helps cope with the slowness of conventional MLPs for this very specific use-case. Neural networks are still used, but most of the weights go into the hash encoding (a rough sketch of the hashing idea follows the comparison below).

  • Original NeRF: 9 hidden layers of 256 neurons and a final hidden layer of 64 neurons
  • Instant NGP: 2 hidden layers of 64 neurons (and hash-encoding weights)
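To give a rough idea of the hash-encoding step that feeds this much smaller MLP, here is a heavily simplified NumPy sketch. The three primes come from the Instant-NGP paper; everything else (parameter values, nearest-corner lookup instead of trilinear interpolation, random tables instead of learned features) is a simplification of mine.

```python
import numpy as np

# The three primes used by the spatial hash in the Instant-NGP paper
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)


def hash_corner(corner, table_size):
    """XOR-based spatial hash of an integer 3D grid coordinate."""
    h = np.uint64(0)
    for c, p in zip(corner.astype(np.uint64), PRIMES):
        h ^= c * p
    return int(h % np.uint64(table_size))


def multires_encoding(xyz, tables, base_res=16, growth=1.5):
    """Concatenate per-level features for a 3D point in [0, 1]^3.

    tables: one (table_size, feature_dim) array per resolution level.
    Simplification: nearest grid vertex instead of trilinear interpolation.
    """
    features = []
    for level, table in enumerate(tables):
        res = int(base_res * growth ** level)
        corner = np.round(np.asarray(xyz) * res)  # nearest vertex at this level
        idx = hash_corner(corner, table.shape[0])
        features.append(table[idx])
    return np.concatenate(features)  # fed to the small 2x64 MLP


# Toy usage: 8 levels, 2^14 entries of 2 features each (random here,
# learned jointly with the MLP in the real model)
tables = [np.random.randn(2**14, 2).astype(np.float32) for _ in range(8)]
encoded = multires_encoding([0.3, 0.7, 0.1], tables)  # shape: (16,)
```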

4. Learned weights do not assume a specific camera model

Most of the time, an algorithm or model dealing with 2D images won’t work very well if we change the camera model. For instance, a 2D detection model like YOLO will probably fail on images with very strong distortion, like those from spherical 360° cameras.

What’s great about NeRF is that the rendering process operates on each pixel independently. This means that the training dataset can be seen as a batch of light rays with ground-truth colors. This derives from the fact that the MLP doesn’t directly generate the output image.

As a consequence, we’re fine as long as we have a function that maps 2D pixel coordinates to 3D light rays. We can train on images coming from different camera models and render images for any camera model, as sketched after the list below.

  • Pinhole camera: Each ray starts from the camera’s position and passes through its corresponding pixel on the image plane.
  • 360° camera: Each ray starts from the camera’s position and passes through its corresponding pixel on the image unit sphere.
  • Orthographic camera: Each ray starts from its corresponding pixel on the image plane and travels along the Z (forward) axis.
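Here is a minimal NumPy sketch of these pixel-to-ray mappings, assuming a camera at the origin with identity orientation and hypothetical intrinsics (fx, fy, cx, cy); a real implementation would also apply the camera-to-world pose to both origins and directions.

```python
import numpy as np


def pinhole_ray(u, v, fx, fy, cx, cy):
    """Pixel (u, v) -> ray through the image plane of a pinhole camera."""
    origin = np.zeros(3)  # camera center
    direction = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    return origin, direction / np.linalg.norm(direction)


def spherical_ray(u, v, width, height):
    """Pixel (u, v) of an equirectangular 360° image -> ray on the unit sphere."""
    theta = (u / width) * 2.0 * np.pi - np.pi   # longitude in [-π, π]
    phi = (0.5 - v / height) * np.pi            # latitude in [-π/2, π/2]
    direction = np.array([np.cos(phi) * np.sin(theta),
                          np.sin(phi),
                          np.cos(phi) * np.cos(theta)])
    return np.zeros(3), direction


def orthographic_ray(u, v, cx, cy, pixel_size):
    """Pixel (u, v) -> ray starting on the image plane, travelling along +Z."""
    origin = np.array([(u - cx) * pixel_size, (v - cy) * pixel_size, 0.0])
    return origin, np.array([0.0, 0.0, 1.0])
```

Whatever the camera model, the output is the same: a 3D origin and a unit direction, ready to be fed to the Volume Rendering step described above.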

For instance, I proposed an implementation of 360° and orthographic camera models to NVIDIA’s Instant-NGP GitHub repository (see Pull Request).

5. Photogrammetry is far from dead

Put simply, Photogrammetry is a family of algorithms that build a point cloud or a 3D mesh from an unordered collection of input images, by matching and triangulating 2D features.

This reconstruction task sounds very similar to the one that NeRF is facing, and some people claim that NeRF might be the death of photogrammetry. Indeed, traditional techniques provide accurate but sparse results and aren’t as good as NeRF-like methods at filling in the gaps.

However, there remains one crucial point that many seem to forget: NeRF takes 2D images as input, but it also requires their camera poses, which are typically estimated with Structure-from-Motion tools like COLMAP.

As a result, it doesn’t really make sense to compare NeRF and Photogrammetry directly, knowing that NeRF requires Photogrammetry as a preliminary step.

Conclusion

I hope you enjoyed reading this article and that it gave you more insight into how NeRF-like methods actually work!

https://github.com/ThomasParistech


Computer Vision Engineer who loves to dissect concepts/algorithms in detail 🔥