Neural networks in photogrammetry

Oleg Postoev
11 min read · Mar 2, 2020

--

[RU] Russian version of this article

Photogrammetry is a discipline whose techniques let you build 3D models from photographs. You just need several captures (the more the better) of an object, interior, or exterior, pass them to the software, and get a 3D model.

A simple demonstration of the photogrammetry process

This is how the resulting models may look after some processing:

Quixel demonstrates much better quality; they produce models with this technology professionally.

Screenshot of Quixel website

The applications of photogrammetry are quite broad; here are a few examples:

  1. Creating 3D models for cinema or video games.
    For example, for maximum realism in the latest “Call of Duty”, many models were made this way.
    Building a world of Modern Warfare
  2. Creating a terrain model for construction.
    Documentation of paintings restoration through photogrammetry and change detection algorithms
  3. Mapping.
    Photogrammetry on the fly
  4. Autopilots and robotics.
    3D reconstruction, shows depth of information a Tesla can collect from just a few seconds of video from the vehicle’s 8 cameras

Formulation of the problem

As stated above, photogrammetry is the discipline that defines the general approach, but in practice there are many programs for working with photogrammetry. Each of them follows algorithms common to the discipline, albeit with some nuances.

Here is a screenshot of the standard pipeline for creating a 3D model from the Meshroom program:

Screenshot of Meshroom

Behind each of these blocks there are several auxiliary sub-algorithms and many parameters that affect the result, and it is not always obvious exactly how they will affect it.

These algorithms are quite old, and since they are at the core of almost every program, the programs share common problems.

These problems include:

  1. Low performance.
    Creating a model takes from 30 minutes to tens of hours on a powerful PC.
  2. Poor handling of repeating patterns.
    Textures of carpets, parquet, paving tiles, etc. begin to double, and parts of the models get duplicated and superimposed on themselves.
  3. Problems with glossy and reflective surfaces.
    Dips or bulges appear on models that are very different from the actual shape of the object.
An example of problems when restoring a model from a photo. Source: https://www.agisoft.com/forum/index.php?topic=3594.0
Errors when restoring glossy surfaces. Source: https://www.reddit.com/r/photogrammetry/comments/dkii5u/colmap_vs_agisoft_2_low_settings_shiny_cup/

Looking at the problems of existing approaches, we can conclude that these are problems of the algorithms, because a human can understand from a photo what an object should look like in 3D. This is confirmed by the many tutorials on building a model from just a few shots.

For example, some time ago I made a model of a car for animation. I had a couple of drawings and photos as a reference.

An example of creating a 3D model of a car from several drawings. Source: my Telegram channel

Why not try to solve the problem with neural networks? If you search, you can find a lot of groundwork in this area, but for the most part it has not reached practical application.
Collections with examples of such solutions:
https://github.com/timzhang642/3D-Machine-Learning
https://github.com/natowi/3D-Reconstruction-with-Neural-Network

For me, this is an interesting opportunity to practice developing neural networks and to get to know the software side of Blender better. And there is plenty to get acquainted with:

Screenshot of Blender. Inspiring!

Solution overview

The process of creating a model from snapshots consists of a large number of steps, as shown above. It would be ideal to replace them all with one neural network, and in the future this will probably happen. But first I want to solve the earliest stage: calculating the positions of the cameras in the scene, that is, the points from which the pictures were taken relative to the scene as a whole.

Demonstration of camera position reconstruction. Source: https://thehaskinssociety.wildapricot.org/photogrammetry

In general, this problem is solved as follows: on a pair of images, the program searches for common points and, by measuring how they shift from one image to the other, calculates how far apart the shots were taken and how the camera was rotated.

Demonstration of the SIFT algorithm in ideal conditions. Source: https://www.researchgate.net/figure/Scale-invariant-feature-transform-SIFT-matching-result-of-a-few-objects-placed-in_fig1_259952213
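
To make this concrete, here is a minimal sketch of classical feature matching with OpenCV's SIFT implementation; it is my own illustration rather than the code of any particular photogrammetry package, and the file names are placeholders.

import cv2

# Load two photos of the same scene (paths are placeholders)
img1 = cv2.imread("shot_a.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("shot_b.jpg", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute SIFT descriptors for both images
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Match descriptors and keep only confident matches (Lowe's ratio test)
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

print(len(good), "common points found")
# From such point pairs the classical pipeline goes on to estimate the
# relative camera pose (for example via the essential matrix).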

If you let a neural network solve such a problem directly, it will predict the distance between the shots along the three XYZ axes and the rotation angles relative to the other camera. The problem is that it will do this even for shots that have no common points at all, for example when the pictures capture opposite walls of the same room. In that case the network's output cannot be used. First you need to understand whether the pictures have anything in common.

Here is an example of two frames from the same scene but from opposite angles. There are too few common points in the images to work out where the second camera is relative to the first.

Source of 3d-model: https://www.youtube.com/watch?v=Crk5btO4WUw

In short: our task is to create a neural network that determines whether two images overlap.

Dataset collection

A dataset is a collection of data split into two parts: one for training the neural network, and one for testing it that the network does not see during training.

To train a neural network, you need to collect many training examples.
Each of them consists of two images and a number that indicates the degree of overlap between those images.

An example of a dataset unit: two images plus a number indicating how much the second picture has in common with the first. The number is computed from the amount of white pixels: if the whole picture is white the value is one, if it is all black the value is zero.
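
As a rough sketch of how such a value can be computed (my own illustration, assuming the b/w mask is saved as an ordinary image file with a made-up name):

import numpy as np
from PIL import Image

def overlap_value(mask_path):
    # Load the black-and-white render as a grayscale array in [0, 255]
    mask = np.asarray(Image.open(mask_path).convert("L"), dtype=np.float32)
    # Fraction of white: 1.0 if the whole frame is lit, 0.0 if all black
    return float((mask / 255.0).mean())

print(overlap_value("scene1_root100_frame010.png"))  # e.g. ~0.176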

The first two images are easy to get with Blender's standard tools: just animate the camera flying around the room. To simulate handheld shooting, add noise to the motion animation:

Camera animation
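
In Blender this kind of handheld jitter can be added with a noise modifier on the camera's animation curves. A minimal bpy sketch (the object name and parameter values are my assumptions):

import bpy

cam = bpy.data.objects["Camera"]  # assumed object name

# Add a noise modifier to every animation curve of the camera
# (location and rotation channels), simulating handheld shake
for fcurve in cam.animation_data.action.fcurves:
    noise = fcurve.modifiers.new(type='NOISE')
    noise.scale = 25.0     # how slowly the noise changes over frames
    noise.strength = 0.05  # amplitude of the shake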

Click “Render” and get a set of images:

Now let's look at how to obtain the picture from which the degree of overlap between two frames is calculated.
Step by step:

  1. To simplify things, I set all materials and textures to pure white, with no reflections:
Comparison of the scene with and without materials

2. Now I put the camera inside a box with a light source at its center and a hole for the camera. The hole is just large enough not to obstruct the camera's view, and no larger. As a result, the light source illuminates only the part of the scene that the camera sees from this position:

Camera in a box with a light source in the center and a hole for the camera

3. Let's look at the result from this position and from another one. You can see that the shadows are blurry: not only the part visible from the initial camera position is lit, but the shadowed areas are illuminated as well:

Comparison of the result from the initial position of the camera (“root”) and the view from the side (“target”)

4. Reducing the size of the light source makes the shadows sharper, and reducing the number of light bounces makes them completely black (a small bpy sketch of these settings follows step 5):

The effect of the number of light bounces on the shadows (decreasing from 4 to 0)

5. The result is a set of black-and-white frames characterizing the degree of overlap between frames:

The brightness of each frame indicates how much it has in common with the original image
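
For reference, a minimal bpy sketch of the key settings from steps 4 and 5; the light name and the exact values are my assumptions, and the box itself is set up by hand in the scene:

import bpy

scene = bpy.context.scene

# Sharper shadows: shrink the light source sitting inside the camera box
light = bpy.data.lights["BoxLight"]  # assumed name of the light in the box
light.shadow_soft_size = 0.01        # nearly a point light

# Completely black shadows: disable indirect light bounces in Cycles
scene.cycles.max_bounces = 0
scene.cycles.diffuse_bounces = 0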

There are many nuances in rendering these black-and-white frames.
For example, for an animation of 800 frames you need to render 800 × 800 = 640,000 such b/w pictures to measure the correspondence of every frame to every other. This is not fast. At one second per frame (which is considered good speed), that is 640,000 seconds, more than a week of continuous rendering for a single 800-frame animation, and about 40 such scenes had already been prepared. I was not going to wait a year for everything to render, so I optimized the rendering of one scene down to a few hours (about 5), which is acceptable.

The optimizations and similar nuances would provide enough material for a couple more articles, but that is not the topic here.

Key point of render optimization

After processing the b/w pictures with scripts, we get a file like this describing the result:

[
  {
    "scene": 1,
    "root": 100,
    "frame": 1,
    "value": 0.1138
  },
  {
    "scene": 1,
    "root": 100,
    "frame": 10,
    "value": 0.176
  },
  {
    "scene": 1,
    "root": 100,
    "frame": 100,
    "value": 0.995
  },

Here:
- “scene” — scene number, identifier;
- “root” — the number of the source frame;
- “frame” — the number of the target frame;
- “value” — the degree of intersection of frames.
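
Loading such a file into a list of training examples could look roughly like this (a sketch of my own; the file name and directory layout of the rendered frames are assumptions):

import json

# Each record links two rendered frames with their overlap value
with open("dataset.json") as f:
    records = json.load(f)

def frame_path(scene, frame):
    # Hypothetical layout of the rendered color frames on disk
    return f"renders/scene_{scene:02d}/frame_{frame:04d}.png"

examples = [
    (frame_path(r["scene"], r["root"]),
     frame_path(r["scene"], r["frame"]),
     r["value"])
    for r in records
]
print(len(examples), "training examples")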

Neural network training

Now that the dataset can be considered ready, it’s time to move on to training the neural network.

The first problem I encountered is that the dataset contains far more examples with a small “value” than with a large one. If you do nothing about this, it is easier for the neural network to always predict numbers near zero and still show an accuracy above 90%.

Uneven distribution of examples in the dataset

To correct this imbalance, I split the dataset into 10 groups by value range. At every step of training, the neural network receives an equal number of examples from each group.
That is, when the network learns in batches of 50 examples, it gets 5 examples from each group.
The validation data comes from the same groups: every fifth example is set aside for the test sample, since we need to understand how the network will behave on data it has not seen before.
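
A sketch of how such balanced batches can be drawn (my own illustration, assuming each example is a tuple whose last element is the overlap value):

import random

def split_into_groups(examples, n_groups=10):
    # Bucket examples by their overlap value: [0.0-0.1), [0.1-0.2), ...
    groups = [[] for _ in range(n_groups)]
    for example in examples:
        value = example[2]
        index = min(int(value * n_groups), n_groups - 1)
        groups[index].append(example)
    return groups

def balanced_batch(groups, batch_size=50):
    # Take an equal number of examples from every value range
    per_group = batch_size // len(groups)
    batch = []
    for group in groups:
        batch.extend(random.sample(group, per_group))
    random.shuffle(batch)
    return batch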

To make sure that the neural network can fit the data at all, I first collect a small training sample of fifty examples (one batch) and train the network on it until its accuracy reaches extremely high values. For example, if I assume that an error of 1–5% will be good enough on the full dataset, then on this reduced dataset the network should overfit to below those values, for example 0.1–1%.

After several experiments, I managed to pick suitable parameters for starting the training of the neural network.

Comparison of the loss curves on the training and test samples consisting of a single batch. It can be seen that the network fits the data well, as expected

Now the neural network can be trained on the entire training set, reaching an error of only about 11–12%. This is a good result for a short amount of time. By the way, an error of about 0%, 50%, or 100% would be a bad sign. But in this case it is clear that the direction is correct and the result can be improved.

Loss function graphs obtained on a full training set

Now it is time to add dropout, normalization, and other data science tricks to prevent the neural network from overfitting. It works out nicely: the error is below 9%!

Loss function graphs obtained on a full training sample, also with regularization
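
The article does not spell out the architecture, so here is only a hedged Keras-style sketch of the kind of two-input regression network, with dropout and batch normalization, that is being described; the layer sizes and input resolution are my guesses:

from tensorflow import keras
from tensorflow.keras import layers

def image_branch():
    # Shared convolutional encoder for one input image (sizes are guesses)
    inp = layers.Input(shape=(128, 128, 3))
    x = layers.Conv2D(32, 3, activation="relu")(inp)
    x = layers.BatchNormalization()(x)  # normalization
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    return keras.Model(inp, x)

encoder = image_branch()
img_a = layers.Input(shape=(128, 128, 3))
img_b = layers.Input(shape=(128, 128, 3))
merged = layers.concatenate([encoder(img_a), encoder(img_b)])
merged = layers.Dropout(0.3)(merged)  # dropout against overfitting
merged = layers.Dense(64, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid")(merged)  # overlap value in [0, 1]

model = keras.Model([img_a, img_b], out)
model.compile(optimizer="adam", loss="mae")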

Before describing what I did next, I need to recall one important point: the neural network learns in batches, sets of examples from the training set. The batch size is limited either by the architecture of the neural network (sometimes batches of one or a couple of examples are needed) or by the memory of the machine the training runs on. I used batches of 50 examples.

At this point I started wondering: if the average error on the training sample is about 9%, what is the spread? Maybe it is in the range 5–14%, or maybe 0–90%. The loss graph does not answer this, so I ran “predict()” on every example in the training set.

Histogram of neural network errors. For example, in 18,547 examples, the error is in the range of 10–15%, and only in 7 examples it is more than 35%.
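
Such a histogram is easy to produce once the per-example errors are known; a sketch, assuming a Keras-style model like the one above and NumPy arrays x_a, x_b, y holding the training images and target values:

import numpy as np

# Absolute error of the network on every training example
predictions = model.predict([x_a, x_b]).ravel()
errors = np.abs(predictions - y)

# Count how many examples fall into each 5% error bucket
counts, edges = np.histogram(errors, bins=np.arange(0.0, 1.05, 0.05))
for low, high, n in zip(edges[:-1], edges[1:], counts):
    print(f"{low:.2f}-{high:.2f}: {n} examples")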

The results were interesting, as was the subsequent solution.
It turned out that the network rarely makes large mistakes: in only 2% of the examples was its prediction off by more than 10%.
At first, I saved these high-error examples separately as “super-hard” and mixed them into every batch during training. The overall accuracy of the network grew, but errors of more than 10% started to show up in other examples; still, such cases became three times rarer overall (less than 0.7%). This temporary solution convinced me that the direction was right, even though the implementation was not great yet.

Each batch is formed randomly before training. However, in Python the random sampling function can be given weights: a number for each element that determines the probability of choosing it.
As such a weight I used the previously calculated difficulty of each example. The more the network erred on a given example, the more often that example ends up in a batch.
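
In the standard library this is random.choices(); a minimal sketch that uses the per-example error as the weight:

import random

def weighted_batch(examples, errors, batch_size=50):
    # errors[i] is the difficulty (absolute error) of examples[i] from the
    # previous pass; harder examples are picked more often
    weights = [e + 0.01 for e in errors]  # small floor so easy examples still appear
    return random.choices(examples, weights=weights, k=batch_size)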

So now I have a neural network with an acceptable error of 9%. I use it to score the difficulty of every training example, then run training again, sampling hard examples more often: the larger the error, the more frequently the example appears.

The main feature of the training process is now that every 1,000 batches I recalculate the difficulty of one percent of the dataset, like a quiz for the network, and every 100,000 batches I run a full recalculation of the difficulty of all examples, like an exam. This scheme seemed the most balanced, since recalculating the difficulty of the entire dataset takes a significant amount of time (about 20–30 minutes), while recalculating one percent is quite fast (only 15–20 seconds).
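
Schematically the training loop then looks something like this; a sketch that builds on the pieces above, where load_example() and the training call are hypothetical placeholders:

import random

def example_error(model, example):
    # Hypothetical helper: absolute error of the model on one example
    img_a, img_b, value = load_example(example)  # loading is assumed elsewhere
    return abs(float(model.predict([img_a, img_b])[0, 0]) - value)

errors = [example_error(model, ex) for ex in examples]

for step in range(1, 1_000_000 + 1):
    batch = weighted_batch(examples, errors, batch_size=50)
    # ... train the model on this batch ...

    if step % 1000 == 0:
        # "Quiz": refresh the difficulty of a random 1% of the dataset
        for i in random.sample(range(len(examples)), len(examples) // 100):
            errors[i] = example_error(model, examples[i])

    if step % 100_000 == 0:
        # "Exam": recalculate the difficulty of every example
        errors = [example_error(model, ex) for ex in examples]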

This approach was enough to reduce the error from 9% to 4.5%. For this problem, such accuracy is more than enough. I am satisfied.

Now, given a pair of frames, the neural network can tell with better than 95% accuracy whether they have overlapping areas and how large the overlap is.

Conclusion

Let me remind you that the resulting neural network (I call it “Surface Match”) is only an intermediate link before creating a neural network that restores the position of cameras in a 3D scene. It is useful both for improving the dataset and for enhancing subsequent neural networks.

For example, it can help compile a more complete dataset containing all the frame pairs suitable for determining camera positions in the scene.

Or this functionality can be built into a new neural network: it is known that when a neural network solves several related problems, its accuracy and generalization ability improve.

Oh, and the dataset is on GitHub here: https://github.com/Dok11/surface-match-dataset
