**Depth estimation with deep neural networks, part 1**

While doing a survey with my colleague *Mahmoud Selmy* on state-of-the-art techniques for estimating depth maps from 2-D images with deep neural networks, we decided to write mini blogs that we hope will be useful to anyone familiar with NNs, whatever field they apply them in, assuming NO previous knowledge of this domain.

Well, to start, we must know what our input and output are, then find a cost function (i.e., error, aka loss) we can backpropagate through to train the network. That's what this part is about. Then we can start searching for suitable architectures.

**Input :**

A 2-D RGB image.

**Output :**

A depth map for this image: simply a 2-D matrix with the same size as the input image that encodes the distance of scene surfaces from the camera viewpoint, i.e., the larger a pixel's value, the farther the corresponding object.
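For concreteness, a depth map is just an ordinary 2-D array. Here is a tiny invented example (the values are made up for illustration):

```python
import numpy as np

# Hypothetical 4x4 depth map (values in meters, invented for illustration).
# A larger value means the surface is farther from the camera.
depth = np.array([
    [1.2, 1.2, 3.5, 3.5],
    [1.2, 1.3, 3.5, 3.6],
    [1.3, 1.3, 3.6, 3.6],
    [1.3, 1.4, 3.6, 3.7],
])

print(depth.shape)    # same spatial size as the input image
print(depth.max())    # depth of the farthest pixel
```

Here the left half of the "image" is a near surface (~1.3 m) and the right half a far one (~3.6 m); a network predicting this map would output one such value per input pixel.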

**Loss function :**

The following cost function is used in the paper "Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture" by David Eigen and Rob Fergus, and in many other related papers:

$$L(D, D^*) = \frac{1}{n}\sum_{i} d_i^2 \;-\; \frac{1}{2n^2}\Big(\sum_{i} d_i\Big)^2$$

Where:

D = log(predicted depth map)

D* = log(ground truth depth map)

d = D − D*

di = Di − D*i = the difference between predicted and ground-truth log depth at pixel i

n = total number of pixels

The scale-invariant loss helps measure the relationships between points in the scene, irrespective of the absolute global scale.

We are already familiar with the first term: 1/n times the sum of the squared differences between log Di and log D*i over all pixels i from 1 to n. This is simply the mean squared error (aka L2 error); it only measures the average deviation between each pixel's predicted depth and its ground truth.
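As a minimal NumPy sketch of the whole loss (the function name is mine; the default λ = 0.5 matches the 1/(2n²) factor the paper trains with):

```python
import numpy as np

def scale_invariant_loss(pred_depth, gt_depth, lam=0.5):
    """mean(d^2) - lam * (sum d)^2 / n^2, with d = log(pred) - log(gt)."""
    d = np.log(pred_depth) - np.log(gt_depth)  # per-pixel log difference
    n = d.size
    return (d ** 2).mean() - lam * d.sum() ** 2 / n ** 2

gt = np.array([[1.0, 2.0],
               [3.0, 4.0]])
print(scale_invariant_loss(gt, gt))  # perfect prediction -> 0.0
```

With λ = 0 this reduces to the plain L2 term; with λ = 1 it becomes the fully scale-invariant error.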

But what about the second term?!

Let's see what its purpose is, then jump to how this is achieved.

Consider a visualization of a predicted depth map next to its ground-truth map, where the output looks nearly perfect by human estimation.

However, the vanilla mean squared error would still be large, because the color range of each map is individually **scaled**.

On the other hand, our loss function with the added term produces a much smaller error, since its main concern is that the relative depth relations between pixels are well preserved.

But how does this cost function term achieve this ?

This term (with the negative sign) credits mistakes if they are in the same direction and penalizes them if they oppose. Thus, an imperfect prediction will have lower error when its mistakes are consistent with one another. Let's use an example with numbers to get the intuition as fast as possible.
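Here is a made-up two-pixel illustration (the numbers are mine, not from the paper). Both predictions have exactly the same plain L2 log error; they differ only in whether the two mistakes point the same way:

```python
import numpy as np

def scale_invariant_loss(pred, gt, lam=0.5):
    # d_i = log(pred_i) - log(gt_i); loss = mean(d^2) - lam * (sum d)^2 / n^2
    d = np.log(pred) - np.log(gt)
    return (d ** 2).mean() - lam * d.sum() ** 2 / d.size ** 2

gt = np.array([2.0, 4.0])

# Case 1: both pixels over-estimated by 25% (mistakes in the same direction).
same_dir = np.array([2.5, 5.0])

# Case 2: one pixel 25% too far, the other too near by the same log amount
# (4 / 1.25 = 3.2), so the mistakes oppose each other.
opposing = np.array([2.5, 3.2])

print(scale_invariant_loss(same_dir, gt))  # lower: errors are consistent
print(scale_invariant_loss(opposing, gt))  # higher: errors cancel in the sum
```

The first term is identical in both cases; only the subtracted term, which grows when the di share a sign, separates them.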

As shown above, by subtracting this term in each case from the first term of the cost, we find that the total loss favors the first case over the second, achieving what we mentioned before:

it credits mistakes if they are in the same direction and penalizes them if they oppose, so an imperfect prediction has lower error when its mistakes are consistent with one another.

We've talked in numbers, so let's grasp the intuition using another example.

Thinking about it, this term favors a trained network whose predictions deviate from the ground truth but keep the **relative** depth relations between all image pixels (for simplicity, consider predicting Di = 2D*i for every pixel i) over another network that perfectly predicts the pixels of some object X in the middle of the image, yet predicts another object Y to be nearer to the camera than X when it is in fact farther. Without this term, the second network would be favored, especially if Y is a small object (i.e., represented by a small number of pixels).

Although the first network predicted depth scaled by 2 relative to the ground truth, it kept the **relative** depth relation between X and Y by predicting that Y is farther than X, and our new loss chooses this first case by giving it a lower cost.
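This is easy to check numerically: when every prediction is the ground truth scaled by a constant factor, di is the same constant at every pixel, so with λ = 1 the subtracted term cancels the first term exactly (a sketch with an invented toy map):

```python
import numpy as np

def scale_invariant_loss(pred, gt, lam=0.5):
    d = np.log(pred) - np.log(gt)
    return (d ** 2).mean() - lam * d.sum() ** 2 / d.size ** 2

gt = np.array([1.0, 2.0, 4.0, 8.0])
pred = 2.0 * gt                      # everything predicted twice as far

plain_l2 = ((np.log(pred) - np.log(gt)) ** 2).mean()
print(plain_l2)                                  # (log 2)^2: clearly non-zero
print(scale_invariant_loss(pred, gt, lam=1.0))   # ~0: scale error fully forgiven
print(scale_invariant_loss(pred, gt, lam=0.5))   # halfway between the two
```

The λ = 0.5 used for training keeps part of the absolute penalty while still rewarding consistent, relation-preserving mistakes.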

Which is what we are seeking to achieve … yaaaay !

Building on this, a new term is added, making the cost function as follows:

$$L(D, D^*) = \frac{1}{n}\sum_{i} d_i^2 \;-\; \frac{1}{2n^2}\Big(\sum_{i} d_i\Big)^2 \;+\; \frac{1}{n}\sum_{i}\left[(\nabla_x d_i)^2 + (\nabla_y d_i)^2\right]$$

The newly added term compares image gradients of the prediction with those of the ground truth, penalizing a sudden change (in either direction of 2-D space) of the difference between the predicted and ground-truth maps. If pixels i and j are two neighboring pixels, this term forces the difference between di and dj to be as small as possible, driving the predicted depth values at i and j to be nearly equal when their ground truths are equal (i.e., pixels i and j have the same depth) and to vary when their ground truths differ. This encourages predictions to have not only close-by values but also similar local structure, which was experimentally found to produce outputs that better follow depth gradients, with no degradation in measured L2 performance, leading to more accurate predictions.
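A minimal sketch of the full loss with the gradient term, using simple forward differences for ∇x and ∇y (the function name is mine, and details such as the paper's multi-scale handling are omitted):

```python
import numpy as np

def depth_loss(pred, gt, lam=0.5):
    d = np.log(pred) - np.log(gt)    # per-pixel log difference
    n = d.size

    # Scale-invariant terms.
    loss = (d ** 2).mean() - lam * d.sum() ** 2 / n ** 2

    # Gradient term: penalize spatial changes of d in both directions,
    # pushing the prediction's local structure toward the ground truth's.
    dx = np.diff(d, axis=1)          # horizontal forward differences
    dy = np.diff(d, axis=0)          # vertical forward differences
    loss += ((dx ** 2).sum() + (dy ** 2).sum()) / n
    return loss

gt = np.array([[1.0, 1.0, 4.0],
               [1.0, 1.0, 4.0]])
smooth = 1.5 * gt                    # consistent scaling: d is constant

print(depth_loss(gt, gt))            # perfect prediction -> 0.0
print(depth_loss(smooth, gt))        # gradient term contributes nothing here
```

Because `smooth` scales every pixel by the same factor, d is constant, its gradients vanish, and only the (partially forgiven) scale error remains.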

That's it for today. Hopefully you have grasped how minor modifications to traditional loss functions can improve performance.

We intend to dig into different architectures for this domain in the next blog (part 2), followed by another rich with implementation code and illustrations.

If you are interested in this series, or in others on domains related to sequence-based prediction (e.g., video classification), which are not well covered in most courses, kindly give this post a clap and follow so you don't miss upcoming posts.

This post was strongly influenced by these papers :

1) Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture

2) Depth Map Prediction from a Single Image using a Multi-Scale Deep Network

You can contact me on LinkedIn.