Estimating depth for self-driving cars

Mels Hakobyan · Analytics Vidhya · Apr 5, 2020

The main idea was to teach a network to estimate depth from a single RGB image. Many papers have achieved great results in this field, and I decided to recreate some of them to see for myself how it works. My intuition for why it should work is simple: humans can drive a car by watching a live stream from a camera mounted on the windshield, so if humans can perceive depth from that view, a computer should be able to as well.

Data

I collected my data from the self-driving simulator CARLA. The simulator provided RGB camera images, semantic segmentation images, and depth maps. In addition, I mounted two RGB cameras on the front corners of the car, 90 cm apart, to collect stereo image pairs as well.
The dataset consists of about 40K images of size 320x180.
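
For reference, here is a rough sketch of how such a sensor rig can be attached through CARLA's Python API. The blueprint names and mounting offsets below are illustrative (they can differ slightly between CARLA versions), but the 90 cm stereo baseline matches the setup described above.

import carla

client = carla.Client('localhost', 2000)
client.set_timeout(10.0)
world = client.get_world()
bp_lib = world.get_blueprint_library()

# Spawn an ego vehicle at the first available spawn point.
vehicle_bp = bp_lib.filter('vehicle.tesla.model3')[0]
vehicle = world.spawn_actor(vehicle_bp, world.get_map().get_spawn_points()[0])

def make_camera(sensor_type, y_offset):
    # Attach a 320x180 camera to the vehicle, shifted sideways by y_offset metres.
    bp = bp_lib.find(sensor_type)
    bp.set_attribute('image_size_x', '320')
    bp.set_attribute('image_size_y', '180')
    transform = carla.Transform(carla.Location(x=1.5, y=y_offset, z=1.7))
    return world.spawn_actor(bp, transform, attach_to=vehicle)

left_cam = make_camera('sensor.camera.rgb', -0.45)    # stereo pair, 90 cm apart
right_cam = make_camera('sensor.camera.rgb', 0.45)
depth_cam = make_camera('sensor.camera.depth', 0.0)   # ground-truth depth labels
seg_cam = make_camera('sensor.camera.semantic_segmentation', 0.0)

left_cam.listen(lambda img: img.save_to_disk('out/left_%06d.png' % img.frame))
right_cam.listen(lambda img: img.save_to_disk('out/right_%06d.png' % img.frame))
depth_cam.listen(lambda img: img.save_to_disk('out/depth_%06d.png' % img.frame))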

depth map given by CARLA simulation

Model Architecture

At first I built a PyTorch FCN (Fully Convolutional Network) without a pre-trained encoder (a bad idea in hindsight), and later switched to a VGG-encoder-based architecture that I found on GitHub. I set the learning rate to 0.002, used the Adam optimizer, and added batch norm layers.
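
To make the idea concrete, here is a minimal sketch of a VGG16-encoder / convolutional-decoder depth regressor in PyTorch. It is not the exact model I took from GitHub; the decoder layout and layer sizes are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class VGGDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Pre-trained VGG16 convolutional layers as the encoder
        # (newer torchvision versions use the weights= argument instead of pretrained=).
        self.encoder = models.vgg16(pretrained=True).features
        # Simple upsampling decoder back to a 1-channel depth map.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        depth = self.decoder(self.encoder(x))
        # 180 is not a multiple of 32, so resize back to the input resolution.
        return F.interpolate(depth, size=x.shape[2:], mode='bilinear', align_corners=False)

model = VGGDepthNet()
optimizer = torch.optim.Adam(model.parameters(), lr=0.002)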

Training

I started my experiments with the single-image approach: a plain MSE loss trained for 2K iterations, which unsurprisingly gave no meaningful results (many papers do use a simple MSE loss, but they train their networks for several million iterations). I then switched to the Huber loss.
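
In PyTorch the Huber loss is readily available, so swapping it in is a one-line change; the tensors below are just dummy shapes matching my 320x180 depth maps.

import torch
import torch.nn as nn

criterion = nn.SmoothL1Loss()          # classic Huber loss with beta = 1
# criterion = nn.HuberLoss(delta=1.0)  # equivalent option in newer PyTorch versions

pred = torch.rand(5, 1, 180, 320)      # dummy batch of predicted depth maps
target = torch.rand(5, 1, 180, 320)    # dummy batch of ground-truth depth maps
loss = criterion(pred, target)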

With this loss I started using other inputs: the two (stereo) images from the left and right cameras combined into a single 3-channel RGB image, and the same two images stacked along the channel axis (6x320x180). I mostly trained for 4K-6K iterations with a batch size of 5-10 and 20 epochs per batch. With the Huber loss I couldn't get any results (maybe longer training would have helped, but my computational budget didn't allow that luxury), so I decided to try a different loss: the scale-invariant loss from this paper by David Eigen.
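
As an illustration (not the exact code I used), here is one plausible way to build the two stereo input formats; the simple averaging in blend_stereo is just one example of merging the frames into a 3-channel image.

import torch

def blend_stereo(left, right):
    # Merge left/right RGB frames into a single 3-channel image by averaging
    # (only an example; other blends are possible).
    # left, right: tensors of shape (3, 180, 320) with values in [0, 1].
    return (left + right) / 2.0

def stack_stereo(left, right):
    # Concatenate the two frames along the channel axis -> (6, 180, 320).
    # The network's first conv layer must then accept 6 input channels.
    return torch.cat([left, right], dim=0)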

My implementation follows the scale-invariant log loss from that paper.
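
The loss penalizes per-pixel errors in log depth while discounting a global scale shift; a PyTorch sketch of it (lam = 0.5 as in the paper, variable names illustrative) looks roughly like this:

import torch

def scale_invariant_loss(pred, target, lam=0.5, eps=1e-6):
    # Scale-invariant log loss from Eigen et al. (2014).
    # pred, target: depth tensors of shape (B, 1, H, W) with positive values.
    d = torch.log(pred + eps) - torch.log(target + eps)
    n = d[0].numel()                                      # pixels per sample
    mse_term = (d ** 2).sum(dim=(1, 2, 3)) / n
    scale_term = lam * d.sum(dim=(1, 2, 3)) ** 2 / (n ** 2)
    return (mse_term - scale_term).mean()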

After implementing that loss, here are my results after very few iterations of training.

input of single image method
same input of left and right camera images combined
label for both methods
model prediction after 3K iteration training for single image
model prediction after 6K iteration training for combined stereo image
middle camera images for stacked stereo input (this image of the middle camera is not used in the input)
label for above input
model prediction after 6K iteration training on stacked stereo image input

The real problem is that no cars are visible in the predictions, and that is bad, because the first thing we need to see here are the other agents on the road. I am fairly confident that, with much longer training, the model would be able to do what we want. As a final note, I let the single-image method train for another 12K iterations, and here are a couple of results.

Not bad for 12K iterations, right? Here are some more.

You can find all of this code on my GitHub page.
