Estimating depth for self-driving cars

Mels Hakobyan · Analytics Vidhya · Apr 5, 2020

The main idea was to teach a network to estimate depth from a single RGB image. Many papers have achieved great results in this field, and I decided to recreate some of them to see for myself how it works. My intuition for why it should work is simple: humans can drive a car by watching a live stream from a camera mounted on the windshield, so if humans can perceive depth from that view, a computer should be able to as well.

Data

I collected my data from the self-driving simulator CARLA. The simulator provided RGB camera images, semantic segmentation images, and depth maps. In addition, I mounted two RGB cameras on the front corners of the car, 90 cm apart, to collect stereo image pairs as well.
The dataset consists of about 40K images of size 320x180.
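
For reference, here is a rough sketch of how such a sensor rig can be attached through CARLA's Python API. The blueprint names and mounting offsets below are illustrative (they can differ slightly between CARLA versions), but the 90 cm stereo baseline matches the setup described above.

import carla

client = carla.Client('localhost', 2000)
client.set_timeout(10.0)
world = client.get_world()
bp_lib = world.get_blueprint_library()

# Spawn an ego vehicle at the first available spawn point.
vehicle_bp = bp_lib.filter('vehicle.tesla.model3')[0]
vehicle = world.spawn_actor(vehicle_bp, world.get_map().get_spawn_points()[0])

def make_camera(sensor_type, y_offset):
    # Attach a 320x180 camera to the vehicle, shifted sideways by y_offset metres.
    bp = bp_lib.find(sensor_type)
    bp.set_attribute('image_size_x', '320')
    bp.set_attribute('image_size_y', '180')
    transform = carla.Transform(carla.Location(x=1.5, y=y_offset, z=1.7))
    return world.spawn_actor(bp, transform, attach_to=vehicle)

left_cam = make_camera('sensor.camera.rgb', -0.45)    # stereo pair, 90 cm apart
right_cam = make_camera('sensor.camera.rgb', 0.45)
depth_cam = make_camera('sensor.camera.depth', 0.0)   # ground-truth depth labels
seg_cam = make_camera('sensor.camera.semantic_segmentation', 0.0)

left_cam.listen(lambda img: img.save_to_disk('out/left_%06d.png' % img.frame))
right_cam.listen(lambda img: img.save_to_disk('out/right_%06d.png' % img.frame))
depth_cam.listen(lambda img: img.save_to_disk('out/depth_%06d.png' % img.frame))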

depth map given by CARLA simulation

Model Architecture

At first I built a PyTorch FCN (Fully Convolutional Network) without a pre-trained encoder (a bad idea in hindsight), and later switched to a VGG-encoder-based architecture that I found on GitHub. I set the learning rate to 0.002, used the Adam optimizer, and added batch norm layers.
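
To make the idea concrete, here is a minimal sketch of a VGG16-encoder / convolutional-decoder depth regressor in PyTorch. It is not the exact model I took from GitHub; the decoder layout and layer sizes are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class VGGDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Pre-trained VGG16 convolutional layers as the encoder
        # (newer torchvision versions use the weights= argument instead of pretrained=).
        self.encoder = models.vgg16(pretrained=True).features
        # Simple upsampling decoder back to a 1-channel depth map.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        depth = self.decoder(self.encoder(x))
        # 180 is not a multiple of 32, so resize back to the input resolution.
        return F.interpolate(depth, size=x.shape[2:], mode='bilinear', align_corners=False)

model = VGGDepthNet()
optimizer = torch.optim.Adam(model.parameters(), lr=0.002)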

Training

I started my experiments with the single-image approach: a plain MSE loss trained for 2K iterations, which unsurprisingly gave no meaningful results (many papers do use a simple MSE loss, but they train their networks for several million iterations). I then switched to the Huber loss.
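
In PyTorch the Huber loss is readily available, so swapping it in is a one-line change; the tensors below are just dummy shapes matching my 320x180 depth maps.

import torch
import torch.nn as nn

criterion = nn.SmoothL1Loss()          # classic Huber loss with beta = 1
# criterion = nn.HuberLoss(delta=1.0)  # equivalent option in newer PyTorch versions

pred = torch.rand(5, 1, 180, 320)      # dummy batch of predicted depth maps
target = torch.rand(5, 1, 180, 320)    # dummy batch of ground-truth depth maps
loss = criterion(pred, target)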

With this loss I started using other inputs: the two (stereo) images from the left and right cameras combined into a single 3-channel RGB image, and the same two images stacked along the channel axis (6x320x180). I mostly trained for 4K-6K iterations with a batch size of 5-10 and 20 epochs per batch. With the Huber loss I couldn't get any results (maybe longer training would have helped, but my computational budget didn't allow that luxury), so I decided to try a different loss: the scale-invariant loss from this paper by David Eigen.
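
As an illustration (not the exact code I used), here is one plausible way to build the two stereo input formats; the simple averaging in blend_stereo is just one example of merging the frames into a 3-channel image.

import torch

def blend_stereo(left, right):
    # Merge left/right RGB frames into a single 3-channel image by averaging
    # (only an example; other blends are possible).
    # left, right: tensors of shape (3, 180, 320) with values in [0, 1].
    return (left + right) / 2.0

def stack_stereo(left, right):
    # Concatenate the two frames along the channel axis -> (6, 180, 320).
    # The network's first conv layer must then accept 6 input channels.
    return torch.cat([left, right], dim=0)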

My implementation follows the scale-invariant log loss from that paper.
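
The loss penalizes per-pixel errors in log depth while discounting a global scale shift; a PyTorch sketch of it (lam = 0.5 as in the paper, variable names illustrative) looks roughly like this:

import torch

def scale_invariant_loss(pred, target, lam=0.5, eps=1e-6):
    # Scale-invariant log loss from Eigen et al. (2014).
    # pred, target: depth tensors of shape (B, 1, H, W) with positive values.
    d = torch.log(pred + eps) - torch.log(target + eps)
    n = d[0].numel()                                      # pixels per sample
    mse_term = (d ** 2).sum(dim=(1, 2, 3)) / n
    scale_term = lam * d.sum(dim=(1, 2, 3)) ** 2 / (n ** 2)
    return (mse_term - scale_term).mean()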

After implementing that loss, here are my results after very few iterations of training.

input of single image method
same input of left and right camera images combined
label for both methods
model prediction after 3K iteration training for single image
model prediction after 6K iteration training for combined stereo image
middle camera images for stacked stereo input (this image of the middle camera is not used in the input)
label for above input
model prediction after 6K iteration training on stacked stereo image input

The real problem is that no cars are visible in the predictions, and that is bad, because the first thing we need to see here are the other agents on the road. I am fairly confident that, with much longer training, the model would be able to do what we want. As a final note, I let the single-image method train for another 12K iterations, and here are a couple of results.

Not bad for 12K iterations, right? Here are some more.

You can find all of this code on my GitHub page.
