Preventing Fatalities by playing Video Games (Part 1)

Enrico Busto
Analytics Vidhya
Nov 12, 2020

a.k.a. Training Self-Driving Cars with Virtual Worlds

Self-driving cars should already be here, but they have not yet arrived, even though Tesla, with the latest version of the FSD Beta, now seems very close to success.
In this series of articles, we try to understand why building an autonomous driving system is so difficult and time-intensive, and how we can speed up the process using “video games”.

Deep Learning systems learn from examples, i.e., from input/output pairs. In the specific case of vision systems, these pairs are images in input and structured data in output. The input images are easy to generate because they are nothing more than frames from the vehicle’s onboard cameras. The generation of the reference output data (the so-called “Ground Truth”), on the other hand, is much more expensive because it must be done by hand.
For example, for “Semantic Segmentation” tasks, each pixel of the image must be assigned to a class. In other words, each frame of the training videos must be hand-colored with the color corresponding to that pixel’s class (pedestrian, traffic light, tree, and so on).
Semantic segmentation is particularly necessary when the system must recognize objects or features of the image that do not have a shape determined a priori, or whose shape is strongly distorted by perspective from the onboard camera’s point of view.
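To make the labeling format concrete, here is a minimal sketch of what a semantic segmentation ground truth looks like: an array of class IDs with the same height and width as the camera frame, plus a color palette used to display it as a “hand-colored” image. The class names, IDs, and colors below are illustrative placeholders, not taken from any specific dataset.

```python
import numpy as np

# Illustrative class IDs and display colors (placeholders, not from a real dataset).
CLASSES = {0: "road", 1: "pedestrian", 2: "traffic light", 3: "tree"}
PALETTE = {0: (128, 64, 128), 1: (220, 20, 60), 2: (250, 170, 30), 3: (107, 142, 35)}

def colorize(label_map):
    """Turn an (H, W) array of class IDs into an (H, W, 3) RGB image,
    i.e. the 'hand-colored' frame described above."""
    rgb = np.zeros((*label_map.shape, 3), dtype=np.uint8)
    for class_id, color in PALETTE.items():
        rgb[label_map == class_id] = color
    return rgb

# A tiny 4x4 "frame" labeled by hand: every pixel gets exactly one class.
label_map = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 3],
    [0, 2, 3, 3],
    [0, 2, 3, 3],
])
print(colorize(label_map).shape)  # (4, 4, 3)
```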

Demo video of ICNet on the Cityscapes dataset

Training an AI system to solve this segmentation task requires hundreds of thousands of these labeled frames in order to capture a large number of diverse situations. On top of that, the labeling process can take up to an hour for a single frame, making it impractical and costly to collect the necessary data.

So, why not speed up this process with synthetic images?

Is it possible?


The good news is that we can use a physics simulator to produce realistic images that are automatically labeled.
The bad news is that it is impossible to produce perfect images: at the pixel level, colors, reflections, and brightness may change drastically.
We show here an example of real and synthetic images:

The top image is from a real city dataset; the bottom one is from the GTA 5 video game dataset.

Why should this be a problem?

Even if the system works well on synthetic data, it won’t maintain the same performance on real roads; it lacks generalization capabilities.

We can imagine a student preparing for an exam by studying many exercises “by heart” without understanding the theory that governs them. During the exam, he will not be able to solve problems that are slightly different or posed from a different point of view: he will not be able to adapt to the new situation (to the new domain) because he has not acquired the ability to generalize.
More technically, Domain Shift is the change in data distribution between an algorithm’s training dataset and the data it encounters when deployed, and Domain Adaptation is the ability to cope with this change.
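As a crude, purely illustrative way to see domain shift at the pixel level, one can compare simple per-channel statistics of real and synthetic frames. The arrays below are random placeholders standing in for the two datasets, so the numbers themselves mean nothing beyond showing the kind of gap a model has to bridge.

```python
import numpy as np

def channel_stats(images):
    """Per-channel mean and std of a batch of (N, H, W, 3) RGB images scaled to [0, 1]."""
    return images.mean(axis=(0, 1, 2)), images.std(axis=(0, 1, 2))

# Placeholder batches standing in for real footage and synthetic (simulator) renders.
rng = np.random.default_rng(0)
real = rng.uniform(0.2, 0.6, size=(8, 64, 64, 3))       # duller, desaturated city frames
synthetic = rng.uniform(0.4, 1.0, size=(8, 64, 64, 3))  # brighter, more saturated renders

real_mean, _ = channel_stats(real)
synth_mean, _ = channel_stats(synthetic)
print("per-channel mean gap:", synth_mean - real_mean)  # a crude, visible sign of domain shift
```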

Source: Multi-Layer domain adaptation method for rolling bearing fault diagnosis

How do we know how well the model is working?

Once the model is trained to automatically reproduce the labeling process on images it has never seen before, we need to evaluate it. In other words, we need a metric that indicates how well the model can recreate the labels.

The most commonly used metric for the semantic segmentation task is the Intersection over Union (IoU). The idea is to maximize the area where the prediction and the ground truth overlap.

Intersection over Union (IoU)

I invite you to read “Metrics to Evaluate your Semantic Segmentation Model” by Ekin Tiu for further details on the different available metrics.

Another good source is Jeremy Jordan’s article, from which I have taken this image.

The IoU value ranges between 0 and 1, where 0 is the worst case, with no overlap at all, and 1 is a perfect segmentation.
For the self-driving task, many classes need to be recognized simultaneously. So we compute the IoU score for each class separately and then average the results over all the classes. The resulting global value is called the Mean Intersection over Union (MIoU).
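As a minimal sketch of how these two metrics can be computed from a predicted label map and its ground truth (here with NumPy, and skipping classes that appear in neither map, which is one common convention):

```python
import numpy as np

def iou_per_class(pred, gt, num_classes):
    """IoU per class: |prediction ∩ ground truth| / |prediction ∪ ground truth|."""
    scores = []
    for c in range(num_classes):
        intersection = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        # Class absent from both prediction and ground truth: skip it.
        scores.append(float("nan") if union == 0 else intersection / union)
    return scores

def mean_iou(pred, gt, num_classes):
    """Average the per-class IoU scores to obtain the MIoU."""
    return np.nanmean(iou_per_class(pred, gt, num_classes))

# Tiny example: 3 classes on a 2x4 label map.
gt   = np.array([[0, 0, 1, 1],
                 [2, 2, 1, 1]])
pred = np.array([[0, 1, 1, 1],
                 [2, 2, 1, 0]])
print(mean_iou(pred, gt, num_classes=3))  # ≈ 0.64
```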

So, what have we done?

Now we have all the ingredients to sketch a potential solution. We tried two different approaches, both based on Generative Adversarial Networks (GANs).

Pixel-level Adaptation: here, we transfer the style of the real images onto the synthetic ones, reducing the differences (colors, saturation, brightness…) at the pixel level.

Feature-level Adaptation: here, we work directly on the model, finding a way to train it with labeled synthetic images and unlabeled real images. A conceptual sketch of both setups is shown below.
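Purely as a conceptual sketch of where each adaptation acts (the actual implementations are described in Part 2), and assuming hypothetical components supplied by the caller (a GAN generator, a segmentation model, a discriminator, and the corresponding losses), the two training steps might look like this:

```python
# Conceptual sketch only: every component here (generator, segmenter, discriminator,
# seg_loss, adv_loss) is a hypothetical placeholder supplied by the caller,
# not code from the project described in this article.

def pixel_level_step(synthetic_image, synthetic_label, generator, segmenter, seg_loss):
    """Pixel-level adaptation: restyle the synthetic frame to look 'real' with a GAN
    generator, then train the segmenter as usual on the restyled image."""
    realistic_image = generator(synthetic_image)       # synthetic -> real style
    prediction = segmenter(realistic_image)
    return seg_loss(prediction, synthetic_label)       # labels come for free from the simulator

def feature_level_step(synthetic_image, synthetic_label, real_image,
                       segmenter, discriminator, seg_loss, adv_loss):
    """Feature-level adaptation: supervise on synthetic data while pushing the model's
    internal features for synthetic and real frames to be indistinguishable."""
    synth_features, prediction = segmenter(synthetic_image, return_features=True)
    real_features, _ = segmenter(real_image, return_features=True)  # no labels needed
    return seg_loss(prediction, synthetic_label) + adv_loss(discriminator,
                                                            synth_features, real_features)
```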

To see the details, check my next article: Preventing Fatalities by playing Video Games (Part 2)


Enrico Busto, Founding Partner and CTO @ Addfor S.p.A. We develop Artificial Intelligence Solutions.