Published in Analytics Vidhya

Preventing Fatalities by playing Video Games (Part 3)

aka Training Self-Driving with Virtual Worlds

The previous two articles in this series presented some challenges in training self-driving systems and a first method to overcome them: Pixel-level Adaptation. Today we’ll see a second approach: Feature-level Adaptation.

Feature-level Adaptation

This method adds a loss that measures the segmentation model’s ability to fool a discriminator trained with the labeled data. If the model can do that even on images it has never seen before, it has learned to produce very realistic segmentations.

Moreover, this kind of training can also help to reduce the domain shift. Instead of working in pixel space, we operate on the segmented versions of the images and minimize the difference between them. The segmentation process discards many details, such as brightness or color, and keeps only the semantic information. This allows us to feed the discriminator with the segmented versions of images from other domains.

Source: Learning to Adapt Structured Output Space for Semantic Segmentation

Now let’s go into the details of the network architecture used for this experiment:

The architecture is composed of a segmentation network (which we call the generator) and a discriminator. The generator consists of a feature extractor and two classifiers. The segmentation of the input image is produced by an ensemble prediction of the two classifiers. We use this prediction to compute all three model losses.
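As a rough illustration of the ensemble step, here is a minimal NumPy sketch. It assumes that the ensemble is the softmax of the summed logits of the two classifiers; the function names, the tiny map size and the random logits are hypothetical and only serve the example.

```python
import numpy as np

def softmax(logits, axis=0):
    """Numerically stable softmax along the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def ensemble_prediction(logits_1, logits_2):
    """Combine two classifier outputs of shape (classes, H, W).

    Assumption: the ensemble is the softmax of the summed logits.
    """
    return softmax(logits_1 + logits_2, axis=0)

rng = np.random.default_rng(0)
num_classes, height, width = 19, 4, 6  # illustrative sizes only

# Stand-ins for the outputs of the two classifiers on one image.
logits_1 = rng.normal(size=(num_classes, height, width))
logits_2 = rng.normal(size=(num_classes, height, width))

probs = ensemble_prediction(logits_1, logits_2)  # per-pixel class probabilities
label_map = probs.argmax(axis=0)                 # per-pixel class index, shape (H, W)
```

The probabilities feed the losses, while the arg-max map is what we would look at as the final segmentation.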

The first one is the segmentation loss, calculated with respect to the ground-truth labels in the dataset.
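A segmentation loss of this kind is commonly a per-pixel cross-entropy. The sketch below assumes that choice; the function name `segmentation_loss` and the two-pixel example are hypothetical.

```python
import numpy as np

def segmentation_loss(probs, labels, eps=1e-7):
    """Mean per-pixel cross-entropy.

    probs:  (classes, H, W) predicted class probabilities
    labels: (H, W) integer ground-truth class per pixel
    """
    rows, cols = np.indices(labels.shape)
    # Probability that the model assigned to the correct class at each pixel.
    picked = probs[labels, rows, cols]
    return -np.log(np.clip(picked, eps, 1.0)).mean()

# Two classes, a 1x2 image: pixel 0 is class 0, pixel 1 is class 1.
probs = np.array([[[0.8, 0.3]],
                  [[0.2, 0.7]]])
labels = np.array([[0, 1]])

loss = segmentation_loss(probs, labels)
```

The loss shrinks as the probability of the correct class approaches 1 at every labeled pixel.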

Since we do not have enough labels to fully train the generator, we also use unlabeled images. For these images, we calculate the category adversarial loss, which depends on the discriminator’s ability to distinguish the generator’s segmentation from the real one.
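To make the two sides of this game concrete, here is a minimal sketch using binary cross-entropy. It assumes that the discriminator outputs, for every pixel, the probability that a segmentation comes from the labeled domain; the label convention (`SOURCE`, `TARGET`) and the function names are hypothetical.

```python
import numpy as np

SOURCE, TARGET = 1.0, 0.0  # hypothetical label convention for the discriminator

def bce(prob, target, eps=1e-7):
    """Per-pixel binary cross-entropy between probabilities and a 0/1 target."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return -(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))

def discriminator_loss(d_on_source, d_on_target):
    """The discriminator learns to tell the two kinds of segmentation apart."""
    return bce(d_on_source, SOURCE).mean() + bce(d_on_target, TARGET).mean()

def generator_adversarial_loss(d_on_target):
    """The generator is rewarded when its unlabeled-image output is scored as SOURCE."""
    return bce(d_on_target, SOURCE).mean()

# Stand-ins for discriminator outputs on a 4x6 segmentation of an unlabeled image.
fooled = np.full((4, 6), 0.9)  # discriminator believes the segmentation
caught = np.full((4, 6), 0.1)  # discriminator spots the segmentation

loss_fooled = generator_adversarial_loss(fooled)
loss_caught = generator_adversarial_loss(caught)
```

The generator's loss is low when the discriminator is fooled and high when it is not, which is what pushes the segmentations of unlabeled images toward realistic ones.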

The adversarial loss is not calculated uniformly over all the predicted pixels: we want to penalize mostly the areas where the prediction is uncertain. These areas are indicated in the Local Assignment Map.

We use the predictions of the two classifiers to create this map: the more the predictions diverge, the more uncertain the model is.

This architecture already exists in the literature, where it is called Category-Level Adversarial Network (CLAN).
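One way to turn the disagreement between the two classifiers into a per-pixel weight is sketched below. It assumes that divergence is measured with the cosine distance between the two probability vectors at each pixel; `lambda_local` and `base_weight` are hypothetical parameters with illustrative values.

```python
import numpy as np

def local_assignment_map(probs_1, probs_2, eps=1e-8):
    """Per-pixel cosine distance between two (classes, H, W) probability maps.

    Assumption: 0 means the classifiers agree, larger values mean they diverge.
    """
    dot = (probs_1 * probs_2).sum(axis=0)
    norms = np.linalg.norm(probs_1, axis=0) * np.linalg.norm(probs_2, axis=0)
    return 1.0 - dot / (norms + eps)

def weighted_adversarial_loss(per_pixel_loss, weight_map,
                              lambda_local=10.0, base_weight=0.4):
    """Weight the adversarial loss more heavily where the model is uncertain.

    lambda_local and base_weight are illustrative values, not tuned settings.
    """
    weights = lambda_local * weight_map + base_weight
    return (weights * per_pixel_loss).mean()

# Two classes, a 1x2 image. The classifiers agree on pixel 0 and disagree on pixel 1.
probs_1 = np.array([[[0.9, 0.9]],
                    [[0.1, 0.1]]])
probs_2 = np.array([[[0.9, 0.1]],
                    [[0.1, 0.9]]])

uncertainty = local_assignment_map(probs_1, probs_2)  # shape (1, 2)
per_pixel_loss = np.ones((1, 2))                      # stand-in adversarial loss
loss = weighted_adversarial_loss(per_pixel_loss, uncertainty)
```

The small constant `base_weight` keeps a little adversarial pressure on every pixel, even where the two classifiers agree perfectly.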

We extended this model by adding an auxiliary network that solves a self-supervised task.

More precisely, it has been shown that solving auxiliary tasks, such as recognizing the orientation of an image, improves the generalization capacity of the learned model. This task does not require manual annotations: it’s a self-supervised method.

After being scaled to a square resolution, the segmentation network’s output prediction is randomly rotated by a multiple of 90 degrees. The auxiliary network has the task of guessing the rotation that was applied. The resulting error is called the Auxiliary Loss and, once propagated backward, improves the generalization of the features extracted by the encoder, making them invariant in space.
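The data side of this task needs no annotation, as the following sketch shows. It assumes a nearest-neighbour resize and a four-way cross-entropy as the Auxiliary Loss; the function names are hypothetical, and the zero logits stand in for the output of the auxiliary network, which is not implemented here.

```python
import numpy as np

def resize_to_square(prediction, size):
    """Nearest-neighbour resize of a (classes, H, W) map to (classes, size, size)."""
    _, h, w = prediction.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return prediction[:, rows][:, :, cols]

def make_rotation_sample(prediction, rng):
    """Rotate a square prediction by k * 90 degrees; k (0..3) is the free label."""
    k = int(rng.integers(0, 4))
    rotated = np.rot90(prediction, k=k, axes=(1, 2))
    return rotated, k

def cross_entropy(logits, label):
    """Cross-entropy of a single logit vector against an integer label."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

rng = np.random.default_rng(0)
prediction = rng.random((3, 4, 8))          # stand-in segmentation output (classes, H, W)
square = resize_to_square(prediction, 4)    # (3, 4, 4)
rotated, k = make_rotation_sample(square, rng)

aux_logits = np.zeros(4)                    # placeholder for the auxiliary network output
aux_loss = cross_entropy(aux_logits, k)     # an undecided guess costs log(4)
```

Because the label `k` is generated together with the rotation, this loss can be computed on any image, labeled or not.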

As a baseline, we take the best MIoU (mean Intersection over Union) measured on the Cityscapes validation set when training the base segmentation model (DeepLab v2) with standard parameters on the source dataset, without any domain adaptation technique.
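For readers unfamiliar with the metric, here is a minimal sketch of how a mean IoU can be computed from a confusion matrix. The function name `mean_iou` is hypothetical, the sketch ignores the handling of unlabeled pixels, and the result is multiplied by 100 to match the percentage scale used in this article.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union between two integer label maps, in percent."""
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    flat = num_classes * target.ravel() + pred.ravel()
    conf = np.bincount(flat, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)

    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection

    # Skip classes that appear in neither the prediction nor the ground truth.
    valid = union > 0
    return 100.0 * (intersection[valid] / union[valid]).mean()

target = np.array([[0, 0],
                   [1, 1]])
pred = np.array([[0, 1],
                 [1, 1]])

score = mean_iou(pred, target, num_classes=2)
```

In this toy case class 0 has an IoU of 1/2 and class 1 an IoU of 2/3, so the mean is their average.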

With this baseline, we obtain an MIoU of 37.97.

For the first experiment, we trained the original Category-Level Adversarial Network; in other words, we removed the self-supervised module.
Using only the adversarial loss, we obtained an MIoU of 41.82.

Then we tested the image rotation method alone, and we obtained an MIoU of 42.63.

Lastly, we tested the full architecture, and we obtained an MIoU of 39.87.

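From this experiment, we can conclude that all the methods beat the baseline, and that the self-supervised task alone is more effective than the adversarial method alone. Note, however, that the full architecture, which combines the two, scores lower than either method used on its own.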
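The gains over the baseline follow directly from the MIoU values reported above. The short snippet below only does that arithmetic; the dictionary keys are labels chosen for this example.

```python
baseline = 37.97  # DeepLab v2 trained on the source dataset, no adaptation

results = {
    "adversarial only (CLAN)": 41.82,
    "rotation task only": 42.63,
    "full architecture": 39.87,
}

# Improvement of each experiment over the baseline, in MIoU points.
gains = {name: round(score - baseline, 2) for name, score in results.items()}

best = max(gains, key=gains.get)
```

With these numbers, the rotation task alone gives the largest gain and the full architecture the smallest.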

Lastly, we combined the two approaches: we applied our final architecture to the dataset adapted with CycleGAN.
We used the best result obtained with the pixel-level adaptation method as a baseline and repeated the three experiments.

In conclusion, our experiments suggest that pixel-level adaptation techniques are more effective and promising than feature-level adaptation for the datasets used in our experiments.

By combining these techniques, it is possible to obtain results close to the state of the art.

Furthermore, it has been shown that with a limited amount of real labeled data available, synthetic data can help the generalization of a semantic segmentation network.



Enrico Busto

Founding Partner and CTO @ Addfor S.p.A. We develop Artificial Intelligence Solutions.