Semantic Segmentation with PyTorch: U-NET from scratch

Alessandro Mondin
7 min read · Jul 24, 2022


First of all let’s understand if this article is for you:

  • You should read it if you are either a data-scientist/ML engineer or a nerd who is approaching semantic segmentation.
  • You shouldn’t read it if you’re trying to understand multi-class semantic segmentation. My U-NET was trained on the Davis 2017 dataset and the target masks are not class-specific (their colours are random).

If you’ve reached this point, then this article is for you. Let’s now focus on the implementation.

The whole model is composed of the following .py files:

  1. model.py
  2. dataset.py
  3. train.py
  4. config.py

MODEL.PY

This is the UNET architecture and the highlighted parts are the subclasses that I used to build the model: CNNBlock, CNNBlocks, Encoder and Decoder.

  • The CNNBlock applies, in sequence, a convolutional layer, a batch normalization and a ReLU activation function.
  • The CNNBlocks runs a CNNBlock n_conv times in a row: since n_conv is set to 2 in our case, it performs two convolutions.
  • The Encoder repeats, downhill times, a CNNBlocks followed by a MaxPool2d layer, storing a route_connection before each pooling. If you go back to the image of the UNET architecture, you can visualize this by counting how many times two blue arrows are followed by a red arrow: 4 times. Since the last CNNBlocks doesn’t require a MaxPool2d, we add it outside the for loop (a sketch of these three classes follows this list).
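To make the description above concrete, here is a minimal sketch of how these three classes could look. It is not a copy of the repository code: the constructor arguments and defaults (n_conv, first_out_channels, downhill, padding) are assumptions that simply mirror the names used in this article.

```python
import torch
import torch.nn as nn


class CNNBlock(nn.Module):
    """Conv2d -> BatchNorm2d -> ReLU."""
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=0):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class CNNBlocks(nn.Module):
    """Chains n_conv CNNBlock modules (n_conv=2 in the original UNET)."""
    def __init__(self, n_conv, in_channels, out_channels, padding=0):
        super().__init__()
        layers = []
        for _ in range(n_conv):
            layers.append(CNNBlock(in_channels, out_channels, padding=padding))
            in_channels = out_channels
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)


class Encoder(nn.Module):
    """Repeats CNNBlocks + MaxPool2d `downhill` times, saving a route_connection before each pooling."""
    def __init__(self, in_channels, first_out_channels, padding=0, downhill=4):
        super().__init__()
        self.blocks = nn.ModuleList()
        out_channels = first_out_channels
        for _ in range(downhill):
            self.blocks.append(CNNBlocks(2, in_channels, out_channels, padding))
            in_channels, out_channels = out_channels, out_channels * 2
        # the bottom CNNBlocks has no pooling after it, hence it sits outside the loop
        self.blocks.append(CNNBlocks(2, in_channels, out_channels, padding))
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        route_connections = []
        for block in self.blocks[:-1]:
            x = block(x)
            route_connections.append(x)
            x = self.pool(x)
        x = self.blocks[-1](x)
        return x, route_connections
```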

Decoder → performs, uphill times, a transpose convolution, concatenates the output with the corresponding route_connection and feeds the concatenated tensor to a CNNBlocks. Finally, at the top-right of the picture, the output layer returns a tensor of shape [batch_size, n_classes=1, height, width]. Pay attention: the encoder’s downhill parameter must be equal to the decoder’s uphill one.
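A matching sketch of the Decoder, under the same assumptions and building on the classes above. With valid convolutions the stored route_connections are larger than the upsampled tensors, so here they are center-cropped before concatenation; the exact crop/concat details in the repository may differ.

```python
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF


class Decoder(nn.Module):
    """Repeats (transpose conv -> concat route_connection -> CNNBlocks) `uphill` times,
    then a 1x1 convolution maps to [batch_size, n_classes, height, width]."""
    def __init__(self, first_in_channels, n_classes=1, padding=0, uphill=4):
        super().__init__()
        self.ups = nn.ModuleList()
        self.convs = nn.ModuleList()
        channels = first_in_channels
        for _ in range(uphill):
            self.ups.append(nn.ConvTranspose2d(channels, channels // 2, kernel_size=2, stride=2))
            self.convs.append(CNNBlocks(2, channels, channels // 2, padding))
            channels //= 2
        self.output_layer = nn.Conv2d(channels, n_classes, kernel_size=1)

    def forward(self, x, route_connections):
        for up, conv in zip(self.ups, self.convs):
            x = up(x)
            skip = route_connections.pop()                   # last stored connection first
            skip = TF.center_crop(skip, list(x.shape[-2:]))  # crop the skip to the upsampled size
            x = conv(torch.cat([skip, x], dim=1))
        return self.output_layer(x)
```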

Lastly, we put all the pieces together to create our UNET class:
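One possible way to glue the two together, again with assumed defaults and reusing the sketches above:

```python
class UNET(nn.Module):
    """Encoder + Decoder; `downhill` must equal the decoder's `uphill`."""
    def __init__(self, in_channels=3, first_out_channels=64, n_classes=1, padding=0, downhill=4):
        super().__init__()
        self.encoder = Encoder(in_channels, first_out_channels, padding, downhill)
        self.decoder = Decoder(first_out_channels * 2 ** downhill, n_classes, padding, uphill=downhill)

    def forward(self, x):
        x, route_connections = self.encoder(x)
        return self.decoder(x, route_connections)


# quick shape check: with valid convolutions a 572x572 input comes out as 388x388
if __name__ == "__main__":
    model = UNET()
    out = model(torch.randn(1, 3, 572, 572))
    print(out.shape)  # torch.Size([1, 1, 388, 388])
```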

Here I would like to consider a few pros and cons of my implementation:

  1. It is redundant: as shown by Aladdin Persson in his implementation, this model could be built with ~70 lines of code (mine is ~150 lines). The main reason is that I created two distinct classes for the encoder and decoder instead of implementing them directly inside the UNET class. On the other hand, a pro of my version is its flexibility: you can experiment with variations of the original UNET just by modifying the first_out_channels and downhill parameters.
  2. Allowing a second input variable in the Decoder’s forward(self, x, route_connection) is not very elegant. This, again, could be solved by writing the encoder and decoder inside a single UNET class.

DATASET.PY

Since this class is ~80 lines I am not pasting it in the article but you can find it here.

This is the tree structure of the DAVIS_2017 dataset:

To help the visualisation I replaced all the classes with “class” in the subfolders of 480p

I adapted my dataloader from this one, modifying it to make it more concise and more readable.

How does it work?

Besides initialising the parameters, __init__() has the purpose of creating two lists, img_list and labels, which contain the paths to every image and mask in matching order (img_list[i] corresponds to labels[i]).

To achieve this we adopt the following strategy (shown here for img_list, but the same is done for labels); a short code sketch follows the list:

  1. We initialise the class with the path to the root_folder (DAVIS_2017 folder).
  2. We open and load train.txt, which contains a list of all the classes (“bear”, “parkour”, etc.).
  3. Via os.path.join() we reach each class_folder and then list all the contained images with os.listdir().
  4. We apply list(map(lambda x: os.path.join('path_to->', 'class_folder', x), images)), which, for a given class_folder, creates a list of paths to each image. You can read it this way: map(function, iterable) applies a given function to each element of an iterable; here our function is the lambda, which builds the path to img_name for each img_name in the images list.
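As an illustration, a minimal sketch of how img_list could be built. The subfolder names (ImageSets/2017, JPEGImages/480p) follow the standard DAVIS_2017 layout but are assumptions with respect to my actual code:

```python
import os

root_dir = "DAVIS_2017"

# 2. read the class names listed in train.txt, e.g. ["bear", "parkour", ...]
with open(os.path.join(root_dir, "ImageSets", "2017", "train.txt")) as f:
    classes = f.read().splitlines()

# 3-4. for every class folder, turn each file name into a full path
img_list = []
for class_folder in classes:
    class_path = os.path.join(root_dir, "JPEGImages", "480p", class_folder)
    images = sorted(os.listdir(class_path))
    img_list += list(map(lambda x: os.path.join(class_path, x), images))

# the same loop over Annotations/480p would fill the labels list with mask paths
```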

Then the __getitem__(self, idx) method will retrieve an image and a mask based on their idx in img_list and labels.

The steps of __getitem__(self, idx) are the following (a sketch follows the list):

  • We load the images with the PIL library and convert them into np.arrays.
  • We apply the data augmentation.
  • The Albumentations library doesn’t normalise the mask, therefore we normalise it ourselves.
  • In the original UNET paper they used “valid convolutions” (kernel_size=3, stride=1, padding=0) instead of “same convolutions” (kernel_size=3, stride=1, padding=1). Since we have loaded both images and masks with a shape of [batch_size, channels, height=388, width=388], in order to make our model output a tensor of the same shape, we have to input a tensor of shape [batch_size, channels, height=572, width=572]. To increase the height and width of the input tensor we use a reflection padding (or mirroring) of 92 pixels ((572 − 388) / 2).
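A minimal sketch of what __getitem__ could look like under these steps. The transform, the mask normalisation (here a simple binarisation, since the mask colours are not class-specific) and the tensor conversion are simplified assumptions, not a copy of the repository code:

```python
import numpy as np
import torch
from PIL import Image


def __getitem__(self, idx):
    # load the image and mask with PIL and convert them into np.arrays
    image = np.array(Image.open(self.img_list[idx]).convert("RGB"))
    mask = np.array(Image.open(self.labels[idx]).convert("L"))

    # data augmentation with Albumentations (self.transform is assumed to resize to 388x388)
    if self.transform is not None:
        augmented = self.transform(image=image, mask=mask)
        image, mask = augmented["image"], augmented["mask"]

    # Albumentations does not normalise the mask, so we do it ourselves (binarisation here)
    mask = (mask > 0).astype(np.float32)

    # reflection padding of 92 pixels per side: 388 -> 572, so the valid convolutions output 388
    image = np.pad(image, ((92, 92), (92, 92), (0, 0)), mode="reflect")

    # HWC numpy arrays -> CHW / 1HW float tensors
    image = torch.from_numpy(image).permute(2, 0, 1).float()
    mask = torch.from_numpy(mask).unsqueeze(0)
    return image, mask
```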

TRAIN.PY

This file uses many functions that are defined in utils.py while the hyper-parameters and other variables are located in the config.py.

If you’re new to torch.cuda.amp.GradScaler, check it out here and then take a look at the train_loop() in the utils.py file linked above.
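For orientation, here is a minimal mixed-precision training step with GradScaler; the actual train_loop() in utils.py also computes metrics and uses the names defined in config.py, so treat this only as a sketch:

```python
import torch

scaler = torch.cuda.amp.GradScaler()


def train_loop(loader, model, optimizer, loss_fn, device="cuda"):
    model.train()
    for images, masks in loader:
        images, masks = images.to(device), masks.to(device)

        with torch.cuda.amp.autocast():    # run the forward pass in mixed precision
            preds = model(images)
            loss = loss_fn(preds, masks)

        optimizer.zero_grad()
        scaler.scale(loss).backward()      # scale the loss to avoid float16 underflow
        scaler.step(optimizer)             # unscale the gradients and step the optimizer
        scaler.update()
```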

These are the steps performed in the train.py:

  • Defines the loss function, the model and the optimiser.
  • Loads the saved model and optimiser artifacts if CHECKPOINT is defined.
  • Imports the dataset’s data-loaders.
  • Then, for EPOCHS times, it trains the model on the whole dataset, logs the evaluation metrics to the console (dice score & the previously defined loss_fn), saves the model and finally saves images comparing the ground_truth mask with the predicted_mask (the output of the model). A compressed sketch of this flow follows the list.
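A compressed, hypothetical outline of that flow; the helper names (get_loaders, check_accuracy) and the constants standing in for config.py values are placeholders, not the exact names used in the repository:

```python
import torch
import torch.nn as nn

# stand-ins for the values normally read from config.py
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
LEARNING_RATE = 1e-4   # placeholder value
EPOCHS = 15            # placeholder value
CHECKPOINT = None

loss_fn = nn.BCEWithLogitsLoss()                  # single-class mask -> logits loss
model = UNET(in_channels=3, first_out_channels=64, n_classes=1).to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

if CHECKPOINT is not None:                        # resume from saved artifacts
    state = torch.load(CHECKPOINT, map_location=DEVICE)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])

train_loader, val_loader = get_loaders()          # placeholder for the dataset's data-loaders

for epoch in range(EPOCHS):
    train_loop(train_loader, model, optimizer, loss_fn, DEVICE)
    check_accuracy(val_loader, model, loss_fn, DEVICE)   # logs dice score and loss
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, "checkpoint.pth")
```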

RESULTS

I trained the model on an ml.p2.xlarge instance in SageMaker and the best results were obtained at the 14th epoch, when it achieved 0.48 recall and 0.52 dice score on the validation set. In the beginning I was quite disappointed by these results, but then I noticed that the models that achieved the best results in the DAVIS_2017 competition were pre-trained on either the ImageNet or COCO datasets. This fact should remind us of the power and importance of transfer learning.
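For reference, the dice score on a single-class mask can be computed along these lines (a minimal sketch, not necessarily the exact implementation in utils.py):

```python
import torch


def dice_score(logits, targets, eps=1e-8):
    # threshold the sigmoid output to a binary mask, then compare it with the ground truth
    preds = (torch.sigmoid(logits) > 0.5).float()
    intersection = (preds * targets).sum()
    return (2 * intersection + eps) / (preds.sum() + targets.sum() + eps)
```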

Here below are some of the best and worst predictions on the validation set at the 14th epoch:

Some of the best predictions
Some of the worst predictions

It is quite hard to identify shared patterns among the bad predictions: you could argue that in both pictures the brown colour is predominant and the colours of the foreground and background are similar. The best way to improve the detection of such objects might be, as always, to exploit pre-training on a larger dataset.

CONCLUSION

Before starting a project you should check that your dataset suits your purpose. In my case, the DAVIS dataset was created to support the research of video-object-detection, and therefore the observations for each class are (almost identical) sequential frames of a video.

These are the images 00034.jpg, 00035.jpg, 00036.jpg, 00037.jpg of the car-turn class

Given my initial purpose of training an algorithm for image semantic segmentation, I would rather have chosen another dataset, since DAVIS doesn’t provide enough generalisation.

These limits become noticeable when testing the model in a context that is far from those of DAVIS:

Myself while crossing the Ponte Tibetano of Bellinzona (CH)

This should remind us that, in order to train a neural network to detect our target objects effectively, we must provide it with plenty of good-quality data that suits our purpose.

I hope that this tutorial was helpful for you and thank you for reading it!

