Aerial Semantic Segmentation using U-Net Deep Learning Model

Aimal Rehman
6 min read · Dec 30, 2021


In this article, I am going to explain the U-Net model and apply it to the Semantic Drone Dataset for pixel-wise classification. Let us have a look at the U-Net model first.

U-Net for Semantic Segmentation:

U-Net [2] is built upon the Fully Convolutional Network (FCN) first introduced in [1] and is used for semantic segmentation tasks. FCNs were a breakthrough in the line of progression for image classification in the sense that they can perform fine-grained predictions, i.e., pixel-wise classification of images. This helps in object detection and localization.

The U-Net model is trained end-to-end and pixel-to-pixel on Semantic Segmentation. It follows the same Encoder-Decoder architecture as FCN, does not contain any dense layer, and in its Decoder part, contains a large number of feature channels to pass on the contextual information towards the higher resolution layers.

The network architecture of the U-Net model presented in [2] is shown below. In this article, the U-Net model is trained on the Semantic Drone Dataset to detect a range of objects in aerial imagery.

U-NET architecture presented in [2]. Notice how it makes the shape of a “U”.

One of the advantages of the U-Net architecture is that it can be trained well on datasets with very few images. In the original paper, it was designed and trained for biomedical image segmentation tasks. The predictive goal was to classify each pixel as either 1) containing cancer cells or 2) not containing cancer cells; it was, therefore, a pixel-wise binary classification algorithm. For such tasks, only datasets with very few images are usually available, and it was shown that this architecture does not require a humongous amount of image data to learn a pattern. However, the dataset was still augmented by applying elastic deformations to the images so the network could learn deformation invariance.

As mentioned before, the network architecture consists of an Encoder and a Decoder. The Encoder consists of a contracting path and the Decoder, an expansive path. In the image above, the left side of the architecture is the Encoder and the right side is the Decoder. The Encoder has the same architecture as that of a Convolutional Neural Network (ConvNet), except that the dense layer of ConvNets is transformed into a convolutional layer. The U-Net model is then an extension of a ConvNet with additional convolutional layers that upsample the feature maps using deconvolution (transposed convolution) operations.
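To make this concrete, here is a minimal sketch of one contracting step and one expansive step in Keras. This is an illustration of the pattern, not the exact implementation used later in the article; the filter counts and kernel sizes are placeholders.

```python
# Minimal sketch of one Encoder (contracting) and one Decoder (expansive)
# step in Keras. Filter counts and sizes are illustrative.
from tensorflow.keras import layers

def encoder_block(x, filters):
    # Two 3x3 convolutions followed by 2x2 max-pooling; the pre-pooled
    # feature map is kept as a skip connection for the Decoder.
    c = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    c = layers.Conv2D(filters, 3, padding="same", activation="relu")(c)
    p = layers.MaxPooling2D(2)(c)
    return c, p  # (skip connection, downsampled output)

def decoder_block(x, skip, filters):
    # A 2x2 transposed convolution (deconvolution) doubles the resolution;
    # the skip connection is concatenated to restore spatial detail.
    u = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
    u = layers.concatenate([u, skip])
    u = layers.Conv2D(filters, 3, padding="same", activation="relu")(u)
    u = layers.Conv2D(filters, 3, padding="same", activation="relu")(u)
    return u
```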

How is an FCN formed from a typical ConvNet?

Let us first understand how an FCN (or U-Net) is a transformation of a ConvNet. A typical ConvNet, e.g., LeNet or AlexNet, has fully connected layers that have fixed dimensions and throw away spatial coordinates. That means it loses the information of "where" but retains the information of "what" in an image and can, therefore, perform image classification. This coarse classification network can be converted into a fine-grained classification network if these fully connected layers are viewed as convolutions with kernels that cover their entire input regions. The output of such a network is a classification map, or heatmap. Next, we need ground truth at each output cell to optimize the convolution operations: each image must come with a mask carrying a label for every pixel/cell. With the addition of further layers and a spatial loss function, the network can then perform end-to-end dense learning.
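As a rough Keras sketch of this "convolutionalization" idea (the shapes are illustrative VGG-style numbers, not taken from the article's model): a fully connected layer acting on a 7x7x512 feature map is equivalent to a 7x7 convolution covering that whole region.

```python
# Sketch: replacing Flatten() + Dense() with convolutions whose kernels
# cover the whole input region. Shapes and class count are illustrative.
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(224, 224, 3))
x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D(32)(x)            # stand-in for a deep conv stack -> 7x7x64
# Instead of Flatten() + Dense(4096), use a 7x7 convolution:
x = layers.Conv2D(4096, 7, activation="relu")(x)
# Class scores per spatial cell via a 1x1 convolution -> a coarse heatmap.
heatmap = layers.Conv2D(21, 1)(x)         # 21 classes, purely illustrative
models.Model(inputs, heatmap).summary()
```

On a larger input, the same network would produce a spatial grid of class scores instead of a single prediction, which is exactly the coarse heatmap described above.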

While a general deep neural net computes a nonlinear function, a net with only layers of this form computes a nonlinear filter, or deep filter. Writing $x_{ij}$ for the data vector at location $(i, j)$ in a particular layer and $y_{ij}$ for the corresponding vector in the following layer, each layer computes

$$y_{ij} = f_{ks}\left(\{x_{si+\delta i,\, sj+\delta j}\}_{0 \le \delta i,\, \delta j \le k}\right),$$

where $k$ is the kernel/filter size, $s$ is the subsampling stride, and $f_{ks}$ determines the layer type: convolution, average pooling, max-pooling, or elementwise activation function. This functional form is maintained under composition of layers, since the filter size and the stride follow the transformation rule

$$f_{ks} \circ g_{k's'} = (f \circ g)_{k' + (k-1)s',\; ss'}.$$
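As a quick worked example of this rule (the layer sizes are illustrative): a $3 \times 3$ convolution with stride 1 ($k' = 3$, $s' = 1$) followed by a $2 \times 2$ max-pool with stride 2 ($k = 2$, $s = 2$) composes to a single filter with kernel size $k' + (k-1)s' = 3 + (2-1)\cdot 1 = 4$ and stride $ss' = 2$.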

Dataset Description:

The Semantic Drone Dataset focuses on semantic understanding of urban scenes to increase the safety of autonomous drone flight and landing procedures. The dataset contains 400 publicly available images, along with pixel-accurate annotations for the same set. The complexity of the dataset is limited to the 23 classes listed below:
[tree, grass, vegetation, dirt, gravel, rocks, water, paved area, pool, person, dog, car, bicycle, roof, wall, fence, fence-pole, window, door, obstacle, ar-marker, bald tree, conflicting] plus an unlabeled class.
The dataset can be found at this link: http://dronedataset.icg.tugraz.at
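As a rough sketch of how the images and label masks might be loaded and prepared (the directory names, file extensions, and single-channel class-id mask format are assumptions about the local copy of the dataset, not part of its specification):

```python
# Rough sketch of loading image/mask pairs with Pillow and NumPy.
# Paths and mask format are assumptions about the local setup.
import os
import numpy as np
from PIL import Image

IMG_DIR, MASK_DIR = "dataset/images", "dataset/labels"  # hypothetical paths
H, W, NUM_CLASSES = 256, 256, 24  # 23 classes + unlabeled

def load_pair(name):
    img = Image.open(os.path.join(IMG_DIR, name)).resize((W, H))
    mask = Image.open(os.path.join(MASK_DIR, name.replace(".jpg", ".png")))
    # NEAREST resampling keeps mask values as valid class ids while resizing.
    mask = mask.resize((W, H), resample=Image.NEAREST)
    x = np.asarray(img, dtype=np.float32) / 255.0
    # One-hot encode each pixel's class id for categorical cross-entropy.
    y = np.eye(NUM_CLASSES, dtype=np.float32)[np.asarray(mask)]
    return x, y
```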

Multiclass Semantic Segmentation with U-Net:

In [2], U-Net was originally designed to perform binary classification. With very little modification in the final layer, according to the dataset and the desired classification task, the U-Net model can perform multiclass classification.
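In Keras terms, the change can be as small as the following sketch (NUM_CLASSES and the one-hot target format are assumptions about the setup, not details taken from [2]):

```python
# Final layer for multiclass segmentation: a 1x1 convolution with one output
# channel per class and a softmax across channels at every pixel. For the
# original binary task this would be a single sigmoid channel instead.
from tensorflow.keras import layers

NUM_CLASSES = 24  # 23 labeled classes + unlabeled
final_layer = layers.Conv2D(NUM_CLASSES, 1, activation="softmax")
# Applied to the last decoder feature map and trained with
# loss="categorical_crossentropy" on one-hot masks.
```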

Implementation
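The original code is not reproduced here, but the following is a minimal end-to-end sketch of how such a U-Net can be assembled and compiled in Keras. The input size and filter counts are illustrative; the training settings (Adam at 10^-4, he_normal initialization) match the ones discussed in the results below, though the actual implementation may differ in detail.

```python
# Minimal sketch of a U-Net assembled in Keras. Input size and filter
# counts are illustrative, not the article's exact configuration.
from tensorflow.keras import layers, models, optimizers

NUM_CLASSES = 24

def conv_block(x, filters):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu",
                      kernel_initializer="he_normal")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu",
                      kernel_initializer="he_normal")(x)
    return x

def build_unet(input_shape=(256, 256, 3)):
    inputs = layers.Input(shape=input_shape)
    skips, x = [], inputs
    # Contracting path: convolve, remember the skip, then downsample.
    for f in (64, 128, 256, 512):
        c = conv_block(x, f)
        skips.append(c)
        x = layers.MaxPooling2D(2)(c)
    x = conv_block(x, 1024)  # bottleneck
    # Expansive path: upsample, concatenate the matching skip, convolve.
    for f, skip in zip((512, 256, 128, 64), reversed(skips)):
        x = layers.Conv2DTranspose(f, 2, strides=2, padding="same")(x)
        x = layers.concatenate([x, skip])
        x = conv_block(x, f)
    outputs = layers.Conv2D(NUM_CLASSES, 1, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_unet()
model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
```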

Results and Predictions

Initially, my model struggled to achieve good accuracy on both the training and validation sets. So I tried turning a few knobs:
1) Increasing the data by performing deformations on the images (a sketch of how such augmentation can be wired into training follows this list)
2) Varying the batch size from 1 to 8 (I noticed it was doing better at 4, so I kept that)
3) Changing the learning rate (LR) of Keras's Adam optimizer from the default (10^-3) to 10^-4
4) Trying different weight initialization techniques. I noticed that with the 'he_normal' initialization method, my model was able to go above 80% accuracy on both the training and validation sets, so I concluded my experimentation after this.
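As promised above, here is a sketch of how knob 1 can be wired into training: paired Keras generators apply the same random transform to images and masks through a shared seed. The arrays below are zero placeholders standing in for the real prepared data, and the augmentation parameters are illustrative.

```python
# Paired augmentation sketch: identical random transforms on images and
# one-hot masks via a shared seed. Arrays are placeholders.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

x_train = np.zeros((8, 256, 256, 3), dtype=np.float32)   # placeholder images
y_train = np.zeros((8, 256, 256, 24), dtype=np.float32)  # placeholder one-hot masks

aug = dict(rotation_range=15, zoom_range=0.1, horizontal_flip=True)
image_it = ImageDataGenerator(**aug).flow(x_train, batch_size=4, seed=42)
# interpolation_order=0 keeps the one-hot mask values binary under warps.
mask_it = ImageDataGenerator(**aug, interpolation_order=0).flow(
    y_train, batch_size=4, seed=42)
train_it = zip(image_it, mask_it)  # yields (image_batch, mask_batch) pairs
# model.fit(train_it, steps_per_epoch=len(x_train) // 4, epochs=50, ...)
```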
I am sharing a couple of results below.

Model training with augmented data, LR of 10^-4, Batch size of 4, and random weight initialization method.
Model training with augmented data, LR of 10^-4, Batch size of 4, and he_normal weight initialization method.

A few predictions on the test set along with the available ground truth are given below.

There are a lot more things I wanted to try, since there is a lot of room for improvement here, but couldn't because of time limitations:
1) Check performance with other initialization methods like Xavier initialization
2) Experiment with an LR that decreases systematically over epochs (see the sketch after this list)
3) Maybe try other activation functions, although ReLU usually works better
4) Explore opportunities within the network architecture
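For point 2, a decaying schedule can be plugged directly into the Keras optimizer. A minimal sketch follows; the initial rate matches the 10^-4 used above, while the decay constants are illustrative.

```python
# Sketch of a systematically decreasing LR: exponential decay fed into Adam.
from tensorflow.keras import optimizers

schedule = optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4,
    decay_steps=1000,  # steps per decay interval (illustrative)
    decay_rate=0.9)    # multiply the LR by 0.9 every interval
optimizer = optimizers.Adam(learning_rate=schedule)
```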

References
[1] Long, Jonathan, Evan Shelhamer, and Trevor Darrell. “Fully convolutional networks for semantic segmentation.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

[2] Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. “U-net: Convolutional networks for biomedical image segmentation.” International Conference on Medical image computing and computer-assisted intervention. Springer, Cham, 2015.
