Nimbus — Cloud Segmentation using Deep Learning for Agriculture.

Gulfaraz Rahman
Jan 4, 2019 · 9 min read

Nimbus is built to remove clouds and the shadows cast by these clouds in Sentinel-2 satellite images gathered from the Copernicus portal. We consider clouds as noise that needs to be removed in order to monitor agricultural land.

Problem scope in the high-level overview.

The Problem

Follow the link below to understand why we want to solve this problem,

The Dataset

Satellite image with clouds and shadows (left) / Satellite image with clouds and shadows removed (right).

The Solution

The input is a satellite image of a region covered with clouds. The difference between the input image and the annotated image (the input image with clouds removed) yields a mask, which serves as the target for our model. The mask is a binary image with a value of 0 or 1 for each corresponding pixel of the input image. For this binary segmentation task we choose the U-Net architecture [1], which is similar to the SegNet architecture but adds shortcut connections from the encoder to the decoder.
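As a rough illustration of how such a target mask could be derived, the sketch below marks every pixel where the annotated image differs from the input. It assumes both images are NumPy arrays of the same shape with the channels on the last axis; the function name `difference_mask` and the toy data are hypothetical and not taken from the project code.

```python
import numpy as np

def difference_mask(image, annotated):
    """Return a binary mask: 1 where the annotated image differs from the input."""
    # Cast to a signed type so the subtraction cannot wrap around for uint8 data.
    diff = image.astype(np.int32) - annotated.astype(np.int32)
    # A pixel belongs to the mask if any of its channels changed.
    return np.any(diff != 0, axis=-1).astype(np.uint8)

# Toy 4x4 image with 4 channels; two "cloud" pixels are blanked in the annotation.
image = np.full((4, 4, 4), 200, dtype=np.uint8)
annotated = image.copy()
annotated[0, 0] = 0
annotated[2, 3] = 0
mask = difference_mask(image, annotated)
```

A real annotation may contain small non-cloud differences (compression noise, for example), in which case a tolerance on the difference would be needed instead of an exact comparison.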

SegNet segmentation using Deep Learning.

Training Pipeline

  • Bootstrap image pairs from the dataset
  • Add invariance using data augmentation
  • Training the deep network
  • Post-processing and model evaluation

Bootstrap image pairs from the dataset

With only 21 image pairs for training, our dataset is too small for any reasonable learning. It also suffers from class imbalance: most images contain clouds, which would lead the model to believe that every image contains clouds. The training and test images differ in size, but we would like our model to work well on all image sizes. Each image is also huge, so loading a whole image into memory before processing it is unreasonably expensive. We tackle all of these issues with one technique: bootstrapping.

Benefits of bootstrapping.

Bootstrapping allows us to sample an effectively unlimited number of smaller images from the larger images, which transforms our small dataset into a very large one. One image without clouds becomes an unlimited supply of smaller cloud-free samples, alleviating the impact of class imbalance. Fixing the size of the sampled images resolves the issue of varying image sizes. Choosing a small size also avoids the memory problem of large images, allowing us to load multiple images into memory for batch training.

Bootstrapping inflates our training data points from 21 to 2,100,000 by sampling 100,000 images of size 32x32 from each of the 21 training images. Given enough samples, the class imbalance problem diminishes.
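The sampling step can be sketched roughly as follows. This is a minimal illustration, assuming each training pair is held as an (H, W, C) image array and an (H, W) mask array; `sample_patches` and its parameters are hypothetical names, and the sample count is kept tiny for the example.

```python
import numpy as np

def sample_patches(image, mask, patch_size=32, n_samples=8, rng=None):
    """Draw random (image, mask) crops of a fixed size from one large image pair."""
    rng = np.random.default_rng() if rng is None else rng
    height, width = mask.shape
    # Top-left corners are drawn uniformly so every crop lies inside the image.
    tops = rng.integers(0, height - patch_size + 1, size=n_samples)
    lefts = rng.integers(0, width - patch_size + 1, size=n_samples)
    corners = list(zip(tops, lefts))
    image_patches = np.stack([image[t:t + patch_size, l:l + patch_size] for t, l in corners])
    mask_patches = np.stack([mask[t:t + patch_size, l:l + patch_size] for t, l in corners])
    return image_patches, mask_patches

# Toy stand-in for one large training pair.
rng = np.random.default_rng(0)
image = rng.random((100, 120, 4))
mask = (rng.random((100, 120)) > 0.5).astype(np.uint8)
x, y = sample_patches(image, mask, patch_size=32, n_samples=8, rng=rng)
```

Because the crops are drawn on demand, only the small patches of a batch need to be held in memory at once, and the same function works for source images of any size.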

Add invariance using data augmentation

Data augmentation is a common technique to inflate datasets by including rotated and mirrored copies of the original data points. Bootstrapping has already resolved the size problem of the dataset, but we can still use the other benefits of data augmentation. By training the model with rotated and mirrored variants we teach it to be more robust to such changes. A rotated cat is still a cat; our problem is to mask the cat, or in our case the cloud. We train the model with combinations of 90/180/270-degree rotations and horizontal and vertical flips.
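A small sketch of these variants, assuming patches are NumPy arrays with height and width on the first two axes. Combining the four rotations with a horizontal mirror yields 8 distinct variants; the vertical flip is already among them, since it equals a 180-degree rotation followed by a horizontal mirror. The name `dihedral_variants` is hypothetical.

```python
import numpy as np

def dihedral_variants(patch):
    """Return the 8 distinct rotation/flip variants of an (H, W, ...) array."""
    variants = []
    for k in range(4):  # 0, 90, 180 and 270 degrees
        rotated = np.rot90(patch, k, axes=(0, 1))
        variants.append(rotated)
        variants.append(np.flip(rotated, axis=1))  # horizontal mirror
    return variants

patch = np.arange(4).reshape(2, 2)
variants = dihedral_variants(patch)
```

During training the same transform has to be applied to the input patch and to its target mask so that the two stay aligned.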

Training the deep network

The input images are passed through the network, which performs convolutions and transformations to produce a mask with per-pixel values between 0 and 1 (using a sigmoid). The generated mask is compared with the target mask to estimate a difference, or loss. We train and compare models using three different loss functions: Binary Cross Entropy (basic), Lovasz Hinge (Jaccard approximation) and Iglovikov (= BCE - log jaccard_approx) [2]. The non-linearity used between convolutions is ReLU and regularization is done using batch normalization. The models are trained for 100 epochs each.
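To make the BCE - log jaccard_approx formula concrete, here is a plain NumPy sketch. The soft Jaccard used below (products and sums of probabilities in place of set intersection and union) is an assumed form of the approximation, not necessarily the exact one used in training, and the Lovasz Hinge loss is left out. All function names are hypothetical.

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross entropy between predicted probabilities and a 0/1 target."""
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)))

def soft_jaccard(pred, target, eps=1e-7):
    """Differentiable Jaccard approximation computed on probabilities."""
    intersection = np.sum(pred * target)
    union = np.sum(pred) + np.sum(target) - intersection
    return float((intersection + eps) / (union + eps))

def bce_minus_log_jaccard(pred, target):
    """Combined loss: BCE - log(soft Jaccard)."""
    return bce_loss(pred, target) - float(np.log(soft_jaccard(pred, target)))

target = np.array([[1.0, 0.0], [0.0, 1.0]])
good = np.array([[0.9, 0.1], [0.1, 0.9]])
bad = np.array([[0.1, 0.9], [0.9, 0.1]])
loss_good = bce_minus_log_jaccard(good, target)
loss_bad = bce_minus_log_jaccard(bad, target)
```

Since the soft Jaccard is at most 1, the `-log` term is never negative: it adds a penalty on top of BCE that grows as the overlap between prediction and target shrinks.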

Post-processing and model evaluation

With the above systems in place, we were successful in training models which could predict segmentation masks. Prediction introduces a threshold hyper-parameter which is used to binarize the output mask. The models' raw output is an image with each pixel value in the range [0, 1], but we prefer to apply a binary mask to the input image. All pixel values below the prediction threshold are set to 0 and the rest are set to 1. Applying the binarized mask to the input image produced rough edges and insignificantly small holes in the clouds, so we apply a 5x5 Gaussian filter to smooth out the edges and fill the small gaps.
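One possible reading of this post-processing step is sketched below in plain NumPy: binarize, smooth with a 5x5 Gaussian kernel, then binarize the smoothed result again. The final re-binarization at 0.5, the kernel width `sigma=1.0` and the edge padding are assumptions made for the illustration, and the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Build a normalized size x size Gaussian kernel."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()

def smooth(values, kernel):
    """Filter a 2-D map with the kernel, padding with edge values to keep the shape."""
    pad = kernel.shape[0] // 2
    padded = np.pad(values, pad, mode="edge")
    out = np.zeros(values.shape, dtype=float)
    for dy in range(kernel.shape[0]):
        for dx in range(kernel.shape[1]):
            out += kernel[dy, dx] * padded[dy:dy + values.shape[0], dx:dx + values.shape[1]]
    return out

def postprocess(prob, threshold=0.5, sigma=1.0):
    """Binarize the raw output, smooth it, and binarize again (assumed order)."""
    binary = (prob >= threshold).astype(float)  # below threshold -> 0, rest -> 1
    smoothed = smooth(binary, gaussian_kernel(5, sigma))
    return (smoothed >= 0.5).astype(np.uint8)

# A cloud with a one-pixel hole in the middle: the hole is filled after smoothing.
prob = np.full((9, 9), 0.9)
prob[4, 4] = 0.1
cleaned = postprocess(prob)
```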

Inspiration for Test Time Augmentation (Source [2]).

At this stage, the results suffered from an unexpected effect. The predictions had visible blocks due to the splitting and merging of the large image. A Kaggle team [2] resolved these ‘local boundary effects’ by shifting the boundary pixels to the centre with a fixed offset, making a second prediction and using the mean of the two. A second prediction increases execution cost, but it resolved the block effect and also lowered variance as a side effect. Using the mean prediction smooths the confidence of the predicted output. To further reduce the variance in the output, we replicate the data augmentation method at prediction time: we predict on rotated and mirrored copies of the input and average the predictions for each pixel.
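The averaging over rotated and mirrored copies can be sketched as follows. `predict_fn` is a hypothetical stand-in for the trained model: it is assumed to take an (H, W, C) patch and return an (H, W) probability map. The offset-based second prediction for boundary effects is not shown.

```python
import numpy as np

def predict_with_tta(predict_fn, patch):
    """Average predictions over the 8 rotation/flip variants of a patch."""
    outputs = []
    for k in range(4):  # 0, 90, 180 and 270 degrees
        for flip in (False, True):
            x = np.rot90(patch, k, axes=(0, 1))
            if flip:
                x = np.flip(x, axis=1)
            y = predict_fn(x)
            # Undo the transform in reverse order so the pixels line up again.
            if flip:
                y = np.flip(y, axis=1)
            y = np.rot90(y, -k, axes=(0, 1))
            outputs.append(y)
    return np.mean(outputs, axis=0)

# Stand-in "model" that simply returns the first channel as its prediction.
rng = np.random.default_rng(0)
patch = rng.random((32, 32, 4))
averaged = predict_with_tta(lambda x: x[..., 0], patch)
```

With this stand-in model every variant maps back to the same prediction, so the average equals the first channel of the patch. A real network is not perfectly symmetric, and averaging is what smooths out those differences.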

Hyper-parameter Tuning

  1. Input Image Size — 16x16, 32x32, 64x64, 128x128
  2. Loss — BCE, Lovasz, Iglovikov
  3. Prediction Threshold — 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9
  4. Learning Rate — 0.1, 0.01, 0.001

All evaluation scores are computed on the validation set, which is data unseen by the model during training.
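The threshold search, for instance, amounts to scoring the binarized prediction against the ground truth at each candidate value. The sketch below uses the Jaccard score (intersection over union) as the metric; the toy arrays and the function names are made up for the illustration.

```python
import numpy as np

def jaccard_score(pred_mask, true_mask):
    """Intersection over union of two binary masks (1.0 if both are empty)."""
    pred_mask = pred_mask.astype(bool)
    true_mask = true_mask.astype(bool)
    union = np.logical_or(pred_mask, true_mask).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(pred_mask, true_mask).sum() / union)

def sweep_thresholds(prob, true_mask, thresholds):
    """Score the binarized prediction at every candidate threshold."""
    return {t: jaccard_score(prob >= t, true_mask) for t in thresholds}

thresholds = [round(0.1 * i, 1) for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
true_mask = np.array([[1, 1, 0, 0],
                      [1, 1, 0, 0]])
prob = np.array([[0.95, 0.85, 0.35, 0.05],
                 [0.75, 0.65, 0.25, 0.15]])
scores = sweep_thresholds(prob, true_mask, thresholds)
best = max(scores, key=scores.get)
```

Low thresholds let background pixels into the mask and high thresholds drop true cloud pixels, so the score typically peaks somewhere in between.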

Influence of Input Image Size at different thresholds for models trained with the different loss functions on Jaccard score.

From the above graphs we observe that Input Image Size 32x32 outperforms the other sizes on average (consistently on Lovasz Loss). The curves peak at prediction threshold values 0.5 or 0.6.


To take a closer look at the influence of the loss function, we fixed the input size to 32x32 and observed the model performance. The learning rate was set to 0.01 with Adam as the optimization algorithm. The hyper-parameters were chosen to maximize robustness and thereby boost generalization of the model.

Putting together the components described above:

A high-level processing pipeline.

Results

╔════════════════╦══════════╦══════════╦═══════════╗
║                ║   BCE    ║  Lovasz  ║ Iglovikov ║
╠════════════════╬══════════╬══════════╬═══════════╣
║ Test Score     ║  0.9786  ║  0.9786  ║  0.9734   ║
║ Threshold      ║   0.5    ║   0.6    ║   0.5     ║
╚════════════════╩══════════╩══════════╩═══════════╝

Not all settings performed well but there were a few with high test scores, allowing us a few options to choose from.

We visually inspect an example from the model trained on BCE loss:

Input (Top Left), Ground Truth (Top Right), Masked Input (Bottom Left) and Predicted Mask (Bottom Right) Images.

Here is another example prediction, visualized for different values of the prediction threshold:

Ground Truth and Predicted Masks for different threshold values.

We were able to improve results for specific cases but in general our original choice of 0.5 generalized best.

We attempted to trick the model with an input image with no clouds:

Input, Ground Truth and Masked Input for a satellite image with no clouds.

The model correctly predicted that there were no clouds in the input image by generating a (nearly) blank mask.

Taking a closer look at our first example we see that the model overrides some areas of the ground truth. The blobs of the mask at the centre of the image are not found in the predicted mask but in their place we find smaller chunks corresponding to the small clouds clearly visible in the input image.

Visual comparison of a) input and masked input images (left) / b) ground truth and predicted mask (Right).

There are a few such examples which indicate that the model has understood what clouds really look like and does not simply follow the provided ground truth. These differences bring down the score but at this point, we agree with our model’s prediction over the ground truth — at least for this instance.

From the masked input image, we can see that the model was able to remove all clouds. Although the learning is in the right direction we believe there is scope for improvement in terms of smoothness and generalization. Additional fine-tuning and parameter testing can further improve the model performance. More importantly, we need a more robust evaluation method than the Jaccard score and manual visual inspection — this is a common problem in all image comparison tasks.

Without any modifications to this solution, we trained a model to detect plots in satellite images. The details of that experiment are described in,

Future Work

  • As with all Deep Learning solutions, the more (and the more diverse) data the model learns from, the better.
  • We trained on images captured under different weather conditions throughout the year. Training with images from more diverse conditions (in day and night) and from other regions will help build a more robust solution for cloud segmentation.
  • The input data used only includes the RGB-NIR channels. Newer satellite images have additional image channels which may capture information useful for making better predictions.
  • We did not use traditional methods [4] from the field of geo-information. Using indexes such as NDVI and EVI may help achieve faster convergence.
  • The images used were at 10m/pixel resolution. There are images of higher resolution which provide finer details and are ideal for Plot Detection.
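As an illustration of the index idea, NDVI is computed per pixel as (NIR - Red) / (NIR + Red) and could be appended to the input as an extra channel. The sketch below assumes an (H, W, 4) array with channels ordered R, G, B, NIR; that ordering and the function names are assumptions made for the example.

```python
import numpy as np

def ndvi(red, nir, eps=1e-8):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    red = red.astype(float)
    nir = nir.astype(float)
    # eps guards against division by zero on completely dark pixels.
    return (nir - red) / (nir + red + eps)

def append_ndvi(image, red_index=0, nir_index=3):
    """Stack NDVI as an extra channel on an (H, W, C) image."""
    index = ndvi(image[..., red_index], image[..., nir_index])
    return np.concatenate([image.astype(float), index[..., None]], axis=-1)

# Toy 2x2 image with red reflectance 0.2 and NIR reflectance 0.6 everywhere.
image = np.zeros((2, 2, 4))
image[..., 0] = 0.2
image[..., 3] = 0.6
with_index = append_ndvi(image)
```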

References

  1. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention (MICCAI). 2015.
  2. Dstl Satellite Imagery Competition, 3rd Place Winners’ Interview: Vladimir & Sergey (http://blog.kaggle.com/2017/05/09/dstl-satellite-imagery-competition-3rd-place-winners-interview-vladimir-sergey/)
  3. Bergstra J, Bengio Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research. 2012;13(Feb):281–305.
  4. Remote Sensing Indices (https://www.indexdatabase.de/db/i.php)

Thank you to Prof. Zeynep Akata (University of Amsterdam) and Gerbert Roerink (Wageningen University and Research) for enabling the project.

Just A.I.

Simple and Reliable Solutions for Scalable Businesses
