The power of Deep Learning for EO: small sample size, big results

Thomas James
Wegaw
Dec 16, 2020

Deep Learning has rightfully claimed its spot at the top of the Machine Learning toolkit, and is frequently used to extract information from many different types of remotely sensed imagery. Deep artificial neural networks are being used to detect and map geographical features such as land cover, landslides, avalanches and waterbodies.

In this post, we will walk through a simple step-by-step example of how snow detection can be achieved with optical true-colour imagery and a (very!) small training dataset of just 46 samples.

Disclaimer: At WeGaw, we’re applying Deep Learning methods to earth observation imagery to detect snow and its properties in order to help hydropower companies generate power more sustainably. This post aims to provide a small snapshot of how to harness this technology. It is an example of the capabilities which Deep Learning provides, and not a technology we currently use.

The Sentinel-2 satellite constellation delivers full global multispectral coverage roughly every six days at 10 m resolution. WeGaw's current optical workflows use the Normalised Difference Snow Index (NDSI) for snow detection. The NDSI exploits the contrast between snow's high reflectance in visible light and its low reflectance in shortwave infrared light.

NDSI formula used for Sentinel-2 imagery
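For Sentinel-2 imagery the index is typically computed from the green band (B3) and the shortwave infrared band (B11):

NDSI = (B3 - B11) / (B3 + B11)

Pixels whose NDSI exceeds a chosen threshold (0.4 is a commonly used value) are then classified as snow.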

This method is inherently limited by the fact that it is a calculation performed on each pixel independently, and fails to acknowledge the broader context of the scene. Modern computer vision methods are quickly filling this void.

Background

Deep Learning architectures are complex, diverse and designed to suit their purpose. In this post, we will break down the most critical aspects of DeepLabV3, a convolutional neural network designed explicitly for image segmentation.

Different types of computer vision tasks

Artificial Neural Networks

Artificial neural networks (ANNs) attempt to approximate the properties of the human brain, and are optimised through two steps: the 'feed-forward' pass and 'backpropagation'.

Feed-forward
A feed-forward neural network consists of layers of connected neurons: an input layer, hidden layers and an output layer. Each connection is assigned a weight (w), and each neuron a bias (b). The weighted sum of a neuron's inputs, plus the bias, is passed through an activation function. The network's final output is compared to a target output, and from this deviation a 'loss' is calculated with a loss function.

The structure of an individual artificial neuron
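As an illustration, here is a minimal NumPy sketch of a single neuron's forward pass; the input values, weights and sigmoid activation are arbitrary choices for the example.

import numpy as np

def sigmoid(x):
    # Activation function squashing the result into the range (0, 1)
    return 1 / (1 + np.exp(-x))

# Arbitrary example inputs, weights (w) and bias (b)
inputs = np.array([0.5, 0.1, 0.9])
weights = np.array([0.4, -0.2, 0.7])
bias = 0.1

# Weighted sum of the inputs plus the bias, passed through the activation
output = sigmoid(np.dot(inputs, weights) + bias)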

Backpropagation
The gradient is the vector of partial derivatives that indicates the direction in which a function most quickly increases. Gradient descent iteratively minimises a function by repeatedly stepping in the opposite direction, like a ball rolling down a hill to reach the lowest point.

The concept of gradient descent is fundamental to the optimisation of machine learning algorithms. A very good explanation can be found here: https://www.youtube.com/watch?v=sDv4f4s2SB8

The gradients of the loss with respect to the network's weights are calculated, and the weights at each node are updated accordingly. Repeating this process allows the ANN to iteratively derive parameters that optimally represent the dataset.
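A minimal sketch of a single-weight gradient-descent loop, assuming a squared-error loss on one training example with made-up values:

# Arbitrary example: one weight, one input, one target value
learning_rate = 0.1
w, x, target = 0.5, 2.0, 3.0

for step in range(100):
    prediction = w * x                          # feed-forward
    loss = (prediction - target) ** 2           # squared-error loss
    gradient = 2 * (prediction - target) * x    # d(loss)/d(w)
    w -= learning_rate * gradient               # step against the gradient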

Convolutional Layer

A convolution kernel is a 2D feature-detector matrix that moves across the input matrix to generate stacks of feature maps. Feature maps act as filters that draw out the key characteristics within the image; a simple example is the detection of vertical and horizontal lines, which enables the convolutional backbone to encode textural features. Below is the result of applying a simple 3 x 3 edge-detection filter, which extracts the fur and outline from the face of a labrador.

Result of a 3 x 3 Edge detection filter
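A sketch of how such a filter could be applied with SciPy; the kernel is a Laplacian-style edge detector, and the image file name is just a placeholder.

import numpy as np
from scipy import ndimage
from skimage import io, color

# A simple 3 x 3 edge-detection (Laplacian-style) kernel
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])

# Load the image, convert it to greyscale and convolve it with the kernel
image = color.rgb2gray(io.imread('labrador.jpg'))
edges = ndimage.convolve(image, kernel)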

Stacks of feature maps are generated within the convolutional backbone. Convolutional kernels 'convolve' across the entire 2D image to create stacks of feature maps, each representing a characteristic; this could be anything from vertical edges to colours and textures.

Example of a convolutional feature map generated from a feature detector

Atrous Spatial Pyramidal Pooling (ASPP)

Computer vision tasks often require convolutions that extract and aggregate features across multiple scales. For example, if the task is to segment a dog, the CNN could learn what the texture of fur looks like (as shown with the edge-detection filter), or which features are common to dogs, like ears, eyes and whiskers. This is where the scale and depth of convolutional filters become essential. In theory, we could convolve at all scales and kernel sizes and aggregate a vast number of feature maps; this, however, would come at an impractical computational cost. A compromise has been reached with 'dilated' convolutions, also known as 'atrous' ('with holes') convolutions. By spacing out the elements of the filter, it is possible to enlarge its field of view and generate feature maps at varying scales without increasing the computational cost. The spacing of these 'holes' is known as the dilation rate.
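In PyTorch, for example, a dilated convolution is simply a standard convolution with a dilation argument; the sketch below shows three 3 x 3 kernels whose fields of view grow with the dilation rate while the number of parameters stays the same.

import torch
import torch.nn as nn

# Three 3 x 3 convolutions with dilation rates 1, 2 and 4: same kernel size
# and cost, but fields of view of 3, 5 and 9 pixels respectively
conv_d1 = nn.Conv2d(3, 16, kernel_size=3, dilation=1, padding=1)
conv_d2 = nn.Conv2d(3, 16, kernel_size=3, dilation=2, padding=2)
conv_d4 = nn.Conv2d(3, 16, kernel_size=3, dilation=4, padding=4)

x = torch.randn(1, 3, 244, 244)   # a dummy 3-band 244 x 244 input
feature_maps = [conv_d1(x), conv_d2(x), conv_d4(x)]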

Choosing the Model

For this task, the 'DeepLabV3' model, a CNN with a naive decoder, was chosen. Its architecture exploits the use of ASPP.

Depiction of the DeepLabV3 Architecture
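A pre-built implementation of DeepLabV3 is available in torchvision. The sketch below shows one way a model with a single output channel (snow / no snow) could be instantiated; it is not necessarily the exact setup used to train the model in this post.

import torch
from torchvision.models.segmentation import deeplabv3_resnet101

# DeepLabV3 with a ResNet-101 backbone and a single output channel
model = deeplabv3_resnet101(num_classes=1)
model.eval()

# The model returns a dict; the segmentation map lives under the 'out' key
dummy_input = torch.randn(1, 3, 244, 244)
output = model(dummy_input)['out']   # shape: (1, 1, 244, 244)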

Dataset Preparation

The quality and volume of the dataset and labels are nearly always the limiting factor in deep learning performance. While the sophistication and performance of CNN architectures have increased in recent years, practitioners often stick to simpler, well-established networks for earth observation tasks, focusing more of their effort on curating large, high-quality datasets. In the case of building a snow cover dataset, the snow pixels were identified using Photoshop tools.

Labelling snow pixels with Photoshop's magic wand tool

This highlights one of the biggest hurdles facing the progress of deep learning for EO tasks: what methods can we use to label EO data, and how accurate will they be? This problem is discussed in greater depth by Lex Fridman and Jitendra Malik in this podcast: https://www.youtube.com/watch?v=LRYkH-fAVGE&t=2688s

The Result:

The model was tested on a section of imagery that had not been shown to the model during the training phase. Here is the result:

snow prediction vs snow ground truth

Performance Evaluation

The similarity between the segmentation prediction and the 'ground truth' indicates the quality of the prediction. The Intersection over Union (IoU) score computes the ratio between the intersection and the union of the prediction and the ground truth, returning a value between 0 and 1. A value of 1 indicates a segmentation result that perfectly matches the ground truth.
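A minimal NumPy sketch of how an IoU score could be computed from two binary masks:

import numpy as np

def iou_score(prediction, ground_truth):
    # Intersection over Union for two boolean masks of the same shape
    intersection = np.logical_and(prediction, ground_truth).sum()
    union = np.logical_or(prediction, ground_truth).sum()
    return intersection / union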

IoU Score = 0.81

By overlaying the prediction and the ground truth, we can see that a large number of false negatives occurred where smaller patches of snow exist. This can most likely be attributed to the use of a tiny dataset: the CNN learned to represent the high-level features but was much less successful at detecting the subtle detail within the smaller, more trivial patches of snow. CNNs designed for segmentation are also known to generalise over complex features within images. There is potential to reduce this phenomenon by adjusting the dilation rates of the atrous convolution kernels to better suit the clustered nature of the snow. Alternatively, we could experiment with entirely different network architectures, perhaps an encoder-decoder network such as U-Net or DeepLabV3+; we could go even further and attempt to produce sharper results by adding more skip connections between the encoder and the decoder.

Practical Usage

Whether training, testing or deploying the model, the input images need to be in the form of 3 x 244 x 244 tensors. It's important to remember that we are dealing with geospatial data, so we need to extract both the metadata and the pixel array, which can be done easily with the Rasterio module.

The CNN will return 244 x 244 arrays of values that indicate probabilities between 0 and 1. We can inspect a histogram of the distribution of these probabilities and threshold them to generate a binary snow mask. The 244 x 244 snow masks can then be rebuilt into the original full-size image. Note that when writing the binary mask to a GeoTIFF, the dimensions of the image must match those stated in the metadata. This can be achieved by resizing the original array so that its width and height are exact multiples of 244, then resizing the rebuilt snow mask back to the original image dimensions.

Here is some example Python code that could be adapted to achieve this.

import rasterio
import torch
from skimage.transform import resize
from cnn_processing_toolkit import split_image, rebuild_image

# Load the trained model
model = torch.load('your_trained_model_path.pt')
model.eval()

# Extract metadata and array
with rasterio.open('your_geotiff_path.tif') as src:
    meta = src.meta
    array = src.read()  # shape: (bands, height, width)

# Resize the array so that height and width are exact multiples of 244
array_height = (array.shape[1] // 244) * 244
array_width = (array.shape[2] // 244) * 244
resized_array = resize(array.transpose(1, 2, 0),
                       (array_height, array_width, 3),
                       preserve_range=True)

# Splice the input array into 244 x 244 tiles
samples = split_image(dims=244, input_im=resized_array)

# Generate predictions and threshold them into binary masks
threshold, prediction_masks = 0.2, []
for sample in samples:
    tensor = (torch.from_numpy(sample)
              .permute(2, 0, 1)     # HWC -> CHW
              .unsqueeze(0)         # add a batch dimension
              .type(torch.FloatTensor) / 255)
    output = model(tensor)['out'].cpu().detach().numpy()[0][0]
    prediction_masks.append(output < threshold)

# Rebuild the tiles and resize the mask back to the original dimensions
prediction_array = rebuild_image(prediction_masks)
prediction_array = resize(prediction_array.astype(float),
                          (array.shape[1], array.shape[2]),
                          order=0, preserve_range=True).astype('uint8')

# Export the single-band binary snow mask to GeoTIFF
with rasterio.open('your_output.tif',
                   mode='w',
                   driver=meta['driver'],
                   dtype='uint8',
                   width=meta['width'],
                   height=meta['height'],
                   crs=meta['crs'],
                   transform=meta['transform'],
                   count=1,
                   nodata=0) as dst:
    dst.write(prediction_array, 1)

Conclusion

Achieving an IoU score of 0.81 with only 46 training samples demonstrates how powerful deep learning tools can be for EO tasks. While the mapping of snow in optical imagery is not, in itself, a problem that demands attention, it's easy to see how these methods could be useful in other ways. For example, separating cold clouds from snow, mapping snow under forest canopies and mapping snow cover extent with PolSAR decompositions are all challenges that we at WeGaw are tackling with the help of deep learning.

Get in touch

Please feel free to make comments, suggestions and criticisms about this post. Get in touch through LinkedIn/email
