Segmentation of Clouds in Satellite Images Using Deep Learning

Jingxi Li
Published in The Startup · 8 min read · Dec 10, 2020

As a UCLA AOS 204 Final Project Report

Introduction

Detection of clouds in satellite images has become an important issue in the analysis and utilization of such images. On one hand, clouds may occlude objects on the land and cause great difficulty for many remote sensing applications, including change detection, geophysical parameter retrieval, and object tracking. On the other hand, the clouds themselves may contain useful information related to climate and to natural disasters such as hurricanes. Many solutions have been proposed for cloud detection, including a number of handcrafted approaches and threshold-based algorithms, and they have demonstrated success on relatively simple images. However, it remains extremely challenging to achieve satisfactory segmentation results on more complicated images, for example, when fragmented clouds of varying thickness are mixed with a complex geomorphological background. These traditional approaches may also run into problems when the available spectral information is limited, for example, when the images consist of fewer than five wavelength bands.

In this project, I present a deep learning-based algorithm that identifies clouds and performs segmentation on multispectral satellite images, from which the post-processing and downstream applications of these images can benefit significantly. To learn the one-to-one mapping between the spectral and spatial information of the input images and their semantic segmentation ground truth, a fully convolutional neural network named U-Net [1] is employed to interpret and extract the information embedded in the satellite images in a multi-channel fashion, and to output a pixel-wise mask indicating the presence of clouds. The U-Net architecture has demonstrated great success in a variety of applications, for example, medical image segmentation [1], [2], image super-resolution [3], and imaging modality translation [4]. Here, I trained the original version of U-Net and one of its most important variants, the attention gate-based U-Net [5], to perform this cloud segmentation task, and a quantitative comparison of their results on the testing dataset is presented to evaluate their performance. The results, produced without any fine-tuning, also come close to the performance of the network trained and fine-tuned by the authors who generated this dataset [6].

Results and Discussion

Figure 1 shows one of the cloud segmentation predictions produced by the attention U-Net. In this sample, the mottled clouds and the intricate landscape are thoroughly mixed together in the input image, yet are clearly classified and segmented by the neural network. The segmentation map matches the labelled ground truth very well and preserves the majority of the salient features, except that certain small features scattered around the salient ones are not successfully identified, as shown in the second row of Fig. 1.

Figure 1. An example of the segmentation results. (Top row) A sample taken from the blind testing dataset showing a large field-of-view (FOV) multispectral satellite image, the cloud segmentation result inferred by the trained network, and the labelled ground truth. White indicates the presence of clouds and black represents the non-cloud regions. (Bottom row) A zoom-in of the region of interest (ROI) marked with a red dashed-line box in the top-row images.

Furthermore, a quantitative analysis was performed on these results. Several standard statistical metrics were selected for the evaluation: precision, recall, specificity, Jaccard score, and accuracy. These metrics were calculated automatically over the whole blind testing set for the two models and are presented in Table 1. Comparing the two rows, the attention gate-based U-Net takes the lead in almost all metrics. This leads to a clear conclusion: the attention U-Net, which achieved an accuracy of 96.34% over the whole testing set, holds a significant advantage over the standard U-Net and comes close to the fine-tuned U-Net [6] trained by the authors of the dataset. A potential direction for improving the results is fine-tuning the training hyperparameters and exploring loss functions that build a better mapping between the outputs and the ground truth.

Table 1. Summary of the quantitative evaluation results.
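For reference, all five metrics can be derived from the pixel-wise confusion counts between the binarized prediction and the ground truth mask. Below is a minimal NumPy sketch of such an evaluation (a hypothetical helper, not the exact evaluation code used in this project):

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Compute pixel-wise metrics for binary masks (1 = cloud, 0 = non-cloud)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)      # cloud pixels correctly detected
    tn = np.sum(~pred & ~gt)    # non-cloud pixels correctly rejected
    fp = np.sum(pred & ~gt)     # false alarms
    fn = np.sum(~pred & gt)     # missed cloud pixels
    return {
        "precision":   tp / (tp + fp),
        "recall":      tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "jaccard":     tp / (tp + fp + fn),
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
    }
```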

In summary, this deep learning-based approach has proven successful on the cloud segmentation problem, delivering solid performance. The precise pixel-wise recognition and segmentation of cloud layers in satellite images was performed successfully, demonstrating the efficacy of deep learning and convolutional neural networks.

Methods

Data Preparation. The data used in this project was obtained from the Kaggle dataset titled “38-Cloud: Cloud Segmentation in Satellite Images” [7], which contains 38 Landsat 8 scenes and their manually extracted pixel-level ground truths for cloud detection. The full scene images have been cropped into multiple 384 × 384 patches so that they can be conveniently fed into the algorithms. In total, there are 8,400 patches for training and 9,201 patches for testing. Each input patch has 4 spectral channels, corresponding to Red (band 4), Green (band 3), Blue (band 2), and Near Infrared (band 5) of the Landsat 8 sensing data, each containing integer values ranging from 0 to 255 (i.e., a bit depth of 8). The corresponding ground truth is a binarized mask image containing only 1 (cloud) or 0 (non-cloud) as its pixel intensities. The sizes of the training set and testing set are 5.45 GB and 6.01 GB, respectively. The first 1,000 images in the training set were held out as a validation set, so the actual number of images used for training is 7,400. After running the trained network on the testing dataset, a stitching process was applied to the predicted patches to form large field-of-view images like the one shown in the first row of Fig. 1.
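To make the input format concrete, the per-patch loading step could look like the sketch below. The folder and file naming pattern is an assumption based on the dataset layout, and the simple division by 255 mirrors the 8-bit value range described above; the actual pipeline may differ:

```python
import numpy as np
from PIL import Image

def load_patch(patch_id, root="38-Cloud/train"):
    """Stack the four spectral bands of one 384x384 patch into a (384, 384, 4)
    float array normalized to [0, 1]. The folder/file naming below is an
    assumption and may need adjusting for the actual dataset layout."""
    bands = []
    for band in ("red", "green", "blue", "nir"):
        path = f"{root}_{band}/{band}_{patch_id}.TIF"
        bands.append(np.array(Image.open(path), dtype=np.float32) / 255.0)
    x = np.stack(bands, axis=-1)                          # shape: (384, 384, 4)
    gt = np.array(Image.open(f"{root}_gt/gt_{patch_id}.TIF")) > 0
    return x, gt.astype(np.float32)                       # binary mask: 1 = cloud
```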

Network implementation. A standard U-Net and an attention gate-based U-Net (an encoder-decoder with skip connections and attention gates) were implemented to learn the transformation from the input multispectral satellite images to the segmentation maps. The networks were adapted to match the input tensor with 4 channels (corresponding to the 4 spectral bands) with its labelled segmentation ground truth. The standard U-Net architecture, shown in the top panel of Fig. 2, consists of a downsampling path and a symmetric upsampling path. The downsampling path contains four convolution and downsampling blocks, or levels. Each block consists of (1) three successive 3×3 convolutional layers with batch normalization and leaky rectified linear units (leaky ReLU) in between, to extract and encode spatial features, and (2) one 2×2 average pooling layer with a stride of 2×2 to perform 2x downsampling; a residual connection also links the first and last tensors in each block, as sketched below. In the upsampling path, there are five corresponding convolution and upsampling blocks. The input to each block is a channel-wise concatenation of the output tensor of the previous block in the upsampling path and the attention-gated output tensor at the corresponding level of the downsampling path, which creates skip connections between the two paths. It is worth noting that, to suppress irrelevant spatial information propagated through the plain skip connections of the standard U-Net, the attention U-Net inserts a soft attention gate block [5] into each skip connection, as shown in the bottom panel of Fig. 2; the gate uses a few convolutional layers and a sigmoid operation to calculate an activation weighting map, such that the feature maps from the downsampling encoder path are weighted pixel-wise (multiplicatively) before being propagated to the upsampling decoder path. The structure of an upsampling block is similar to that of a downsampling block, except that (1) the pooling layers are replaced by 2x bilinear upsampling layers and (2) there are no residual connections.
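To make the encoder block concrete, here is a minimal sketch of one downsampling level, written against the same tf.layers API mentioned under Training details. The 1×1 projection used to align channels for the residual add is an assumption, since the exact wiring is not specified above:

```python
import tensorflow as tf  # TensorFlow 1.x

def down_block(x, filters, is_training):
    """One encoder level: three 3x3 conv layers (each followed by batch norm
    and leaky ReLU) with a residual connection from the block input to its
    output, then 2x2 average pooling for the 2x downsampling."""
    shortcut = tf.layers.conv2d(x, filters, 1)    # 1x1 projection (assumed)
    h = x
    for _ in range(3):
        h = tf.layers.conv2d(h, filters, 3, padding="same")
        h = tf.layers.batch_normalization(h, training=is_training)
        h = tf.nn.leaky_relu(h)
    h = h + shortcut                              # residual connection
    pooled = tf.layers.average_pooling2d(h, pool_size=2, strides=2)
    return pooled, h                              # h feeds the skip connection
```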

Figure 2. (Top) Schematic of the structure of the attention gate-based U-Net. (Bottom) Block flow chart of the attention gate part.
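Following the bottom panel of Fig. 2, the soft attention gate on a skip connection can be sketched as below. For simplicity, this assumes the gating signal has already been resized to the spatial dimensions of the encoder features, and the intermediate channel count is a free parameter:

```python
import tensorflow as tf  # TensorFlow 1.x

def attention_gate(x, g, inter_channels):
    """Soft attention gate on a skip connection [5].
    x: encoder feature map; g: decoder gating signal (same spatial size here)."""
    theta_x = tf.layers.conv2d(x, inter_channels, 1)  # 1x1 conv on skip features
    phi_g   = tf.layers.conv2d(g, inter_channels, 1)  # 1x1 conv on gating signal
    f       = tf.nn.relu(theta_x + phi_g)             # additive attention
    psi     = tf.layers.conv2d(f, 1, 1)               # collapse to one channel
    alpha   = tf.nn.sigmoid(psi)                      # pixel-wise weights in (0, 1)
    return x * alpha                                  # re-weight encoder features
```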

Loss function. In this project I used the simplest loss function for image reconstruction/segmentation problems, the mean absolute error (MAE) loss, defined as the average of the pixel-wise absolute differences between the predicted values and the ground truth values over the entire image and batch. A more sophisticated loss function, such as the Jaccard loss or Huber loss, could be considered to further improve the current results. The evolution of this loss during training is plotted in Fig. 3.
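In TensorFlow 1.x this loss is essentially a one-liner; pred and gt below are placeholder tensor names:

```python
import tensorflow as tf  # TensorFlow 1.x

# pred and gt: float tensors of shape (batch, 384, 384, 1)
mae_loss = tf.reduce_mean(tf.abs(pred - gt))
```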

Figure 3. The learning curve showing the variation of the training and validation losses over the number of iterations.

Training details. The convolutional networks presented in this project were implemented on the Google TensorFlow (v1.15) platform, an end-to-end open-source platform for machine learning, using a series of built-in functions such as tf.layers, tf.nn, tf.losses, etc. An Adam [8] optimizer with default hyperparameters was used to compute the gradient updates for the network weights. A Windows PC equipped with a GTX 1080 Ti graphics processing unit (GPU), an Intel Core i5 9400F central processing unit (CPU), and 16 GB of RAM was used to train the models. The learning rate was set to 0.0001 and a batch size of 12 was used. The networks were trained for 200,000 iterations, which is around 25 epochs. The typical training time of such a U-Net model is ~12 hours.
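For illustration, the corresponding optimizer setup can be sketched as below; the batch-normalization update dependency is required in TF 1.x graph mode, while the data feeder and placeholder names are hypothetical:

```python
import tensorflow as tf  # TensorFlow 1.x

optimizer = tf.train.AdamOptimizer(learning_rate=1e-4)
# Batch norm statistics are updated through UPDATE_OPS in TF 1.x graph mode.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(mae_loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(200000):                      # ~25 epochs at batch size 12
        x_batch, y_batch = next_training_batch()    # hypothetical data feeder
        sess.run(train_op, feed_dict={inputs: x_batch, labels: y_batch})
```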

References

[1] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” in Medical Image Computing and Computer-Assisted Intervention — MICCAI 2015, Cham, 2015, pp. 234–241, doi: 10.1007/978-3-319-24574-4_28.

[2] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation,” arXiv:1606.06650 [cs], Jun. 2016, Accessed: Jul. 03, 2019. [Online]. Available: http://arxiv.org/abs/1606.06650.

[3] H. Wang et al., “Deep learning enables cross-modality super-resolution in fluorescence microscopy,” Nature Methods, vol. 16, no. 1, p. 103, Jan. 2019, doi: 10.1038/s41592-018-0239-0.

[4] Y. Rivenson et al., “Virtual histological staining of unlabelled tissue-autofluorescence images via deep learning,” Nature Biomedical Engineering, p. 1, Mar. 2019, doi: 10.1038/s41551-019-0362-y.

[5] O. Oktay et al., “Attention U-Net: Learning Where to Look for the Pancreas,” arXiv:1804.03999 [cs], May 2018, Accessed: Mar. 15, 2020. [Online]. Available: http://arxiv.org/abs/1804.03999.

[6] S. Mohajerani and P. Saeedi, “Cloud-Net: An end-to-end Cloud Detection Algorithm for Landsat 8 Imagery,” arXiv:1901.10077 [cs], Jan. 2019, Accessed: Nov. 05, 2020. [Online]. Available: http://arxiv.org/abs/1901.10077.

[7] “38-Cloud: Cloud Segmentation in Satellite Images.” https://kaggle.com/sorour/38cloud-cloud-segmentation-in-satellite-images (accessed Nov. 05, 2020).

[8] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” arXiv:1412.6980 [cs], Dec. 2014, Accessed: Apr. 16, 2019. [Online]. Available: http://arxiv.org/abs/1412.6980.
