Mount Taranaki cuts through the clouds. Sentinel-2 image from 2017-12-15 via Sentinel Hub with a cloud mask produced with the Sentinel Hub Cloud Detector at three different zoom levels.

Improving Cloud Detection with Machine Learning

Anze Zupanc

Published in

Planet Stories

12 min read · Dec 19, 2017

This is a story about clouds. Real clouds in the sky. Sometimes you love them and sometimes you wish they weren’t there. If you work in remote sensing with satellite images you probably hate them. You want them gone. Always and completely. Unfortunately, the existing cloud masking algorithms that work with Sentinel-2 images are either too complicated, too expensive to run, or simply don’t produce satisfactory results. They too often fail to identify clouds and/or identify clear sky over land as clouds.

The lack of a good cloud masking algorithm is also a problem for us at Sentinel Hub, so we’ve decided to tackle this challenge with our research team. We hope that we’ve managed to develop a fast machine learning-based algorithm that detects clouds in real time and gives state-of-the-art results among single-scene algorithms. We wish to share our cloud detector, and how we’ve developed it, with you.

Sentinel-2 image of Betsiboka Estuary recorded on 2017-12-15 via Sentinel Hub without and with a cloud mask produced with the Sentinel Hub Cloud Detector. The algorithm correctly identifies regions of thin and transparent clouds.

Everything else from here on gives a more detailed description of the problem and our solution, which you may skip if you’re not interested in technicalities. But make sure not to miss the validation results and comparison with other algorithms on the market at the very end. And of course, all the pretty pictures.

Cloud masking of Sentinel-2 images in real time using Sentinel Hub services and Sentinel Hub Cloud Detector.

Cloud detection algorithms in a nutshell

Cloud detection is the most crucial step in the pre-processing of optical satellite images. Failure to mask out clouds will have a significant negative impact on any subsequent analysis, such as change detection, land-use classification, or crop monitoring in agriculture. Multiple algorithms are currently in use to identify pixels contaminated by clouds. Here we list only a few, which are to our knowledge the most widely used and best known. They can be roughly divided into two groups, depending on whether they use a single acquisition or multiple acquisitions recorded by the satellite on different dates:

Fmask (paper): According to this very recent study the Fmask algorithm provides the best overall accuracy among many algorithms tested on 96 Landsat 8 scenes. Fmask belongs to the single-scene-algorithm group.
Sen2Cor: This is the algorithm currently used by the European Space Agency for atmospheric correction of Sentinel-2 images, which, among other things, also provides scene classification maps and quality indicators for cloud probabilities. Sen2Cor’s cloud detector belongs to the single-scene-algorithm group. If you’re using Sentinel-2 imagery, you’re probably familiar with Sen2Cor’s cloud masks.
MAJA: This algorithm performs atmospheric correction and cloud detection in Sentinel-2 images using time series, which can help to avoid over-detection of clouds by utilizing the correlation of the pixel neighborhood with previous images. It is very unlikely that two different clouds with the same shape appear at the same location on successive dates. MAJA belongs to the multi-temporal-algorithm group and, as the results presented below show, gives the best results. However, it is not easy to use and is expensive to run (as pointed out here). If you’re very lucky, your region of interest is already processed and freely available to download.

Under the hood, all three algorithms implement a set of static or dynamic thresholds for various Sentinel-2 spectral bands and apply them to detect clouds. We opt for a different approach: machine learning.

Sentinel-2 image of snow covered Etna recorded on 2017-12-14 via Sentinel Hub without and with the Sentinel Hub Cloud Detector.

Sentinel Hub’s pixel-based cloud detector

Our aim is to develop a single-scene cloud detection algorithm that works on a global scale and relies heavily on machine learning techniques. We’ve started with the simplest approach, so-called pixel-based classification, where we assign each image pixel a probability of being covered by cloud based solely on the satellite’s spectral response for that pixel. The calculation of a pixel’s cloud probability therefore doesn’t depend on its neighborhood. We only take the broader context (the pixel’s neighborhood) into account when we construct the cloud mask of a given scene from its cloud probability map, by smoothing the map with a convolution and applying morphological operations such as dilation. We believe that machine learning algorithms that can take context into account, such as semantic segmentation using convolutional neural networks (see this preprint for an overview), will ultimately achieve the best results among single-scene cloud detection algorithms. However, due to the high computational complexity of such models and the related costs, they are not yet used in large-scale production. We’ve decided to first develop the best possible pixel-based cloud detector, see how it compares to other existing algorithms, and, last but not least, learn in the process. In the future, we will also use deep learning techniques to detect clouds.
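To make the idea of pixel-based classification concrete, here is a minimal sketch. The function toy_pixel_classifier is a hypothetical stand-in (a logistic function of mean brightness), not the trained detector described in this post; the point is only that every pixel is scored independently of its neighbors.

```python
import numpy as np

def toy_pixel_classifier(pixels: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a trained model.

    Maps (n_pixels, n_bands) reflectances to one cloud probability per
    pixel using a logistic function of the mean reflectance.
    """
    brightness = pixels.mean(axis=1)
    return 1.0 / (1.0 + np.exp(-20.0 * (brightness - 0.3)))

def cloud_probability_map(image: np.ndarray, classifier) -> np.ndarray:
    """Apply a per-pixel classifier to an image of shape (height, width, n_bands)."""
    height, width, n_bands = image.shape
    pixels = image.reshape(-1, n_bands)   # each pixel becomes one independent sample
    probabilities = classifier(pixels)    # shape (height * width,)
    return probabilities.reshape(height, width)

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 0.2, size=(4, 5, 10))  # dark "clear" scene with 10 bands
image[1, 2, :] = 0.9                            # one bright "cloudy" pixel
prob = cloud_probability_map(image, toy_pixel_classifier)

# Changing a neighbor does not change the probability of pixel (1, 2).
image_changed = image.copy()
image_changed[1, 3, :] = 0.9
prob_changed = cloud_probability_map(image_changed, toy_pixel_classifier)
```

Because the classifier only ever sees one row of band values at a time, any real per-pixel model with a predict-probability style interface could be swapped in for the toy function.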

Sentinel-2 image of Sydney recorded on 2017-12-06 via Sentinel Hub without and with the Sentinel Hub Cloud Detector.

Training and validation samples

The success of any machine learning application depends heavily not only on the quality and size of the training data but also on the usage of an appropriate validation set.

To our knowledge, there exists only one publicly available data set of manually labeled Sentinel-2 scenes. Over the last couple of years, Hollstein et al. curated a data set consisting of around 6.4 million hand labeled pixels from 108 Sentinel-2 scenes sampled roughly evenly from around the globe and throughout the year to ensure full climatic coverage.

Location of the 108 Sentinel-2 scenes used by Hollstein et al. for manual classification. The 15 scenes indicated with a yellow marker have been processed by MAJA and are available for download. We use these to validate MAJA.

Each pixel is labeled with one of six classes: clear (land), cloud, shadow, snow, cirrus, or water. The authors used this data set to train and validate different machine learning algorithms to detect clouds. They report good results using a random and independent subset of their training set.
The huge raw number of pixels in their data set is, in our opinion, misleading. The pixels are sampled from larger labeled polygons drawn by a human labeler and are thus by construction highly correlated. The Hollstein data sample is effectively much smaller, and any classifier trained on it will not generalize well. Therefore we do not use it as our training set but rather as a validation set.

The lack of large and high quality labeled data sets of Sentinel-2 data is a real problem in our opinion. It is slowing down the progress in the development, validation and comparison of cloud detection algorithms for Sentinel-2 imagery. We are actively tackling this problem in another project:

Crowdsourcing EO training datasets to improve cloud detection


Without a large, high quality labeled data set to use as our training sample, we turned to the second best option: using the best currently available cloud detector and its cloud masks as a proxy for ground truth. We chose the MAJA multi-temporal processor since it has, in our experience, a high cloud detection rate and a very low misclassification rate of non-cloudy pixels. Using a machine-labeled data set unavoidably introduces noise in the labels, which is not ideal, but on the other hand a machine-curated data set can be produced very fast and at little or no cost.

When we started this project, there were over 40,000 MAJA products available to download, each of them containing the raster cloud mask. To speed up the download process, we decided to download only the tiff files containing masks and not the entire products. In addition, we selected only products with image cover above 90% and cloud cover between 5% and 90%. In total, we have downloaded masks for 14,140 Sentinel-2 tiles, out of which 596 are geographically unique (tiles can cover the same location but are recorded at different dates). These 596 unique tiles cover 77 different countries in Europe, Asia, Africa, North and South America, and Oceania. This is the closest that we could get to a truly global data set.
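The selection rule above is simple enough to write down. The sketch below assumes a hypothetical Product record with made-up values; the real MAJA metadata fields are not described in this post, and whether the 5% and 90% bounds are inclusive is also an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Product:
    tile_id: str        # hypothetical tile identifier
    date: str           # acquisition date, ISO format
    image_cover: float  # percent of the tile with valid data
    cloud_cover: float  # percent of the tile flagged as cloud

def is_selected(p: Product) -> bool:
    """Image cover above 90% and cloud cover between 5% and 90% (bounds assumed inclusive)."""
    return p.image_cover > 90.0 and 5.0 <= p.cloud_cover <= 90.0

# Made-up example records, for illustration only.
products = [
    Product("31TCJ", "2017-06-01", 99.0, 40.0),   # kept
    Product("31TCJ", "2017-06-11", 98.0, 2.0),    # too few clouds
    Product("31TCJ", "2017-06-21", 95.0, 70.0),   # kept, same location as the first
    Product("30STF", "2017-06-03", 60.0, 30.0),   # too little image cover
    Product("30STF", "2017-06-13", 100.0, 95.0),  # too cloudy
    Product("29SNR", "2017-06-05", 92.0, 5.0),    # kept
]

selected = [p for p in products if is_selected(p)]
unique_tiles = {p.tile_id for p in selected}  # same location on different dates counts once
print(len(selected), "products from", len(unique_tiles), "unique tiles")
```

The distinction between selected products and unique tiles mirrors the 14,140 versus 596 figures: many products share a location and differ only in acquisition date.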

Distribution of 596 tile locations from which we randomly sample pixels to build our training set.

Each of the 14,140 Sentinel-2 tiles contains over 120 million pixels at 10 meter resolution, which is way too much to be used in training. Instead, we randomly sampled 1,000 pixels from each tile and downloaded reflectances for all thirteen Sentinel-2 bands using our Sentinel Hub services. In total, we made over 14 million requests in a short period of time without noticing even a glitch in our Sentinel Hub services. Our final training sample consists of around 14 million pixels, 47% of them classified by MAJA as cloudy.
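The numbers in this paragraph can be checked with a few lines, and the sampling step can be sketched with NumPy. The tile side of 10,980 pixels is the standard size of a Sentinel-2 tile at 10 meter resolution; drawing positions without replacement is an assumption, since the post only says the pixels were sampled randomly.

```python
import numpy as np

TILE_SIZE = 10_980         # pixels per side of a Sentinel-2 tile at 10 m resolution
SAMPLES_PER_TILE = 1_000
N_TILES = 14_140

def sample_pixel_coordinates(rng, n_samples=SAMPLES_PER_TILE, size=TILE_SIZE):
    """Draw distinct (row, col) positions uniformly at random from one tile."""
    flat = rng.choice(size * size, size=n_samples, replace=False)
    return np.column_stack(np.unravel_index(flat, (size, size)))

pixels_per_tile = TILE_SIZE ** 2                   # a bit over 120 million
total_samples = N_TILES * SAMPLES_PER_TILE         # a bit over 14 million
sampled_fraction = SAMPLES_PER_TILE / pixels_per_tile

coords = sample_pixel_coordinates(np.random.default_rng(42))
print(pixels_per_tile, total_samples, f"{sampled_fraction:.2e}")
```

Only about one pixel in 120,000 is sampled from each tile, which keeps neighboring samples far apart on average and limits the spatial correlation discussed for the Hollstein data set.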

The lack of a validation set, or a poorly chosen one, can lead to the complete failure of a seemingly impressive machine learning model when it is put into production (check this blog for a very nice discussion of the issue). Cloud detection, like any other remote sensing application, is very susceptible to this problem. Simply taking a random and independent subset of the training data to validate a model is not enough. The validation set must be representative of new, unseen data, which in the field of satellite imagery means it has to come from geographically scattered data covering all possible land cover types, such as water, bare land, forests, snow, grass, cropland, urban areas, etc. The Hollstein data set fits the bill perfectly. It consists of diverse geographic areas not covered by our training set, and it’s of high quality since it was curated by a human labeler.
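The principle can be illustrated with a split that holds out whole tiles instead of random pixels. This is a generic sketch of the idea, not the procedure used in this post (which validates on a separate, hand labeled data set); the tile identifiers and the 20% fraction are made up.

```python
import numpy as np

def split_by_tile(tile_ids, validation_fraction=0.2, seed=0):
    """Return boolean masks (train, validation) that keep every tile wholly on one side."""
    tile_ids = np.asarray(tile_ids)
    rng = np.random.default_rng(seed)
    unique_tiles = np.unique(tile_ids)
    n_val = max(1, int(round(validation_fraction * unique_tiles.size)))
    val_tiles = rng.choice(unique_tiles, size=n_val, replace=False)
    is_val = np.isin(tile_ids, val_tiles)
    return ~is_val, is_val

# Five hypothetical tiles with four sampled pixels each.
tile_ids = np.repeat(["A", "B", "C", "D", "E"], 4)
train, val = split_by_tile(tile_ids)

shared_tiles = set(tile_ids[train]) & set(tile_ids[val])
print(train.sum(), val.sum(), shared_tiles)
```

A plain random split of the 20 pixels would almost certainly put pixels from the same tile on both sides, which is exactly the leakage that makes validation scores look better than production performance.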

In addition, we use over a thousand 64x64 patches labeled in-house with our own ClassificationApp as a second validation set. We used these patches for feature and model selection, while keeping the Hollstein data set for the final validation and comparison with other algorithms.

Sentinel-2 image of Maui recorded on 2017-12-12 via Sentinel Hub without and with the Sentinel Hub Cloud Detector.

Model selection and hyper-parameter tuning

Once we had the training and validation sets in place, we started experimenting with input features, models, hyper-parameters, and whatnot in order to see what works best. Here’s the executive summary:

Feature space: raw band values (Bi), pairwise band differences (Bi - Bj), pairwise band ratios (Bi/Bj), and pairwise band differences over their sums ((Bi - Bj)/(Bi + Bj)). We didn’t observe a significant improvement when using derived features instead of raw band values. Our final classifier uses the following 10 bands as input: B01, B02, B04, B05, B08, B8A, B09, B10, B11, B12.
Models: decision trees, support vector machines, Bayes tree, gradient boosting of tree based learning algorithms with XGBoost and LightGBM, and neural networks with fastai/PyTorch. The neural net gives the best accuracy but is considerably slower at inference time. The second best results are achieved with XGBoost and LightGBM. We opted for the latter since it is faster during training and inference.
Hyper-parameters: We did some hyper-parameter tuning for all the different models. The tuned hyper-parameters of the selected model (LightGBM) are: min_data_in_leaf=1500, n_estimators=170, and num_leaves=770. All others are set to their default values.
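The derived features listed above can be sketched as follows. Applying them to the 10 final bands, and the small eps added to avoid division by zero, are assumptions made for this illustration; the post does not say which bands the derived features were computed from. The hyper-parameter values are the ones quoted above, collected in a plain dictionary.

```python
from itertools import combinations
import numpy as np

BANDS = ["B01", "B02", "B04", "B05", "B08", "B8A", "B09", "B10", "B11", "B12"]

def derived_features(x, eps=1e-9):
    """Stack raw bands with pairwise differences, ratios and normalized differences.

    x has shape (n_pixels, n_bands); eps guards against division by zero (an assumption).
    """
    i, j = map(np.array, zip(*combinations(range(x.shape[1]), 2)))
    diff = x[:, i] - x[:, j]                       # Bi - Bj
    ratio = x[:, i] / (x[:, j] + eps)              # Bi / Bj
    norm_diff = diff / (x[:, i] + x[:, j] + eps)   # (Bi - Bj) / (Bi + Bj)
    return np.hstack([x, diff, ratio, norm_diff])

# Tuned values quoted in the text; everything else stays at the library defaults.
# They could be passed as keyword arguments to a LightGBM classifier.
LIGHTGBM_PARAMS = {"min_data_in_leaf": 1500, "n_estimators": 170, "num_leaves": 770}

x = np.array([[0.4, 0.2] + [0.1] * 8])   # one pixel, 10 made-up reflectances
features = derived_features(x)
print(features.shape)   # 10 raw + 3 * 45 pairwise features
```

With 10 bands there are 45 band pairs, so the derived feature space is more than ten times larger than the raw one, which makes the finding that raw band values perform about as well a convenient result for inference speed.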

Training set augmentation

To compare the cloud masks produced by our classifiers and by MAJA visually, we randomly sampled a single 512x512 patch from each MAJA-processed Sentinel-2 tile that we had downloaded. We calculated the intersection over union of the two masks and visually inspected those patches where the disagreement was large. A pattern emerged: the disagreement was large on patches over water and bare land (sand, mountains and the like). In the former case, the MAJA mask was often found to be wrong, which is not surprising given that the MAJA algorithm is optimized for cloud classification over land and not water. In the case of bare land, our classifier was wrong all the time.
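For reference, the intersection over union of two binary cloud masks is the number of pixels that both masks flag as cloud divided by the number of pixels that at least one of them flags. A minimal sketch follows; returning 1.0 when both masks are empty is a convention assumed here.

```python
import numpy as np

def intersection_over_union(mask_a, mask_b):
    """IoU of two boolean masks of equal shape (True means cloud)."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumed convention)
    return float(np.logical_and(a, b).sum() / union)

# Two toy 4x4 masks: each flags 4 pixels, 2 of which overlap.
mask_one = np.zeros((4, 4), dtype=bool)
mask_two = np.zeros((4, 4), dtype=bool)
mask_one[0, 0:4] = True
mask_two[0, 2:4] = True
mask_two[1, 2:4] = True

iou = intersection_over_union(mask_one, mask_two)  # 2 shared / 6 in the union
```

Sorting patches by this score and looking at the lowest ones first is a cheap way to find systematic disagreements between two detectors.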

Sentinel-2 image of clouds over water (left) and corresponding cloud masks as determined by MAJA (middle) and earlier versions of Sentinel Hub Cloud Detector (right). Clouds are black.

Sentinel-2 image of bare land with a small cloud in the lower right area (left) and corresponding cloud masks as determined by MAJA (middle) and earlier versions of Sentinel Hub Cloud Detector (right). Clouds are black.

To circumvent the issue of systematic misclassification of bare land as clouds, we’ve augmented the training sample with misclassified pixels from around fifty handpicked 512x512 patches. Our final classifier performs much better on bare land than our earlier versions, but it still fails from time to time. We have also added pixels from our in-house labeled 64x64 patches to the training sample of our final classifier.

Validation and comparison with other cloud detectors

Uff. Congratulations. You’ve made it to the most interesting part. Here’s the reward for sticking around until the very end: a comparison of the performance of popular cloud detection algorithms, including our own, on Sentinel-2 imagery:

Fmask: We used the Python implementation (version 0.4.5) of the Fmask algorithm. We ran it with default settings at 20 meter resolution and a fixed buffer size of 3 pixels.
Sen2Cor: We ran version 2.4.0 with default settings at 20 meter resolution. In the tables below a pixel is considered to be cloudy if it is classified as cirrus or cloud with medium probability by the algorithm.
MAJA: We downloaded masks at 10 meter resolution.
Sentinel Hub: We ran our classifier on Sentinel-2 scenes at 10 meter resolution, convolved the probability map with a disk with a radius of 22 pixels, and dilated by 11 pixels.
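The post-processing described for the Sentinel Hub detector (averaging with a disk, then dilation) can be sketched with SciPy. The radii of 22 and 11 pixels come from the text; the probability threshold of 0.5, the order average, threshold, dilate, the disk-shaped dilation footprint, and the border handling are all assumptions made for this illustration.

```python
import numpy as np
from scipy import ndimage

def disk(radius):
    """Boolean disk-shaped footprint with the given radius in pixels."""
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    return x * x + y * y <= radius * radius

def cloud_mask(prob, threshold=0.5, average_radius=22, dilation_radius=11):
    """Turn a per-pixel cloud probability map into a binary cloud mask."""
    kernel = disk(average_radius).astype(float)
    kernel /= kernel.sum()                                   # averaging kernel
    smoothed = ndimage.convolve(prob, kernel, mode="nearest")
    mask = smoothed > threshold                              # threshold value is assumed
    return ndimage.binary_dilation(mask, structure=disk(dilation_radius))

# Toy example with small radii: a 5x5 block of certain cloud on a clear background.
prob = np.zeros((21, 21))
prob[8:13, 8:13] = 1.0
small_mask = cloud_mask(prob, average_radius=2, dilation_radius=1)
```

Averaging suppresses isolated false positives, because a single bright pixel cannot push the local mean over the threshold, while dilation adds a safety margin around the cloud edges.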

As mentioned above, we use the hand labeled data set of Hollstein et al. for validation and comparison. The results for single-scene algorithms on 108 hand labeled Sentinel-2 scenes with more than 6 million pixels are:

Cloud and cirrus cloud detection rates and land, water, snow and shadow misclassification rates as clouds as determined using 108 Sentinel-2 scenes hand labeled by Hollstein et al.

To the best of our knowledge, this is the first comparison of the most popular cloud masking algorithms on a large human labeled data set of Sentinel-2 imagery. As can be seen, our cloud detector developed for Sentinel Hub performs better than current state-of-the-art single-scene algorithms, yielding a higher cloud detection rate while at the same time having a much lower misclassification rate of land and snow as clouds. The misclassification of shadow as clouds is higher for our algorithm on account of dilation, which enlarges the cloudy regions. A cloud shadow is almost always next to a cloud in a satellite image, so inflated cloudy regions will most often include cloud shadow regions as well. This is not really an issue since shadow masking usually follows cloud masking in a pre-processing chain.
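The two kinds of rates used in this comparison are the same computation applied to different classes: for each hand labeled class, the fraction of its pixels that a detector flags as cloud. For the cloud and cirrus classes this is the detection rate; for the other classes it is the misclassification rate. A small sketch with made-up labels follows (the class names come from the Hollstein data set as described earlier).

```python
import numpy as np

CLASSES = ["clear", "cloud", "shadow", "snow", "cirrus", "water"]

def rates_flagged_as_cloud(labels, predicted_cloud):
    """For each class present in labels, the fraction of its pixels flagged as cloud."""
    labels = np.asarray(labels)
    predicted_cloud = np.asarray(predicted_cloud, dtype=bool)
    rates = {}
    for name in CLASSES:
        selected = labels == name
        if selected.any():
            rates[name] = float(predicted_cloud[selected].mean())
    return rates

# Ten made-up pixels: hand labels and a detector's binary output.
labels = ["cloud", "cloud", "cloud", "cloud", "clear", "clear",
          "shadow", "shadow", "snow", "water"]
predicted = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]

rates = rates_flagged_as_cloud(labels, predicted)
```

In this toy example the cloud detection rate is 0.75 and the shadow misclassification rate is 0.5, the same trade-off discussed above: a more inclusive mask raises both numbers.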

The single-scene algorithms can be compared with the MAJA multi-temporal algorithm on a subset of Hollstein’s data set containing only 15 Sentinel-2 scenes from western Europe and northern Africa (see figure above). The results are:

Cloud and cirrus cloud detection rates and land, water, snow and shadow misclassification rates as clouds as determined using 15 Sentinel-2 scenes hand labeled by Hollstein et al.

Before we comment on the above results, we would like to stress again that this subset is clearly not representative of global land cover types and variations in climate. These results should not be used to draw any conclusions about the performance of the four classifiers on a global scale. The MAJA multi-temporal algorithm clearly outperforms single-scene algorithms in terms of land misclassification as clouds, while in the case of cloud and cirrus detection rates our classifier is better. The limitations of this subset are also visible in the fact that our classifier correctly identifies all cloudy pixels on it, but when subjected to a larger data set with more variation its cloud detection rate is no longer 100% (see the table with results using 108 Sentinel-2 scenes). The high misclassification rates for single-scene algorithms on this subset of 15 tiles, compared to the rates shown above for 108 tiles, can be explained by the fact that a large fraction of these 15 tiles include bare land with sand that looks very bright in images and is difficult for single-scene algorithms to classify correctly. It would be very interesting to have MAJA classification masks for all 108 tiles from Hollstein’s data set for a fairer comparison.

Sentinel-2 image of Abarrancamento recorded on 2017-12-15 via Sentinel Hub without and with the Sentinel Hub Cloud Detector.

Concluding remarks

A machine learning approach can give state-of-the-art results if the training and validation sets are both of good enough quality and representative of the unseen data. Procuring labeled samples in remote sensing that are suitable for developing models that perform well on a global scale is particularly challenging. We’re fortunate to be living on a planet with such versatile landscapes. But this versatility is a curse for machine learning. Models based on machine learning do, however, have the ability to constantly evolve and improve. With the new labeled data sets that we will curate with the help of the community, and the feedback we hope to get from our users, we are certain that the performance of our cloud detector can only improve. Stay tuned for future developments!

Interested in using Sentinel Hub Cloud Detector?

Contact us if you’re interested in using Sentinel Hub Cloud Detector. We will wait a bit to get additional feedback on our findings that confirms or challenges our results. Once we are confident in our algorithm, we plan to process the entire Sentinel-2 archive and provide cloud masks for our Sentinel Hub users as an additional layer to be used in Custom scripts.

Update (22 January 2018)

The Sentinel Hub Cloud Detector is available as the Python package s2cloudless. See:

Sentinel Hub Cloud Detector — s2cloudless
