Sentinel Hub Blog
Published in

Sentinel Hub Blog

Images of an area at multiple time frames.

On cloud detection with multi-temporal data

A pair of data scientists’ experiences with their heads among the clouds

by Jernej Puc and Lojze Žust

Our story begins with Sinergise entering the Cloud Masking Inter-comparison Exercise (CMIX), an international collaborative initiative, which is expected to contribute to a better understanding of the strengths and weaknesses of different algorithms. As cloud masking is crucial for retrieval of accurate surface reflectances within atmospheric correction processes, taking part in the initiative presented us with an opportunity to reexamine our existing methods and consider potential alternatives. Specifically, a multi-temporal processor had been long on our radar, and it was high time we brought it about.


At Sinergise, we make extensive use of s2cloudless, our single-scene (mono-temporal) cloud detector. Publicly available since January 2018, it was developed as a fast machine learning algorithm, which could detect clouds in near real-time and give state-of-the-art results among the competition. Indeed, it performed well in validation on hand-labeled data provided by Hollstein et al.¹, as demonstrated in the original blog post, which covers the process of s2cloudless’ development.

s2cloudless appears to handle most situations properly.

However, there are certain scenarios, in which it misses the mark. Bright objects such as houses, dirt roads, beaches, desert sand, and snow, for example, are known to be potential sources of issues.

Examples of beaches, desert sand, and snow, which confuse the classifier.

Observing the reflectance values of misclassified pixels, they can appear so alike actual clouds that our current setup might be practically unable to distinguish between them. One way of tackling this problem would be to use additional context from the area surrounding the pixel, since one can usually discern between objects, such as houses and clouds, by their shape alone. A convolutional network might be apt for this approach, but can be too demanding for the amount of data that we are dealing with.

The second approach focuses on additional observations of the same geographic area but on different dates. Structures, which are frequently misclassified, can be expected to change very little over time. Therefore, observing an area over multiple time frames, one should be able to accurately perceive which cloud-like pixels are permanent (houses, beaches, etc.) and which are temporary (actual clouds). In one way or another, this is the intuition behind all multi-temporal cloud detectors.

An example of an area that is repeatedly misclassified.

The state of the art

Since the original blog post was published, there were still only a few algorithms that relied on multi-temporal data for cloud detection. Of those, we regarded MAJA² as the most accurate, based on past experiences.

MAJA (MACCS ATCOR Joint Algorithm) actually performs more than just cloud detection: it detects clouds and their shadows, estimates aerosol optical thickness and water vapour content, and corrects for atmospheric effects. It is a full-fledged atmospheric correction processor and has been applied to several satellites. However, it is admittedly complicated and not easy to use.

Instead of relying on specific mono and multi-temporal criteria as MAJA does, we opted for the machine learning approach and encountered a familiar problem: the lack of a large high quality labeled data set, on which we could train our algorithm.

With s2cloudless, we turned to the second-best option and used a collection of MAJA’s cloud masks as a proxy for ground truth. This time around, we had no choice but to do the same, although the question of sensibility thus became more prominent; since s2cloudless is itself mono-temporal, using a multi-temporal algorithm as a reference seemed much less of an issue.

Building a multi-temporal algorithm

After assembling our data set, we considered the design of s2cloudless as a starting point and decided to train a classifier based on the LightGBM³ model architecture as well. Relying merely on reflectance values of selected spectral bands was not an option in our case, as we needed to handle variability in the number of time frames of the input, which should not be expected to be consistent. Instead, additional multi-temporal features had to be extracted with some sort of statistical feature aggregation.

Temporal reflectance description

The simplest step in this regard was to describe the temporal distribution of reflectance values with their average and extremes. Clouds tend to be more reflective than the surfaces they cover, hence taking the maximum might not be very useful, as it should pool the clouds from all of the time frames into a single cloudy image. The minimum and mean, on the other hand, ought to help approximate a cloudless image, which is ideal for comparison with the target frame. Both values can be deceiving, however: the image of minima can be marred by shadows and the mean is highly dependant on (the lack of) cloud presence in the frames.

Temporal statistics of Sentinel-2’s B01 spectral band.

Relative reflectance description

Comparison of these temporal statistics with the target frame is directly related to the statistics of the differences in reflectance values. For example, the maximum difference between the target frame and every other is simply the difference between the target frame and the timewise minimum image, excluding the target frame itself. Adding these statistics to the features might, therefore, be seen as redundant, yet even such slight pre-processing can be of aid to the classifier.

Statistics of differences in Sentinel-2’s B01 spectral band, relative to the target frame.

Correlation of pixel neighbourhoods

Two different clouds over the same location on successive dates are unlikely to be of the same shape. Comparing the pixel neighbourhoods of multiple time frames should be informative in this regard and is in fact used in MAJA already. After a sequence of criteria, MAJA computes the correlation of neighbourhoods for each potential detection and if a large enough correlation is observed, the detection is rejected.

MAJA’s definition of correlation is related to the covariance⁴ and appears to be the same as the sample (Pearson) correlation coefficient:

Formula for the Pearson correlation coefficient: x and y refer to a spatial window at two different time frames.

Internally, we have experimented with the Pearson coefficient before, although with less promising results, and began to question whether it was even appropriate for our use case. While it can be used as a metric that “describes how well the two images X and Y correlate over an area of N by N pixels”, the correlation it observes is linear.

From Wikipedia: Several sets of (x, y) points, with the correlation coefficient of x and y for each set. Note that the correlation reflects the non-linearity and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0 but in that case, the correlation coefficient is undefined because the variance of Y is zero.

For image data, this means that a visible gradient in the cloud’s reflectance may cause the correlation to be misreported, or worse, if the cloud is uniform, variance and consequently correlation can be close to undefined. The Pearson coefficient should, thus, work as expected only in the case of clouds with sufficiently variable reflectance in the considered window (pixel neighbourhood) and as such might not carry the most meaningful information.

Having practically written it off after trying to get around these issues in vain, we happened to come upon the so-called structural similarity index (SSIM):

SSIM formula: μ refers to the average, σ² to the variance and σxy to the covariance of windows x and y. The computation of averages and variances should include the use of a Gaussian weighting function to avoid undesirable blocking artifacts⁵. The constants are introduced to stabilise the division with a small denominator.

The equation for computing SSIM includes the same components that are (implicitly) used with the Pearson coefficient, which allowed us to reuse much of our previous work and optimisations. And while SSIM was also not intended for our use case, its ability to compare two images based on their luminance, contrast, and structure, in particular, appeared promising in the context of discussion.

Comparing a cloudy image of mountains with a clear reference, the Pearson coefficient unexpectedly tends to correlate clouds and mountain peaks, as they happen to display a similar linear gradient. Fortunately, SSIM appears less susceptible to such occurrences and functions similarly, but more in line with our expectations.

Feature importances

We approached the selection of the most significant features by examining LightGBM’s feature importance attribute. Training a LightGBM model involves the creation of many simple decision trees by incrementally splitting the branches of each according to a feature condition that provides the greatest information gain. The number of splits or the cumulative information gain is tracked for each feature, producing the feature importances.

After some experimentation, we have decided on the final selection of features and conducted our training with the same 10 Sentinel-2 spectral bands that s2cloudless uses, as alluded to by the graph below:

Feature importances based on the number of splits for each selected feature. Basing the importances on information gain would put too much emphasis on the initial split(s), creating an unbalanced and unreadable graph, but can be meaningful in its own right (for example, it implied that the maximum difference in the B01 spectral band is the best single rule of thumb you can make for detecting a cloud).

The shown feature importances lead us to the following observations:

  • Visible peaks occur periodically at higher and lower wavelengths, which appear to be the most descriptive in the case of clouds.
  • Among the most important features are window reflectance averages, which should carry the same information as pixel reflectance values, only without the noise.
  • We expected the standard deviation of structural similarity indices to be highly variable and therefore a poor descriptor in separating classes. In contrast to our intuition, it turned out moderately prominent.


To properly asses the performance of an algorithm, one needs to be confident that the data, which is to be used for validation, is labeled correctly. In this case, quality takes precedence over quantity, which points to manually labeled data, however scarce.

We resorted to using the data set curated by Hollstein et al.¹, which includes samples that belong to one of 6 different classes: cloud, cirrus, land, water, snow, and shadow. This is useful as it can highlight where the algorithm is failing or performing particularly well. It should be noted that the samples in Hollstein’s data set are part of only several thousand polygons, hand-drawn over mostly uniform objects. What this means is that it effectively covers only a handful of regions and the achieved results may not generalise as expected.

Comparison of confusion matrices, which represent the percentages of different labeled pixels that were classified as one class or the other during validation. Note: Our results for s2cloudless slightly deviate from when it was validated before due to a different processing resolution.
Comparison of performance on the validation set according to different metrics.

While the difference in behaviour was noticeable, the SSIM-based classifier sacrificed too many cirrus detections (or thin clouds in general) in order to dismiss comparatively fewer misclassifications. This trade-off is a common problem with multi-temporal cloud detectors.

Nonetheless, we were confident that our results can be sufficiently improved.

Improving the algorithm

The quality of the training data set plays a very crucial role in determining the performance of a classifier. We may have avoided having an insufficient number of samples and an irregular representation of diverse geographic areas, but using machine annotated data has its trade-offs. Inevitable noise in the data can perhaps be discarded, however, certain biases of the original algorithm (MAJA) that tend to consistently mislabel the samples are bound to affect the training and manifest in our results as well.

Separation of clouds and snow

A common technique that can help reduce such effects is known as hard negative mining. It means supplying the data set with hand-annotated data, either by adding new, specific examples, or relabeling existing ones if they are determined to be incorrect through visual inspection.

After confirming the validation results above, we decided to address the poor separation of clouds and snow. We were familiar with a certain region, consisting of mountain ranges, that was problematic in this regard, for which we then had to find enough suitable examples — images that were completely cloud-free beyond a reasonable doubt — and include them in our data set. This can be a time-consuming process, even with the help of a powerful tool such as the eo-browser, which proved useful in the task.

Additional examples noticeably altered the algorithm’s behaviour:

Comparison of confusion matrices.
Comparison of performance on the validation set.

Despite observing a general improvement after including our new data, we were pressed for time and had to consider other means for further refinement.

Intersection of predictions

During development, s2cloudless itself made some use of hard negative mining and it resulted in a well-rounded classifier that occasionally performs better than the above multi-temporal model. We were reluctant to let go of positive results, produced by these previous efforts, by switching over to an algorithm that is, in fact, more complex. For that reason, we considered tapping directly into s2cloudless’ predictive capabilities.

A closer inspection indicated that s2cloudless is able to properly detect the majority of actual clouds, both thick and thin, and that its shortcomings mostly stem from relatively frequent false positives (misclassifications of non-clouds). This meant that we could use a simple intersection of predictions: practically all of the legitimate clouds, detected by the multi-temporal classifier, should be detected by s2cloudless as well, while false positives can be rejected due to either of the classifiers.

The intersection of predictions is implemented as a logical AND operation between binary masks produced by thresholding the output probabilities of both classifiers. Since the thresholds can be individually set, their combinations can produce observably distinct results, which became apparent during validation. The thresholds that were chosen to produce the results below were set to acceptably reduce the false positive rate, specifically misclassifications of land and shadows, but are not optimal in terms of overall accuracy, as we carried the risk of overfitting.

Comparison of confusion matrices, where InterSSIM stands for the results obtained via intersection of predictions and relies on the SSIM-based classifier, which had hard negative data added to its training data set.
Comparison of performance on the validation set.

It is also revealing to take a look at the ROC space, plotting the true positive rate (TPR) against the false positive rate (FPR). It can be observed that by combining the predictions and properly adjusting their thresholds, we have been able to improve the TPR of the augmented SSIM-based classifier while achieving an even lower FPR.

The ROC space in full (left) and up-close (right). The best performers should be leading the upper-left front, as high TPR and low FPR are desired

By setting the thresholds in favour of s2cloudless, the TPR should increase, which thus allows an additional degree of flexibility regarding the true and false positive rates, depending on which is more desirable in a given situation. In any case, successful results would show InterSSIM mostly performing better at least in terms of the latter, which was our original intention.

Ratios of detections: brighter pixels indicate higher ratios, i.e. that clouds were often detected, while darker areas correspond to fewer detections overall.

Practical results

The figure on the side displays the ratios of cloud detections within a year for 3 selected patches from Turkey. In these examples, the clouds are expected to be practically random and therefore should not accumulate in any particular part of the image. As such, one would expect to see an image of ratios that is mostly uniform. Visible structures, on the other hand, are probably caused by undesired behaviour and indicate that the algorithm misclassifies them as clouds at a perceptibly high rate.

It can be observed that our multi-temporal algorithm indeed reduced the rate of false positive detections. The SSIM-based classifier on its own is not presumed to work better than s2cloudless in every case, but appears to be pulling most of the weight in the above examples.

Conclusions and the road ahead

With InterSSIM, we have succeeded in delivering a multi-temporal alternative that fares better than our existing cloud detector, s2cloudless, according to the available metrics. As of the time of writing, it has been integrated into eo-learn and thus made publicly available.

Indeed, certain concerns remain unaddressed. In our past and present work, features were limited by what had seemed sensible to us and we can take some comfort in being able to rationalise around and elaborate on our choices.

However, perhaps we need not have made them at all: by designing a single end-to-end model, which could learn to extract the most important features on its own and use those to train its later stages, both at once, we could have avoided this step entirely. We are confident that our research will eventually continue but perhaps based on a different architecture.

Do you have an idea of your own? If you are interested in our work or Earth observation in general, you can join us and contribute to our research endeavours.


[1]: Hollstein, A.; Segl, K.; Guanter, L.; Brell, M.; Enesco, M. Ready-to-Use Methods for the Detection of Clouds, Cirrus, Snow, Shadow, Water and Clear Sky Pixels in Sentinel-2 MSI Images. Remote Sens. 2016, 8, 666.

[2]: Olivier Hagolle, Mireille Huc, Camille Desjardins, Stefan Auer, & Rudolf Richter. (2017, December 7). MAJA Algorithm Theoretical Basis Document (Version 1.0). Zenodo.

[3]: Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 3149–3157.

[4]: Lyapustin, A., Wang, Y., & Frey, R. (2008). An automatic cloud mask algorithm based on time series of MODIS measurements. Journal of Geophysical Research, 113, D16207.

[5]: Zhou Wang, Member, Alan Conrad Bovik, Hamid Rahim Sheikh, Eero P. Simoncelli. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing, vol. 13, no. 4, April 2004.




Stories from the next generation satellite imagery platform

Recommended from Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Jernej Puc

Jernej Puc

Student of mathematics and computer science

More from Medium

Introducing the ArangoDB-NetworkX Adapter