Images of an area at multiple time frames.

On cloud detection with multi-temporal data

A pair of data scientists’ experiences with their heads among the clouds

Jernej Puc
Oct 14, 2019 · 12 min read

by Jernej Puc and Lojze Žust

Our story begins with Sinergise entering the Cloud Masking Inter-comparison Exercise (CMIX), an international collaborative initiative, which is expected to contribute to a better understanding of the strengths and weaknesses of different algorithms. As cloud masking is crucial for retrieval of accurate surface reflectances within atmospheric correction processes, taking part in the initiative presented us with an opportunity to reexamine our existing methods and consider potential alternatives. Specifically, a multi-temporal processor had been long on our radar, and it was high time we brought it about.


s2cloudless appears to handle most situations properly.

However, there are certain scenarios, in which it misses the mark. Bright objects such as houses, dirt roads, beaches, desert sand, and snow, for example, are known to be potential sources of issues.

Examples of beaches, desert sand, and snow, which confuse the classifier.

Observing the reflectance values of misclassified pixels, they can appear so alike actual clouds that our current setup might be practically unable to distinguish between them. One way of tackling this problem would be to use additional context from the area surrounding the pixel, since one can usually discern between objects, such as houses and clouds, by their shape alone. A convolutional network might be apt for this approach, but can be too demanding for the amount of data that we are dealing with.

The second approach focuses on additional observations of the same geographic area but on different dates. Structures, which are frequently misclassified, can be expected to change very little over time. Therefore, observing an area over multiple time frames, one should be able to accurately perceive which cloud-like pixels are permanent (houses, beaches, etc.) and which are temporary (actual clouds). In one way or another, this is the intuition behind all multi-temporal cloud detectors.

An example of an area that is repeatedly misclassified.

The state of the art

MAJA (MACCS ATCOR Joint Algorithm) actually performs more than just cloud detection: it detects clouds and their shadows, estimates aerosol optical thickness and water vapour content, and corrects for atmospheric effects. It is a full-fledged atmospheric correction processor and has been applied to several satellites. However, it is admittedly complicated and not easy to use.

Instead of relying on specific mono and multi-temporal criteria as MAJA does, we opted for the machine learning approach and encountered a familiar problem: the lack of a large high quality labeled data set, on which we could train our algorithm.

With s2cloudless, we turned to the second-best option and used a collection of MAJA’s cloud masks as a proxy for ground truth. This time around, we had no choice but to do the same, although the question of sensibility thus became more prominent; since s2cloudless is itself mono-temporal, using a multi-temporal algorithm as a reference seemed much less of an issue.

Building a multi-temporal algorithm

Temporal reflectance description

Temporal statistics of Sentinel-2’s B01 spectral band.

Relative reflectance description

Statistics of differences in Sentinel-2’s B01 spectral band, relative to the target frame.

Correlation of pixel neighbourhoods

MAJA’s definition of correlation is related to the covariance⁴ and appears to be the same as the sample (Pearson) correlation coefficient:

Formula for the Pearson correlation coefficient: x and y refer to a spatial window at two different time frames.

Internally, we have experimented with the Pearson coefficient before, although with less promising results, and began to question whether it was even appropriate for our use case. While it can be used as a metric that “describes how well the two images X and Y correlate over an area of N by N pixels”, the correlation it observes is linear.

From Wikipedia: Several sets of (x, y) points, with the correlation coefficient of x and y for each set. Note that the correlation reflects the non-linearity and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0 but in that case, the correlation coefficient is undefined because the variance of Y is zero.

For image data, this means that a visible gradient in the cloud’s reflectance may cause the correlation to be misreported, or worse, if the cloud is uniform, variance and consequently correlation can be close to undefined. The Pearson coefficient should, thus, work as expected only in the case of clouds with sufficiently variable reflectance in the considered window (pixel neighbourhood) and as such might not carry the most meaningful information.

Having practically written it off after trying to get around these issues in vain, we happened to come upon the so-called structural similarity index (SSIM):

SSIM formula: μ refers to the average, σ² to the variance and σxy to the covariance of windows x and y. The computation of averages and variances should include the use of a Gaussian weighting function to avoid undesirable blocking artifacts⁵. The constants are introduced to stabilise the division with a small denominator.

The equation for computing SSIM includes the same components that are (implicitly) used with the Pearson coefficient, which allowed us to reuse much of our previous work and optimisations. And while SSIM was also not intended for our use case, its ability to compare two images based on their luminance, contrast, and structure, in particular, appeared promising in the context of discussion.

Comparing a cloudy image of mountains with a clear reference, the Pearson coefficient unexpectedly tends to correlate clouds and mountain peaks, as they happen to display a similar linear gradient. Fortunately, SSIM appears less susceptible to such occurrences and functions similarly, but more in line with our expectations.

Feature importances

After some experimentation, we have decided on the final selection of features and conducted our training with the same 10 Sentinel-2 spectral bands that s2cloudless uses, as alluded to by the graph below:

Feature importances based on the number of splits for each selected feature. Basing the importances on information gain would put too much emphasis on the initial split(s), creating an unbalanced and unreadable graph, but can be meaningful in its own right (for example, it implied that the maximum difference in the B01 spectral band is the best single rule of thumb you can make for detecting a cloud).

The shown feature importances lead us to the following observations:

  • Visible peaks occur periodically at higher and lower wavelengths, which appear to be the most descriptive in the case of clouds.
  • Among the most important features are window reflectance averages, which should carry the same information as pixel reflectance values, only without the noise.
  • We expected the standard deviation of structural similarity indices to be highly variable and therefore a poor descriptor in separating classes. In contrast to our intuition, it turned out moderately prominent.


We resorted to using the data set curated by Hollstein et al.¹, which includes samples that belong to one of 6 different classes: cloud, cirrus, land, water, snow, and shadow. This is useful as it can highlight where the algorithm is failing or performing particularly well. It should be noted that the samples in Hollstein’s data set are part of only several thousand polygons, hand-drawn over mostly uniform objects. What this means is that it effectively covers only a handful of regions and the achieved results may not generalise as expected.

Comparison of confusion matrices, which represent the percentages of different labeled pixels that were classified as one class or the other during validation. Note: Our results for s2cloudless slightly deviate from when it was validated before due to a different processing resolution.
Comparison of performance on the validation set according to different metrics.

While the difference in behaviour was noticeable, the SSIM-based classifier sacrificed too many cirrus detections (or thin clouds in general) in order to dismiss comparatively fewer misclassifications. This trade-off is a common problem with multi-temporal cloud detectors.

Nonetheless, we were confident that our results can be sufficiently improved.

Improving the algorithm

Separation of clouds and snow

After confirming the validation results above, we decided to address the poor separation of clouds and snow. We were familiar with a certain region, consisting of mountain ranges, that was problematic in this regard, for which we then had to find enough suitable examples — images that were completely cloud-free beyond a reasonable doubt — and include them in our data set. This can be a time-consuming process, even with the help of a powerful tool such as the eo-browser, which proved useful in the task.

Additional examples noticeably altered the algorithm’s behaviour:

Comparison of confusion matrices.
Comparison of performance on the validation set.

Despite observing a general improvement after including our new data, we were pressed for time and had to consider other means for further refinement.

Intersection of predictions

A closer inspection indicated that s2cloudless is able to properly detect the majority of actual clouds, both thick and thin, and that its shortcomings mostly stem from relatively frequent false positives (misclassifications of non-clouds). This meant that we could use a simple intersection of predictions: practically all of the legitimate clouds, detected by the multi-temporal classifier, should be detected by s2cloudless as well, while false positives can be rejected due to either of the classifiers.

The intersection of predictions is implemented as a logical AND operation between binary masks produced by thresholding the output probabilities of both classifiers. Since the thresholds can be individually set, their combinations can produce observably distinct results, which became apparent during validation. The thresholds that were chosen to produce the results below were set to acceptably reduce the false positive rate, specifically misclassifications of land and shadows, but are not optimal in terms of overall accuracy, as we carried the risk of overfitting.

Comparison of confusion matrices, where InterSSIM stands for the results obtained via intersection of predictions and relies on the SSIM-based classifier, which had hard negative data added to its training data set.
Comparison of performance on the validation set.

It is also revealing to take a look at the ROC space, plotting the true positive rate (TPR) against the false positive rate (FPR). It can be observed that by combining the predictions and properly adjusting their thresholds, we have been able to improve the TPR of the augmented SSIM-based classifier while achieving an even lower FPR.

The ROC space in full (left) and up-close (right). The best performers should be leading the upper-left front, as high TPR and low FPR are desired

By setting the thresholds in favour of s2cloudless, the TPR should increase, which thus allows an additional degree of flexibility regarding the true and false positive rates, depending on which is more desirable in a given situation. In any case, successful results would show InterSSIM mostly performing better at least in terms of the latter, which was our original intention.

Ratios of detections: brighter pixels indicate higher ratios, i.e. that clouds were often detected, while darker areas correspond to fewer detections overall.

Practical results

It can be observed that our multi-temporal algorithm indeed reduced the rate of false positive detections. The SSIM-based classifier on its own is not presumed to work better than s2cloudless in every case, but appears to be pulling most of the weight in the above examples.

Conclusions and the road ahead

Indeed, certain concerns remain unaddressed. In our past and present work, features were limited by what had seemed sensible to us and we can take some comfort in being able to rationalise around and elaborate on our choices.

However, perhaps we need not have made them at all: by designing a single end-to-end model, which could learn to extract the most important features on its own and use those to train its later stages, both at once, we could have avoided this step entirely. We are confident that our research will eventually continue but perhaps based on a different architecture.

Do you have an idea of your own? If you are interested in our work or Earth observation in general, you can join us and contribute to our research endeavours.


[2]: Olivier Hagolle, Mireille Huc, Camille Desjardins, Stefan Auer, & Rudolf Richter. (2017, December 7). MAJA Algorithm Theoretical Basis Document (Version 1.0). Zenodo.

[3]: Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 3149–3157.

[4]: Lyapustin, A., Wang, Y., & Frey, R. (2008). An automatic cloud mask algorithm based on time series of MODIS measurements. Journal of Geophysical Research, 113, D16207.

[5]: Zhou Wang, Member, Alan Conrad Bovik, Hamid Rahim Sheikh, Eero P. Simoncelli. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing, vol. 13, no. 4, April 2004.

Sentinel Hub Blog

Stories from the next generation satellite imagery platform

Thanks to Anze Zupanc, Matic Lubej, and EO Research

Jernej Puc

Written by

Student of mathematics and computer science

Sentinel Hub Blog

Stories from the next generation satellite imagery platform

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade