Cloud Segmentation in Landsat-8 Images
A guest blog post by Ingrid Grenet
Foreword by Sentinel Hub
This post is part of a series of guest blog posts written by script authors, talking about their entries to the Sentinel Hub Custom Script Contest. Ingrid Grenet and Houssem Farhat are among the winners in the third round of the Contest. Their winning script with detailed description is available on our GitHub repository.
Context and motivations
When collecting images from Earth Observation (EO) satellites, it is sometimes hard to exploit them due to cloud coverage. This "cloud contamination" is a well-known problem for landscape studies, and methods for identifying clouds are welcomed by the community [1, 2].
Therefore, quick and automatic computation of this coverage in each newly collected image could be useful to assess whether the observation can be used for further tasks. Moreover, detecting clouds on board satellites makes it possible to control on-board data compression and optimize satellite storage. The script presented here illustrates a solution to this problem and was developed as part of the CIAR (Chaine Image Auto Réactive) project of the Saint-Exupéry Institute of Technological Research.
Objective of the script
The goal of the script is to distinguish clouds from any type of land cover (water, soil, snow, etc.) in Earth images from the Landsat-8 satellite. It is a direct application of a machine learning model obtained with a proprietary Evolutionary Algorithm developed by the French company MyDataModels, which is part of the core engine of the WebApp called TADA.
In particular, the model returned by the algorithm performs segmentation by classifying each individual pixel of an image into the cloud or no cloud class, based only on its spectral information. Indeed, the input features of the model are the intensity values measured for the nine bands of the electromagnetic spectrum of the Landsat-8 satellite sensor OLI (Operational Land Imager), available in the Sentinel Hub EO Browser application. The nine bands, with their respective names and wavelengths, are summarized below:

Band 1 (Coastal/Aerosol): 0.43–0.45 µm
Band 2 (Blue): 0.45–0.51 µm
Band 3 (Green): 0.53–0.59 µm
Band 4 (Red): 0.64–0.67 µm
Band 5 (NIR): 0.85–0.88 µm
Band 6 (SWIR 1): 1.57–1.65 µm
Band 7 (SWIR 2): 2.11–2.29 µm
Band 8 (Panchromatic): 0.50–0.68 µm
Band 9 (Cirrus): 1.36–1.38 µm
Evolutionary Algorithm description
The Evolutionary Algorithm used within TADA is a Genetic Programming approach for Symbolic Regression. It has its own specificities, particularly in the way individuals are generated, resulting in mathematical formulae combining features and constants. As in every evolutionary algorithm, these individuals evolve through genetic operators such as mutation and crossover, which are also non-standard here. For more details about this algorithm, see [3].
The model resulting from the algorithm corresponds to the best individual encountered during the evolutionary process and is provided as a simple mathematical formula combining some of the input features. These features have been selected as the most important for explaining the considered output. One interesting advantage of this formula is its high interpretability. Moreover, it can easily be deployed and implemented on on-board systems.
In the case of a binary classification task, the algorithm returns two formulae, one per class. For each new observation, both formulae are evaluated, and the class whose formula yields the highest value is assigned to the observation.
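As an illustration, this decision rule can be sketched as follows. The two toy score functions here are stand-ins invented for the example, not the actual formulae returned by TADA:

```javascript
// Decision rule for a binary classifier that returns one score
// formula per class: evaluate both formulae on the observation
// and assign the class whose formula yields the highest value.
function classify(features, cloudScore, noCloudScore) {
  return cloudScore(features) > noCloudScore(features) ? "cloud" : "no cloud";
}

// Toy score functions (NOT the actual model formulae):
const toyCloudScore = (f) => 2 * f.red - f.swir2;
const toyNoCloudScore = (f) => f.swir2;

console.log(classify({ red: 0.9, swir2: 0.1 }, toyCloudScore, toyNoCloudScore)); // "cloud"
```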
Model training and results
In this specific case, the evolutionary algorithm within TADA has been trained for a binary classification task on a subset of around one hundred 16-bit images from the Landsat-8 database. The dataset was composed of 10,000 observations (i.e., pixels) randomly sampled from the 100 images and equally divided between the two classes. Each pixel is described by nine values corresponding to the nine electromagnetic bands of the Landsat-8 sensor.
The resulting model performs well, with accuracy, sensitivity and specificity around 0.89. The model selected only three bands as relevant for the classification between cloud and no cloud: Coastal/Aerosol, Red and SWIR2. The two formulae corresponding to the two classes are given below and highlight the model's simplicity: while the first formula is quite long, the second one only uses the value of the SWIR2 band. From the first formula we can infer that there are interactions between the SWIR2 and Coastal/Aerosol bands, and that the Coastal/Aerosol and Red bands seem to be the most important variables.
Formula of the cloud class:
2.162 - 0.796*Red + 0.972*sqrt(abs(0.0287*SWIR2*Coastal/Aerosol + 0.971*sin(Coastal/Aerosol))) + 0.024*floor(0.996*sqrt(abs(0.029*SWIR2*Coastal/Aerosol + 0.971*sin(Coastal/Aerosol))) + 0.005*abs(0.029*SWIR2*Coastal/Aerosol + 0.971*sin(Coastal/Aerosol))) - 0.180*cos(Red) + 0.005*abs(0.029*SWIR2*Coastal/Aerosol + 0.971*sin(Coastal/Aerosol))
Formula of the no cloud class:
Script description

The script developed for the contest enables the application of the model to every image of the Landsat-8 database. For each pixel of the image, it works as follows. First, it converts the pixel values of the three selected bands to 16-bit values: since the model was trained on 16-bit images (i.e., values ranging from 0 to 65,535) and the values available in EO Browser range from 0 to 1, they must be multiplied by 65,535 before the formulae are computed. Then, the two formulae are evaluated and compared to assign a class to the pixel. The pixel finally appears in white if it has been classified as cloud, and in RGB (or black) otherwise. Note that the output colors can be changed at the beginning of the script as desired.
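The per-pixel logic just described can be sketched in the style of a Sentinel Hub custom script. The cloud-class formula is the one given above; since the no cloud formula is not reproduced in this post, noCloudScore below is a hypothetical SWIR2-only placeholder, and the band mapping (B01, B04, B07) is an assumption based on the standard Landsat-8 band numbering:

```javascript
// Sketch of the per-pixel classification described above.
// noCloudScore is a HYPOTHETICAL stand-in for the actual
// SWIR2-only formula, which is not reproduced in this post.
const SCALE = 65535; // EO Browser values in [0, 1] -> 16-bit range

// Cloud-class formula from the post (ca = Coastal/Aerosol band value)
function cloudScore(ca, red, swir2) {
  const inner = 0.029 * swir2 * ca + 0.971 * Math.sin(ca);
  return 2.162 - 0.796 * red
    + 0.972 * Math.sqrt(Math.abs(0.0287 * swir2 * ca + 0.971 * Math.sin(ca)))
    + 0.024 * Math.floor(0.996 * Math.sqrt(Math.abs(inner)) + 0.005 * Math.abs(inner))
    - 0.180 * Math.cos(red)
    + 0.005 * Math.abs(inner);
}

// Placeholder for the actual no cloud formula (SWIR2 only)
function noCloudScore(swir2) {
  return swir2;
}

// Assumed mapping: B01 = Coastal/Aerosol, B04 = Red, B07 = SWIR2
function evaluatePixel(sample) {
  const ca = sample.B01 * SCALE;
  const red = sample.B04 * SCALE;
  const swir2 = sample.B07 * SCALE;
  if (cloudScore(ca, red, swir2) > noCloudScore(swir2)) {
    return [1, 1, 1]; // cloud -> white
  }
  return [sample.B04, sample.B03, sample.B02]; // otherwise natural RGB
}
```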
Below are two examples of applying the script to Landsat-8 images. For each example, the first image is the original one, the second is the result of the script where pixels predicted as cloud appear in white, and the third is the same segmentation with the no cloud class shown in black, enabling better visual assessment of the segmentation accuracy.
We can see from these examples that the model detects most types of clouds, including thin ones, although it sometimes misses some thin clouds and cloud borders, and does not always distinguish white clouds from snow, which is a hard task. Another model based on thermal infrared bands was developed in parallel; it performs multi-class classification for three classes (clouds, snow and other types of land cover, i.e., land, water, etc.) and is able to clearly differentiate snow from clouds with an accuracy greater than 0.93. However, it cannot be used in EO Browser, because the values and units of the thermal infrared bands used to train our model are not the same as those provided by EO Browser.
Conclusion

The script presented here applies a machine learning model that performs segmentation of clouds in remote sensing images. The model is the result of a particular Evolutionary Algorithm available in the TADA WebApp. It highlights the main advantages of the algorithm, including its frugality (only 10,000 pixels for training), its interpretability thanks to simple formulae, and its hardware portability.
This use case is only one among many, since the same type of script could be developed to identify other kinds of land cover such as forest, water, etc. Note that the algorithm can also perform regression and multi-class classification.
This script was developed as part of the CIAR project from Saint-Exupéry Institute of Technological Research. This project involves the following academic and industrial partners: ActiveEon, Avisto, Elsys Design, GEO4i, Inria, LEAT/CNRS, MyDataModels, Thales Alenia Space and TwinswHeel.
References

[1] R. Irish, J. Barker, S. Goward, and T. Arvidson. Characterization of the Landsat-7 ETM+ Automated Cloud-Cover Assessment (ACCA) Algorithm. Photogrammetric Engineering and Remote Sensing 72, no. 10: 1179–1188, 2006.
[2] M. Joseph Hughes and Daniel Hayes. Automated detection of cloud and cloud shadow in single-date Landsat imagery using neural networks and spatial post-processing. Remote Sensing, 6:4907–4926, 2014.
[3] A. Boisbunon, C. Fanara, I. Grenet, J. Daeden, A. Vighi, and M. Schoenauer. Zoetrope genetic programming for regression. arXiv:2102.13388, 2021.
The Sentinel Hub team would like to thank Ingrid and Houssem for their participation in our Contest.
To learn more about satellite imagery and custom scripting, we recommend checking out the Sentinel Hub Educational page and the Custom Scripts webinar. You can also visit a dedicated topic on the Sentinel Hub Forum for further information.
We would also like to invite you to take a look at the other scripts submitted to the Sentinel Hub Custom Script Contests, available here. Stay tuned: the next round of the Contest will follow soon.