Land Cover Classification with eo-learn: Part 2

Going from Data to Predictions in the Comfort of Your Laptop

A temporal stack of Sentinel-2 images of a small area in Slovenia, followed by a land cover prediction, obtained via methods presented in this post.


The second part of the blog series on land use and land cover classification with eo-learn is out! This part picks up from the first one, where we presented a basic approach towards the following:

In addition, we explored the data in order to be more familiar with it, which is very important before actually diving into the machine learning (ML) part. The above tasks were accompanied by an example in the Jupyter Notebook format, which has now been updated to account for the material presented in this blog. We strongly suggest that you take another look at the first blog post and refresh your memory of what has been covered there.

In this part, we focus on the remainder of the data preparation, as well as the training of a machine learning classifier and making some predictions of the land cover in Slovenia for the year 2017.

Preparing the Data

The actual machine learning part in the whole processing pipeline is very small compared to the whole workflow. The majority of the work consists of data cleaning, data manipulation, and data reshaping, in order to obtain seamless usage with a machine learning classifier. The data preparation part is done in the steps described below.

Diagram of a machine learning pipeline, showing that the ML code actually represents a relatively small part of the ML pipeline. Source: Sculley et al. Hidden Technical Debt in Machine Learning Systems, NIPS 2015

Cloudy Scene Filtering

Clouds are objects which generally occur at scales larger than our average EOPatch (1000 x 1000 px at 10 m resolution). This means that a randomly observed area can be completely covered by clouds on certain dates. Such frames contain no valuable information and only consume resources, so we discard them by calculating the ratio of valid pixels and setting a threshold. Valid pixels are all pixels which are not classified as clouds and are located inside the satellite swath (what the satellite “sees”). Note that we do not use satellite scenes’ cloud coverage meta-data since those are calculated on the tile level (S2 tile size is 10980 x 10980 pixels, around 110 x 110 km²) and are thus mostly irrelevant for the specific AOI. For the cloud detection, we use the s2cloudless algorithm to get actual cloudy pixels.

In our example notebook, the threshold is set to 0.8, so only time frames with a valid coverage ratio larger than 80 % are selected. This may seem like a pretty high threshold, but since cloud coverage is not so problematic in our selected area of interest, we can afford to do this. However, this step does not generalise well for all locations, as some of them can be cloudy for the better part of the year.

Temporal Interpolation of Pixels

Due to non-constant acquisition dates of the satellites and irregular weather conditions, missing data is very common in the field of EO. One way to tackle this problem is to apply the mask of valid pixels (defined in the previous step) and interpolate the values in order to “fill the gaps”. After the interpolation process, pixel values at uniform or non-uniform dates can be evaluated to unify the dates among all the EOPatches. In this example, we apply the linear interpolation method, but other methods are also applicable and already implemented in eo-learn.

A visual representation of a temporal stack of Sentinel-2 images over a randomly selected area. The transparent pixels on the left imply missing data due to cloud coverage. The stack on the right represents the pixel values after temporal interpolation, taking cloud masks into account.

Temporal information is very important in land cover classification and even more so in crop type classification. This is due to the fact that a lot of information about the land cover is embedded in the temporal evolution of the land. For example, looking at interpolated values of the NDVI, one can see in the image below that the values for the forest and cultivated land type peak in spring/summer and drop in autumn/winter, while for the water and artificial surface type these values remain mostly unchanged throughout the year. The artificial surface type has slightly larger values of NDVI and, to some extent, mimics the temporal evolution of vegetated lands due to the fact that urban areas often consist of trees and small parks. Additionally, due to the resolution effects, pixel values can consist of combined spectral responses from several different land cover types.

Temporal evolution of NDVI values for pixels of selected land cover types through the year.

Negative Buffer Application

While the 10 m resolution of the Sentinel-2 satellites is good enough for a wide spectrum of applications, the effects of objects smaller than that are noticeable. Such examples are borders between different types of land cover, where the values of the pixels contain signatures from both neighbouring types but are assigned to only one of the two classes in the reference map. This introduces noise when training the classifier and lowers the accuracy of the classifier in the prediction step. Additionally, road networks or other objects with a width of 1 px, which are defined in the reference map, are not easily recognisable from the image data. We apply a negative buffer of the size 1 px to the reference map, virtually eliminating the majority of such cases. The reference map used here is the same one as introduced in the first blog post. A part of it can be seen in the image below for the case before/after applying the negative buffer.

Reference map for a small part of the AOI before (left) and after (right) the application of the negative buffer on the map.

Random Subset Selection

As mentioned in the previous blog, the area of interest (AOI) is split into about 300 EOPatches, where each patch consists of about 1 million pixels. That is a whole lot of pixels, too much to account for all of them, so we uniformly sample about 40 000 px per EOPatch in order to obtain a dataset of about 12 million entries. Since the pixels are uniformly sampled, a lot of them also contain the “no data” label, where the information about the ground truth is not available. Such entries are of no use and are removed from the training set in order to not confuse the classifier and not teach it to predict the “no data” label. We repeat this also for the validation set because such entries unjustifiably lower the accuracy and other similar scores of the classifier in the validation step.

Splitting and Reshaping the Data

We make the train/test split at 80/20 % on the EOPatch level, meaning that pixels from the EOPatches used for training were not used for testing, and vice versa. The pixels from training EOPatches were further split into training and cross-validation sets in the same manner. Once split, we have a numpy.ndarray object of the shape (p, t, w, h, d), where p is the number of EOPatches in the subset, t is the number of resampled time frames for each EOPatch, and w, h, and d are the width, height, and depth of the EOPatches, respectively. After the subset selection, the width w corresponds to the number of selected pixels (e.g. 40 000 as above), while the height dimension h is equal to 1. The difference in the array shape does not change anything, it simply reorders the pixels so they are easier to use. The bands and band-like features d at all time frames t represent the classifier inputs used for the training, where there are p*w*h of such input entries. In order to transform the data into a form that is accepted by the classifier, we must reshape it from the 5D array to a 2D array of the form (p*w*h, d*t), which can be done with the following numpy magic:

Such a procedure then allows to make a prediction on new data of the same shape, and then transform it back into patches of the original shape, which can be again visualised with standard plotting techniques.

Machine Learning Model Construction

The optimal choice of a classifier heavily depends on the application, and even then there are a number of parameters of the model that need to be tuned for a specific application as well. In order to optimise these choices, one needs to, in principle, perform several experiments and compare the results.

In our approach to the classification, we use the LightGBM package, since it offers an intuitive, fast, distributed, and a high-performance gradient boosting framework based on decision tree algorithms. For tuning the parameters of the classifier, various approaches such as a grid search or a random search can be pursued and evaluated on the test dataset. For the purposes of this example, we will skip these procedures and use the default parameters.

Schematic of the decision trees in the LightGBM framework. Source:

The implementation of the model is very simple, and since the data is already shaped into the proper 2D array format, we just feed it to the model and wait. Congratz! Now you can tell everybody that you’re doing machine learning and you’ll be the coolest nerd at the nerd party, while your mum is concerned about rogue artificial intelligence and machine uprising.

Model Validation

Training machine learning models is easy. The hard part is training them well. In order to do so, we need an appropriate algorithm, a reliable reference map, and enough of CPU/memory resources. Even then the results might not be what you hope for, so the validation of the classifier by calculating confusion matrices and other validation metrics is absolutely crucial if you want to add credibility to your work.

Confusion Matrix

Confusion matrices are the first thing to look at in order to evaluate the model performance. They show the number of correctly and incorrectly predicted labels for each label from the reference map or vice versa. Usually, a normalised confusion matrix is shown, where all the values in the rows are divided by their collective sum. This shows if the classifier has a bias towards wrongly classifying certain types of land cover.

Two aspects of viewing the normalised confusion matrix of a trained model.

For most of the classes, the model seems to perform well. Problems occur for some of the classes due to the unbalanced training set. We see that the problematic cases are, for example, shrubland and water, where true pixels, belonging to these classes, can be misidentified as other classes. On the other hand, what is predicted as shrubland or water is in good agreement with the reference map. In the image below we notice that the problems generally occur for the under-represented classes, so one should keep in mind that these statements are only superficial, since the training sample in this example is relatively small and serves only as a proof of principle.

Frequency of pixels for each class in the training dataset. In general, the distribution is not uniform.

Receiver Operating Characteristic — ROC Curve

Classifiers predict the labels with a certain confidence, but the threshold for the prediction of a certain label can be manipulated. The ROC curve shows the diagnostic ability of a classifier as its discrimination threshold is varied. Usually, it is shown for a binary system, but it can also be used in multiclass classifications, where we calculate the “one vs. rest” characteristics for each represented class. The x-axis shows the false positive rate (we want it to be small), and the y-axis shows the true positive rate (we want it to be large) at different thresholds. Good classifier performance is characterised by a curve, which has a large integral value, also known as the area under curve (AUC). The same conclusion about the underrepresented classes based on the plot below can be made for shrubland, while the ROC curve for water looks much better, because water is more easily distinguishable, even if underrepresented.

ROC curves of the classifier, represented as “one vs. rest” for each class in the dataset. Numbers in brackets are the area-under-curve (AUC) values.

Feature Importance

A deeper insight into the intricate workings of the classifier can be obtained by looking at the feature importance, which tells us which features contributed the most to the final results. Some machine learning algorithms, like the one we use in this example, return these values as an output, while for others, these values have to be calculated separately. The plot below shows the importance of each feature at each specific date.

Feature importance map for the features used in this classification.

While other features (e.g. NDVI) in spring are generally more important, we see that there is a specific date when one of the bands (B2 — blue) is the most important feature. Taking a closer look, it turns out that the area at the time was covered with snow. It seems that the snow coverage unveils information about the underlying texture, which helps the classifier determine the correct land cover type. However, one should keep in mind that this fact is specific to the AOI that you are observing and generally cannot be relied upon.

A part of this AOI consisting of 3x3 EOPatches covered with snow.

Prediction Results

With the validation performed, we now understand the strengths and weaknesses of our prediction model. If we are not satisfied with the situation, it is possible to tweak the workflow and try again. Once the model is optimised, we define a simple EOTask which accepts the EOPatch and the classifier model, makes the prediction and applies it to the patch.

Sentinel-2 image (left), ground truth (centre) and prediction (right) for a random EOPatch in the selected AOI. Some differences are visible, which is mostly due to the application of the negative buffer on the reference map, otherwise the agreement is satisfactory for this use case.

From this point on, the path is clear. Repeat the procedure for all EOPatches. You can even export the predictions as GeoTIFF images, and merge them with

We also uploaded the merged GeoTIFF to our CloudGIS Geopedia portal, so you can view the results in greater detail here:

Screenshot of the land cover prediction for Slovenia 2017 using the approach shown in this blog post, available for detailed browsing in the CloudGIS Geopedia portal (

You can also compare official land use data with automatically classified land cover data. Note the land use and land cover difference, which represents a common challenge in ML processes — it is not always easy to map from classes used in official registers to classes that can be observed in nature. To illustrate this challenge, we show two airports in Slovenia. The first one is in Levec, near the city of Celje. This sports airfield is small, mostly used for private aeroplanes, and it is covered with grass. The official land use data marks the grass-covered landing strip as artificial surface, while the classifier is able to correctly predict the land cover as grassland, as shown below.

Sentinel-2 image (left), ground truth (centre) and prediction (right) for the area around the small sports airfield Levec, near Celje, Slovenia. The classifier correctly recognises the landing strip as grassland, which is marked as artificial surface in the official land use data.

On the other hand, looking at the Ljubljana Jože Pučnik Airport, the largest airport in Slovenia, the areas marked as artificial surface in the official land use data are tarmac runway areas and road networks. In this case, the classifier is able to recognise the built-up areas, while still correctly identifying the grassland and cultivated land in the surrounding sites.

Sentinel-2 image (left), ground truth (centre) and prediction (right) for the area around the Ljubljana Jože Pučnik Airport, the largest airport in Slovenia. The classifier recognises the tarmac runway and the road network, while still correctly identifying grassland and cultivated land in the surrounding area.

Now you know how to make a reliable prediction on the country scale! Make sure you put that in your CV. If you would like to continue developing awesome EO tools, don’t forget to send it to us as well, we are hiring!

We hope that these two blogs provide enough information for you to try to do it yourself. Using the exemplary Jupyter Notebook, you should be able to perform land classification for just about any area in the world, assuming you have some (reliable) reference data.

We are looking forward to getting any comments, ideas, and examples from you on how the process could be further improved.

We also plan to publish Part 3, the last part of this series, where we will show how to experiment with the ML workflow and try to improve the results! Additionally, we will openly share all EOPatches for Slovenia 2017 — that’s right, you heard correctly, the whole dataset, containing Sentinel-2 L1C data, s2cloudless cloud masks, reference data, etc., so everyone can try it out!
We cannot wait to see how you apply your own methods inside of eo-learn. :)
Stay tuned!