Sentinel-2 image of an area in Slovenia, blending into a map of predicted land cover classes.

Land Cover Classification with eo-learn: Part 1

Mastering Satellite Image Data in an Open-Source Python Environment

Foreword

About half a year ago, the very first commit was pushed to the eo-learn GitHub page. Today, eo-learn has grown into a remarkable piece of open-source software, ready to be put to use by anyone who is curious about EO data. Even the members of the EO research team here at Sinergise have long awaited the moment to finally switch from building the necessary tools to actually using them for data science and machine learning. The time has come to present a series on land use and land cover classification using eo-learn.

eo-learn is an open-source Python library that acts as a bridge between Earth Observation/Remote Sensing and the Python ecosystem for data science and machine learning. We already have a dedicated blog post here, which you are encouraged to read. The library uses numpy arrays and shapely geometries to store and handle remote sensing data. It is currently available in our GitHub repo and you can find further documentation at the ReadTheDocs page.

Sentinel-2 image and the overlaid NDVI mask of an area in Slovenia, taken in the winter season.

To showcase eo-learn, we have decided to present our multi-temporal processing pipeline for land-use-land-cover classification for the Republic of Slovenia (the country we live in), using annual data for the year 2017. Since the whole procedure might be a bit overwhelming in a single-story format, we have decided to split it into two parts. This also forces us not to rush towards training the classifier, but to first really understand the data we’re dealing with. Each part will be accompanied by an example Jupyter Notebook. However, for those who are curious, we have already prepared a full notebook covering all the steps.

  • In the first part of the series, we will guide you through the process of selecting/splitting an area-of-interest (AOI), and obtaining the corresponding information like Sentinel-2 band data and cloud masks. An example of how to add a raster reference map from vector data is also shown. All of these are necessary steps towards obtaining a reliable classification result.
  • In the second part, we will really put on our working gloves for preparing the data for machine learning. This involves randomly sampling a subset of the training/testing pixels, filtering out scenes that are too cloudy, performing linear interpolation in the temporal dimension to “fill the gaps”, and so on. When the data is prepared, we will train our classifier, validate it, and, of course, show some pretty plots!

Sentinel-2 image and the overlaid NDVI mask of an area in Slovenia, taken in the summer season.

Area-of-Interest? Take Your Pick!

The framework of eo-learn allows splitting the AOI into smaller patches that can be processed with limited computational resources. In this example, the boundary of the Republic of Slovenia (RS) was taken from Natural Earth, but an AOI of any size can be selected. A buffer was added to the boundary, so the resulting bounding box of RS has a size of about 250 km × 170 km. Using the magic of the geopandas and shapely Python packages, we implemented a tool for splitting the AOI. In this case we split the country-wide bounding box into 25 × 17 equal parts. Keeping only the parts that overlap the country results in ~300 patches of about 1,000 × 1,000 pixels at a 10 m resolution. The splitting choice depends on the amount of available resources, so the pipeline can be executed on a high-end scientific machine (with a large number of CPUs and a large memory pool), as well as on a laptop (we try our best to reach out to users of all scales). The output of this step is a list of bounding boxes covering the AOI.

The area-of-interest (Republic of Slovenia) split into smaller patches of approximately 1,000 × 1,000 pixels at 10 m resolution.
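
The grid arithmetic behind the split can be sketched in a few lines of plain Python. This is a minimal illustration, not the splitting tool used for the post: the function name `split_bbox` and the bounding-box coordinates are hypothetical, and a projected coordinate system in metres is assumed. The full grid has 25 × 17 cells; dropping the cells that do not overlap the country would need a geometry test (for example shapely's `intersects`) and is not shown here.

```python
def split_bbox(min_x, min_y, max_x, max_y, n_cols, n_rows):
    """Split a bounding box into n_cols x n_rows equal cells.

    Returns a list of (min_x, min_y, max_x, max_y) tuples, column by column.
    """
    cell_w = (max_x - min_x) / n_cols
    cell_h = (max_y - min_y) / n_rows
    cells = []
    for col in range(n_cols):
        for row in range(n_rows):
            x0 = min_x + col * cell_w
            y0 = min_y + row * cell_h
            cells.append((x0, y0, x0 + cell_w, y0 + cell_h))
    return cells


# Hypothetical bounding box in metres (a projected CRS is assumed),
# about 250 km wide and 170 km tall as described in the text.
aoi_bbox = (0.0, 0.0, 250_000.0, 170_000.0)
cells = split_bbox(*aoi_bbox, n_cols=25, n_rows=17)

# Size of one cell in pixels at 10 m resolution
resolution_m = 10
first = cells[0]
width_px = (first[2] - first[0]) / resolution_m
height_px = (first[3] - first[1]) / resolution_m

print(len(cells))           # number of grid cells before any filtering
print(width_px, height_px)  # patch size in pixels
```

Each cell is 10 km × 10 km, which is where the patch size of about 1,000 × 1,000 pixels at 10 m resolution comes from.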

Obtaining Open-Access Sentinel Data

With the bounding boxes of the empty patches in place, eo-learn enables the automatic download of Sentinel image data. In this example, we obtain the Sentinel-2 L1C bands for each patch for acquisition dates within the 2017 calendar year. However, Sentinel-2 L2A products or additional imaging sources (e.g. Landsat-8, Sentinel-1) could similarly be added to the processing pipeline. In fact, using L2A products might improve the classification results, but we decided to use L1C products to make the process globally applicable. This was executed using sentinelhub-py, a Python package that acts as a wrapper for the Sentinel-Hub OGC web services. Sentinel-Hub services are subscription-based, but free accounts for research institutes and start-ups are available.

True-colour images of a single patch at different time frames. Some frames are cloudy, indicating the need for a cloud detector.

In addition to the Sentinel data, eo-learn makes it possible to seamlessly access cloud masks and cloud probabilities, generated with the open-source s2cloudless Python package. This package provides automated cloud detection in Sentinel-2 L1C imagery and is based on a single-scene pixel-based classification. It is described in detail in this blog.

Cloud probability masks of a single patch for different time frames (same as above). The colour scale represents the probability for a cloudy pixel, ranging from blue (low probability) to yellow (high probability).
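
To make the link between cloud probabilities and cloud masks concrete, here is a minimal numpy sketch that thresholds a probability stack into a binary mask and reports the cloudy fraction of each time frame. The data is random, and the threshold of 0.4 is an assumption chosen for illustration; it does not call s2cloudless itself.

```python
import numpy as np

# Hypothetical cloud probabilities for one patch: (time, height, width), values in [0, 1)
rng = np.random.default_rng(seed=0)
cloud_prob = rng.random((5, 4, 4))

# Assumed threshold; a suitable value depends on the detector and the use case
CLOUD_THRESHOLD = 0.4

cloud_mask = cloud_prob > CLOUD_THRESHOLD  # True where a pixel is treated as cloudy
valid_mask = ~cloud_mask                   # True where a pixel is usable

# Fraction of cloudy pixels in each time frame
cloud_coverage = cloud_mask.mean(axis=(1, 2))
print(cloud_coverage)
```

A per-frame coverage value like this is one simple way to spot the frames that are too cloudy to be useful.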

Adding the Reference Data

Supervised classification methods require a reference map, or ground truth. The latter term should not be taken literally, as the reference map is a mere approximation of what lies on the ground. Unfortunately, the classifier performance greatly depends on the quality of the reference map, as is the case for most machine learning problems (see the garbage in, garbage out principle). Reference maps are most commonly available as vector data in a shapefile (e.g. provided by the government or open-source communities). eo-learn already provides functionality to burn the vector data into a patch as a raster mask.

The process of burning vector data into raster masks for a single patch. The left image shows the plotted polygons of the provided vector file, the centre image shows the split raster masks for each land-cover label, black and white indicating the positive and negative samples, respectively. The image on the right shows the merged raster mask with different colours for different labels.
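
The idea of burning polygons into a raster can be illustrated with a small numpy sketch. This is not how eo-learn does it internally; it is a simplified stand-in that marks a pixel when its centre lies inside the polygon, using the even-odd (ray casting) rule. The function name `burn_polygon`, the polygons, and the label values are all hypothetical, and the polygon vertices are assumed to be given in pixel coordinates already.

```python
import numpy as np


def burn_polygon(mask, polygon, label):
    """Write `label` into every pixel of `mask` whose centre lies inside `polygon`.

    `polygon` is a list of (x, y) vertices in pixel coordinates. The even-odd
    (ray casting) rule decides which pixel centres are inside.
    """
    height, width = mask.shape
    # Pixel centres: column j has x = j + 0.5, row i has y = i + 0.5
    xs, ys = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    inside = np.zeros(mask.shape, dtype=bool)

    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        if y1 == y2:
            continue  # a horizontal edge never crosses a horizontal ray
        # The edge counts if it straddles the pixel's y and crosses to the right of it
        straddles = (y1 > ys) != (y2 > ys)
        x_cross = x1 + (ys - y1) * (x2 - x1) / (y2 - y1)
        inside ^= straddles & (xs < x_cross)

    mask[inside] = label
    return mask


# A 6 x 6 raster where 0 means "no reference data"
raster = np.zeros((6, 6), dtype=np.uint8)
burn_polygon(raster, [(1, 1), (4, 1), (4, 4), (1, 4)], label=1)  # e.g. "forest"
burn_polygon(raster, [(4, 0), (6, 0), (6, 2), (4, 2)], label=2)  # e.g. "water"
print(raster)
```

Burning each label in turn into the same array gives a merged mask like the one in the right-hand image, with one integer value per land-cover label.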

Put it All Together

All of these tasks behave as building blocks and can be put together into a nifty workflow, which is then executed for each patch. Due to the potentially large number of such patches, an automation of the processing pipeline is absolutely crucial.

Code snippet of the presented pipeline, which is executed for each patch.
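
As a rough illustration of the building-block idea, independent of eo-learn's actual API, the sketch below chains a few placeholder tasks and runs them on one patch. Every name here (`run_pipeline`, the task functions, the dictionary keys) is hypothetical, and the tasks only record that they ran; in the real pipeline each one would download or compute data.

```python
def run_pipeline(tasks, patch):
    """Apply each task to the patch in order; every task takes and returns a dict."""
    for task in tasks:
        patch = task(patch)
    return patch


# Placeholder tasks (hypothetical): each stands in for one step described above
def download_bands(patch):
    return {**patch, "bands": "Sentinel-2 L1C bands"}


def add_cloud_mask(patch):
    return {**patch, "cloud_mask": "cloud mask and probabilities"}


def add_reference_map(patch):
    return {**patch, "reference": "rasterized reference map"}


pipeline = [download_bands, add_cloud_mask, add_reference_map]

# An empty patch is described only by its bounding box (made-up coordinates)
empty_patch = {"bbox": (0.0, 0.0, 10_000.0, 10_000.0)}
filled_patch = run_pipeline(pipeline, empty_patch)
print(list(filled_patch))
```

Because every task has the same shape (patch in, patch out), steps can be added, removed, or reordered without touching the rest of the pipeline.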

Getting familiar with the data at hand is one of the first steps a data scientist should take. By applying the cloud masks to the Sentinel-2 image data, one can, for example, determine the number of valid observations for each pixel, or even the average cloud probability over an area. This gives a deeper insight into the data, which comes in handy when you’re tackling inevitable problems later in the pipeline.

True colour image (left), map of valid pixel counts for the year 2017 (centre), and the averaged cloud probability map for the year 2017 (right) for a random patch in the AOI.
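
Both maps reduce the time dimension of a per-patch stack. A minimal numpy sketch, using random data and assuming that a pixel counts as valid when its cloud probability is below a threshold of 0.4 (both the data and the threshold are assumptions):

```python
import numpy as np

# Hypothetical cloud probabilities for one patch: (time, height, width)
rng = np.random.default_rng(seed=1)
n_frames, height, width = 20, 3, 3
cloud_prob = rng.random((n_frames, height, width))

# Assumed validity rule: a pixel is valid when it is unlikely to be cloudy
is_valid = cloud_prob < 0.4

valid_count = is_valid.sum(axis=0)         # per-pixel number of valid observations
mean_cloud_prob = cloud_prob.mean(axis=0)  # per-pixel average cloud probability

print(valid_count)
print(mean_cloud_prob.round(2))
```

In practice the validity rule may also need to exclude pixels with no data at all, which this sketch leaves out.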

One might even be interested in the mean NDVI over an area after filtering out the clouds. This can easily be done by applying the cloud masks over the regions and calculating the mean of any feature only for the valid pixels. Applying the cloud masks cleans the features, which makes them more informative in the classification step.

Mean NDVI of all pixels in a random patch throughout the year. The blue line shows the result with cloud filtering applied, while the orange line shows the calculation with cloudy pixels included.
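
The comparison in the plot can be reproduced in miniature. NDVI is computed as (NIR − Red) / (NIR + Red), and the mean is taken per time frame, either over all pixels or over valid pixels only. The function name `mean_ndvi` and the tiny input arrays are made up for illustration, and the sketch assumes NIR + Red is never zero.

```python
import numpy as np


def mean_ndvi(red, nir, valid=None):
    """Mean NDVI per time frame for stacks of shape (time, height, width).

    If `valid` (a boolean stack of the same shape) is given, only pixels
    marked True contribute to the mean.
    """
    ndvi = (nir - red) / (nir + red)  # assumes nir + red is never zero
    if valid is None:
        return ndvi.mean(axis=(1, 2))
    masked = np.where(valid, ndvi, np.nan)
    return np.nanmean(masked, axis=(1, 2))  # frames with no valid pixel give NaN


# Two time frames of a 1 x 2 patch; the second pixel is cloudy in the first frame.
# Clouds are bright in both bands, which pulls NDVI towards zero.
red = np.array([[[0.1, 0.5]], [[0.1, 0.1]]])
nir = np.array([[[0.5, 0.5]], [[0.5, 0.5]]])
valid = np.array([[[True, False]], [[True, True]]])

unfiltered = mean_ndvi(red, nir)
filtered = mean_ndvi(red, nir, valid)
print(unfiltered)
print(filtered)
```

In the first frame the cloudy pixel drags the unfiltered mean down, while the filtered mean stays at the vegetation value; in the cloud-free second frame the two agree.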

But, Will it Scale?

Once the single-patch setup is complete, the only thing left to do is to let eo-learn do the same process for all patches automatically and, if resources allow, in parallel, while you relax with a cup of coffee and think about how the big boss will be impressed by all this work that you’ve done. When the work is finished and your machine can take a breath, it’s possible to export the data of interest into GeoTIFF images. The gdal_merge.py script then takes these images, rules them all, brings them together, and in the darkness binds them to create a countrywide result image.
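
Running the same function over many patches in parallel can be sketched with the Python standard library. This is not eo-learn's own executor; `process_patch` is a hypothetical stand-in for the single-patch pipeline and only returns an output file name. A thread pool is used here on the assumption that the work is dominated by downloads; for CPU-heavy steps a process pool would be the more natural choice.

```python
from concurrent.futures import ThreadPoolExecutor


def process_patch(patch_id):
    """Hypothetical stand-in for the single-patch pipeline; returns an output name."""
    return f"patch_{patch_id:03d}.tif"


patch_ids = range(8)

# pool.map returns results in the same order as the inputs
with ThreadPoolExecutor(max_workers=4) as pool:
    outputs = list(pool.map(process_patch, patch_ids))

print(outputs)
```

The resulting list of per-patch files is the kind of input that a merging step would then combine into one countrywide image.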

Number of valid Sentinel-2 observations for this AOI in the year 2017. The regions with higher count numbers are areas where the swaths of both Sentinel-2A and B overlap, while this does not happen in the middle part of the AOI.

In the image above, we see that there are twice as many valid observations over the course of the year at the left/right edges of the AOI, due to the overlapping swaths of Sentinel-2A and B. This makes our input heterogeneous, meaning that we need to take steps to unify our input, such as performing interpolation in the temporal dimension.

Executing the presented pipeline in sequence takes about 140 seconds per patch, amounting to ~12 hours to run the process over the whole AOI. The entire AOI consists of about 300 patches, corresponding to an area of 20,000 km². Most of this time is spent downloading Sentinel-2 image data. An average uncompressed patch with the described setup takes about 3 GB of storage, amounting to ~1 TB of storage for the entire AOI. If resources allow, the process can also be run on several CPUs, which should reduce the overall run time.
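
The totals above follow directly from the per-patch figures. A short back-of-the-envelope check, where the `parallel_hours` helper is a hypothetical, idealized estimate that assumes perfect scaling across workers (unlikely when downloads dominate):

```python
n_patches = 300          # approximate number of patches in the AOI
seconds_per_patch = 140  # sequential processing time per patch
gb_per_patch = 3         # uncompressed storage per patch

total_hours = n_patches * seconds_per_patch / 3600
total_tb = n_patches * gb_per_patch / 1000


def parallel_hours(n_workers):
    """Idealized run time with perfect scaling; real speed-ups will be smaller."""
    return total_hours / n_workers


print(round(total_hours, 1))        # sequential run time in hours
print(total_tb)                     # storage in TB
print(round(parallel_hours(4), 1))  # optimistic estimate for 4 workers
```

The sequential estimate comes out just under 12 hours and the storage just under 1 TB, matching the rounded figures in the text.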

A Jupyter Notebook Example

To help you dive into the code of eo-learn, we have prepared an example that covers the material discussed in this post. The example is written as a handy Jupyter Notebook and you can find it in the examples directory of the eo-learn package. Feel free to contact us if you have feedback, or create a pull request!

That’s it! Now you know how to start preparing an AOI for your own application of land cover classification, or even something else. Let us know in what kind of interesting ways you are applying eo-learn!

By the time the next part of the blog series is out, your AOI patches will hopefully be full of Sentinel data and ready for the next steps in the land-use-land-cover classification pipeline!

Good luck!


Meet us at ESA Φ-week 2018

The European Space Agency is organising its biggest event to date, Φ-week, from the 12th to the 16th of November 2018, in Frascati, Italy. Our Sentinel Hub team will be there at almost full capacity, so make sure you come and meet us.