Land Cover Classification with eo-learn: Part 1
Mastering Satellite Image Data in an Open-Source Python Environment
About half a year ago, the very first commit was pushed to the eo-learn GitHub page. Today, eo-learn has grown into a remarkable piece of open-source software, ready to be put to use by anyone who is curious about EO data. Even the members of the EO research team here at Sinergise have long awaited the moment to finally switch from building the necessary tools to actually using them for data science and machine learning. The time has come to present a series on land use and land cover classification using eo-learn.
eo-learn is an open-source Python library that acts as a bridge between Earth Observation/Remote Sensing and the Python ecosystem for data science and machine learning. We already have a dedicated blog post here, which you are encouraged to read. The library uses numpy arrays and shapely geometries to store and handle remote sensing data. It is currently available in our GitHub repo and you can find further documentation at the ReadTheDocs page.
To showcase eo-learn, we have decided to present our multi-temporal processing pipeline for land-use-land-cover classification for the Republic of Slovenia (the country we live in), using annual data for the year 2017. Since the whole procedure might be a bit overwhelming in a single-story format, we have decided to split it into two parts, which, at the same time, forces us not to rush towards training the classifier, but to first really understand the data we’re dealing with. Each part will be accompanied by an example Jupyter Notebook. However, for those curious, we have already prepared a full notebook covering all the steps.
- In the first part of the series, we will guide you through the process of selecting/splitting an area-of-interest (AOI), and obtaining the corresponding information like Sentinel-2 band data and cloud masks. An example of how to add a raster reference map from vector data is also shown. All of these are necessary steps towards obtaining a reliable classification result.
- In the second part, we will really put on our working gloves for preparing the data for machine learning. This involves randomly sampling a subset of the training/testing pixels, filtering out scenes that are too cloudy, performing linear interpolation in the temporal dimension to “fill-the-gaps”, and so on. When the data is prepared, we will train our classifier, validate it, and, of course, show some pretty plots!
Area-of-Interest? Take Your Pick!
The eo-learn framework allows splitting the AOI into smaller patches that can be processed with limited computational resources. In this example, the boundary of the Republic of Slovenia (RS) was taken from Natural Earth; however, an AOI of any size can be selected. A buffer was added to the boundary, so the resulting bounding box of RS has a size of about 250 km × 170 km. Using the magic of the shapely Python package, we implemented a tool for splitting the AOI. In this case, we split the country-wide bounding box into 25 × 17 equal parts and kept only the parts that overlap the AOI, which results in ~300 patches of about 1,000 × 1,000 pixels at a 10 m resolution. The splitting choice depends on the amount of available resources, so the pipeline can be executed on a high-end scientific machine (with a large number of CPUs and a large memory pool), as well as on a laptop (we try our best to reach out to users of all scales). The output of this step is a list of bounding boxes covering the AOI.
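To make the splitting step more tangible, here is a minimal plain-Python sketch of a grid split. It is not the tool used in the pipeline: the coordinates, the `split_bbox` and `overlaps` helpers, and the rectangular stand-in for the country outline are all made up for illustration. A real implementation would test each box against the actual AOI polygon (for example with shapely) rather than against a rectangle. Dropping the boxes that do not touch the AOI is also a plausible reason why a 25 × 17 grid (425 boxes) ends up as only ~300 patches.

```python
def split_bbox(bbox, n_cols, n_rows):
    """Split (min_x, min_y, max_x, max_y) into n_cols x n_rows equal boxes."""
    min_x, min_y, max_x, max_y = bbox
    width = (max_x - min_x) / n_cols
    height = (max_y - min_y) / n_rows
    return [
        (min_x + i * width, min_y + j * height,
         min_x + (i + 1) * width, min_y + (j + 1) * height)
        for i in range(n_cols)
        for j in range(n_rows)
    ]


def overlaps(a, b):
    """Return True if two axis-aligned boxes share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


# Hypothetical AOI bounding box in metres (250 km x 170 km), arbitrary origin.
aoi_bbox = (0.0, 0.0, 250_000.0, 170_000.0)
grid = split_bbox(aoi_bbox, n_cols=25, n_rows=17)

# Stand-in for the country outline: a smaller rectangle inside the bounding box.
# This is an assumption for the sketch; a real AOI is an irregular polygon.
country_stand_in = (30_000.0, 20_000.0, 220_000.0, 150_000.0)
patches = [box for box in grid if overlaps(box, country_stand_in)]

# Each box is 10 km wide, i.e. 1,000 pixels at a 10 m resolution.
pixels_per_side = (grid[0][2] - grid[0][0]) / 10
print(len(grid), len(patches), pixels_per_side)
```

The result is a plain list of bounding boxes, which is all the later steps need in order to process the patches one by one.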
Obtaining Open-Access Sentinel Data
With the bounding boxes of the empty patches in place,
eo-learn enables the automatic download of Sentinel image data. In this example, we obtain the Sentinel-2 L1C bands for each patch for acquisition dates within the 2017 calendar year. However, Sentinel-2 L2A products or additional imaging sources (e.g. Landsat-8, Sentinel-1) could similarly be added to the processing pipeline. In fact, using L2A products might improve the classification results, but we decided to use L1C products to make the process globally applicable. This was executed using
sentinelhub-py, a Python package that acts as a wrapper for the Sentinel-Hub OGC web services. Sentinel-Hub services are subscription-based, but free accounts for research institutes and start-ups are available.
In addition to the Sentinel data,
eo-learn makes it possible to seamlessly access cloud masks and cloud probabilities, generated with the open-source
s2cloudless Python package. This package provides automated cloud detection in Sentinel-2 L1C imagery and is based on a single-scene pixel-based classification. It is described in detail in this blog.
Adding the Reference Data
Supervised classification methods require a reference map, or ground truth. The latter term should not be taken literally, as the reference map is a mere approximation of what lies on the ground. Unfortunately, the classifier performance greatly depends on the quality of the reference map, as is the case for most machine learning problems (see the garbage in, garbage out principle). Reference maps are most commonly available as vector data in a shapefile (e.g. provided by the government or open-source communities).
eo-learn already provides functionality to burn the vector data into a patch as a raster mask.
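If "burning" vector data into a raster sounds abstract, the following sketch shows the idea on a toy example. It is not eo-learn's implementation: the `point_in_polygon` and `burn_polygon` helpers, the coordinates, and the class id are hypothetical, and the approach (testing every pixel centre with an even-odd ray-casting rule) is chosen for readability rather than speed.

```python
import numpy as np


def point_in_polygon(x, y, polygon):
    """Even-odd ray-casting test of a point against a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        # Does a horizontal ray from (x, y) towards +x cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def burn_polygon(polygon, class_id, bbox, shape):
    """Rasterize one polygon into a (rows, cols) uint8 mask; 0 means no data."""
    min_x, min_y, max_x, max_y = bbox
    rows, cols = shape
    pixel_w = (max_x - min_x) / cols
    pixel_h = (max_y - min_y) / rows
    mask = np.zeros(shape, dtype=np.uint8)
    for r in range(rows):
        # Row 0 is the northern (top) edge of the patch.
        y = max_y - (r + 0.5) * pixel_h
        for c in range(cols):
            x = min_x + (c + 0.5) * pixel_w
            if point_in_polygon(x, y, polygon):
                mask[r, c] = class_id
    return mask


# Hypothetical 100 m x 100 m patch at 10 m resolution with one rectangular field.
field = [(20.0, 20.0), (70.0, 20.0), (70.0, 60.0), (20.0, 60.0)]
reference = burn_polygon(field, class_id=3, bbox=(0.0, 0.0, 100.0, 100.0), shape=(10, 10))
print(reference)
```

The output is an integer array aligned with the patch's pixel grid, where every pixel whose centre falls inside the polygon carries the class label.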
Put It All Together
All of these tasks behave as building blocks and can be put together into a nifty workflow, which is then executed for each patch. Due to the potentially large number of such patches, an automation of the processing pipeline is absolutely crucial.
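The building-block idea can be sketched in a few lines of plain Python. Note that this does not use eo-learn's own classes: `make_workflow` and the three placeholder tasks are hypothetical, the patches are simple dictionaries, and a thread pool from the standard library stands in for whatever executor you prefer.

```python
from concurrent.futures import ThreadPoolExecutor


def make_workflow(*tasks):
    """Chain tasks so that each one receives the output of the previous one."""
    def run(patch):
        for task in tasks:
            patch = task(patch)
        return patch
    return run


# Hypothetical placeholder tasks; each takes a patch dict and returns an updated copy.
def download_bands(patch):
    return {**patch, "bands": f"bands for {patch['bbox']}"}


def add_cloud_mask(patch):
    return {**patch, "cloud_mask": "mask"}


def add_reference_map(patch):
    return {**patch, "reference": "raster"}


workflow = make_workflow(download_bands, add_cloud_mask, add_reference_map)

# Made-up bounding boxes standing in for the list produced by the AOI split.
bboxes = [(0, 0, 10, 10), (10, 0, 20, 10), (0, 10, 10, 20)]
empty_patches = [{"bbox": bbox} for bbox in bboxes]

# The same workflow is applied to every patch; map() keeps the input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(workflow, empty_patches))

print(sorted(results[0].keys()))
```

Because every task has the same shape (patch in, patch out), adding or removing a step does not require touching the code that loops over the patches.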
Getting familiar with the data at hand is one of the first steps a data scientist should take. By utilising the cloud masks on the Sentinel-2 image data, one can, for example, determine the number of valid observations for each pixel, or even the average cloud probability over an area. This gives a deeper insight into the data, which comes in handy when you’re tackling inevitable problems later in the pipeline.
One might even be interested in the mean NDVI over an area after filtering out the clouds. This can easily be done by applying the cloud masks over the regions and calculating the mean of any feature only over the valid pixels. Applying the cloud masks cleans the features, which makes them more useful in the classification step.
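Here is a minimal numpy sketch of these per-pixel and per-date statistics. Everything in it is an assumption made for illustration: the data are synthetic, the arrays are laid out as (time, height, width), and the 0.4 cloud probability threshold is an arbitrary choice rather than a recommended value.

```python
import numpy as np

rng = np.random.default_rng(0)
n_times, height, width = 6, 4, 5

# Synthetic reflectances and cloud probabilities, shape (time, height, width).
red = rng.uniform(0.02, 0.2, size=(n_times, height, width))
nir = rng.uniform(0.2, 0.6, size=(n_times, height, width))
cloud_prob = rng.uniform(0.0, 1.0, size=(n_times, height, width))

is_cloud = cloud_prob > 0.4  # hypothetical threshold
valid = ~is_cloud

# Number of valid (cloud-free) observations for each pixel over the year.
valid_count = valid.sum(axis=0)

# Average cloud probability over the area for each acquisition date.
mean_cloud_prob = cloud_prob.mean(axis=(1, 2))

# Mean NDVI over the area per date, using only the valid pixels.
ndvi = (nir - red) / (nir + red)
valid_per_date = valid.sum(axis=(1, 2))
ndvi_sum = np.where(valid, ndvi, 0.0).sum(axis=(1, 2))
mean_ndvi = np.where(
    valid_per_date > 0,
    ndvi_sum / np.maximum(valid_per_date, 1),  # avoid division by zero
    np.nan,                                    # no valid pixel on that date
)

print(valid_count)
print(mean_ndvi)
```

Dates without a single valid pixel get a NaN instead of a misleading zero, so they are easy to spot and to filter out later.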
Once the single-patch setup is complete, the only thing left to do is to let
eo-learn do the same process for all patches automatically and, if resources allow, in parallel, while you relax with a cup of coffee and think about how the big boss will be impressed by all this work that you’ve done. When the work is finished and your machine can take a breath, it’s possible to export the data of interest into GeoTIFF images. The
gdal_merge.py script then takes these images, rules them all, brings them together, and in the darkness binds them to create a countrywide result image.
In the image above, we see that there are twice as many valid pixels over the course of the year at the left/right edges of the AOI, due to the overlapping swaths of the Sentinel-2A and Sentinel-2B satellites. This makes our input heterogeneous, meaning that we need to take steps to unify it, such as performing interpolation in the temporal dimension.
Executing the presented pipeline sequentially takes about 140 seconds per patch, which amounts to ~12 hours for the whole AOI of about 300 patches, corresponding to an area of 20,000 km². Most of this time is spent downloading Sentinel-2 image data. An average uncompressed patch with the described setup takes about 3 GB of storage, amounting to ~1 TB for the entire AOI. If resources allow, the process can also be run on several CPUs, which should reduce the overall processing time.
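The totals follow directly from the per-patch figures, as the short calculation below shows. The `ideal_hours` helper is a hypothetical addition that assumes a perfectly linear speed-up with the number of workers; since most of the time is spent downloading, treat it as an optimistic bound rather than a prediction.

```python
n_patches = 300
seconds_per_patch = 140
gb_per_patch = 3

# Sequential run time and total storage for the whole AOI.
total_hours = n_patches * seconds_per_patch / 3600  # about 11.7 hours
total_tb = n_patches * gb_per_patch / 1000          # about 0.9 TB


def ideal_hours(n_workers):
    """Best-case run time, assuming the work splits evenly across workers."""
    return total_hours / n_workers


print(f"{total_hours:.1f} h sequential, {total_tb:.1f} TB of storage")
print(f"{ideal_hours(4):.1f} h with 4 workers (ideal case)")
```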
A Jupyter Notebook Example
To help you dive into the code of eo-learn, we have prepared an example that covers the material discussed in this post. The example is written as a handy Jupyter Notebook and you can find it in the examples directory of the eo-learn package. Feel free to contact us if you have feedback, or to create a pull request!
That’s it! Now you know how to start preparing an AOI for your own application of land cover classification, or even something else. Let us know in what kind of interesting ways you are applying eo-learn! Speaking of applying, we are also hiring new people to help us develop and improve in this era of machine learning in EO. Contact us for an opportunity to work together!
By the time the next part of the blog series is out, your AOI patches will hopefully be full of Sentinel data and ready for the next steps in the land-use-land-cover classification pipeline!
Update (2019-01-10): The second part of the series is available here: https://medium.com/sentinel-hub/land-cover-classification-with-eo-learn-part-2-bd9aa86f8500