Land Cover Classification with eo-learn: Part 1
Mastering Satellite Data in an Open-Source Python Environment
Foreword
About half a year ago the very first commit was pushed to the eo-learn GitHub page. Today, eo-learn has grown into a remarkable piece of open-source software, ready to be put to use by anyone who is curious about EO data. Even the members of the EO research team here at Sinergise have long awaited the moment to finally switch from building the necessary tools to actually using them for data science and machine learning. The time has come to present a series on land use and land cover classification, using eo-learn.
eo-learn is an open-source Python library that acts as a bridge between Earth Observation/Remote Sensing and the Python ecosystem for data science and machine learning. We already have a dedicated blog post here, which you are encouraged to read. The library uses numpy arrays and shapely geometries to store and handle remote sensing data. It is currently available in our GitHub repo, and you can find further documentation on the ReadTheDocs page.
To showcase eo-learn, we have decided to present our multi-temporal processing pipeline for land-use and land-cover classification of the Republic of Slovenia (the country we live in), using annual data for the year 2017. Since the whole procedure might be a bit overwhelming in a single-story format, we have decided to split it into two parts, which, at the same time, forces us not to rush towards training the classifier, but to first really understand the data we’re dealing with. The whole story is accompanied by an example Jupyter Notebook which covers all of the steps; more about the notebook can be found at the bottom of this post!
- In the first part of the series, we will guide you through the process of selecting/splitting an area-of-interest (AOI), and obtaining the corresponding information like Sentinel-2 band data and cloud masks. An example of how to add a raster reference map from vector data is also shown. All of these are necessary steps towards obtaining a reliable classification result.
- In the second part, we will put on our work gloves and prepare the data for machine learning. This involves randomly sampling a subset of the training/testing pixels, filtering out scenes that are too cloudy, performing linear interpolation along the temporal dimension to fill the gaps, and so on. Once the data is prepared, we will train our classifier, validate it, and, of course, show some pretty plots!
Area-of-Interest? Take Your Pick!
The framework of eo-learn allows splitting the AOI into smaller patches that can be processed with limited computational resources. In this example, the boundary of the Republic of Slovenia (RS) was taken from Natural Earth; however, an AOI of any size can be selected. A buffer was added to the boundary, so the resulting bounding box of RS has a size of about 250 km × 170 km. Using the magic of the geopandas and shapely Python packages, we implemented a tool for splitting the AOI. In this case, we split the country-wide bounding box into 25 × 17 equal parts, which results in ~300 patches of about 1,000 × 1,000 pixels at a 10 m resolution. The splitting choice depends on the amount of available resources, so the pipeline can be executed on a high-end scientific machine (with a large number of CPUs and a large memory pool), as well as on a laptop (we try our best to reach out to users of all scales). The output of this step is a list of bounding boxes covering the AOI.
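To give you an idea of what this looks like, here is a minimal sketch using the BBoxSplitter utility from the sentinelhub package. The shapefile name, attribute column, and buffer size are placeholders, and the 25 × 17 grid is just the choice we made for this AOI:

```python
import geopandas as gpd
from shapely.ops import unary_union
from sentinelhub import BBoxSplitter, CRS

# Natural Earth admin-0 country boundaries (hypothetical local file name)
countries = gpd.read_file('ne_10m_admin_0_countries.shp')
slovenia = countries[countries['NAME'] == 'Slovenia']

# re-project to a metric CRS (UTM zone 33N covers Slovenia) and add a small buffer
aoi_shape = unary_union(slovenia.to_crs(epsg=32633).geometry).buffer(500)

# split the AOI bounding box into a 25 x 17 grid; only boxes intersecting the
# country shape are kept, which is why we end up with ~300 patches
bbox_splitter = BBoxSplitter([aoi_shape], CRS.UTM_33N, (25, 17))
bbox_list = bbox_splitter.get_bbox_list()
print(f'AOI split into {len(bbox_list)} bounding boxes')
```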
Obtaining Open-Access Sentinel Data
With the bounding boxes of the empty patches in place, eo-learn enables the automatic download of Sentinel image data. In this example, we obtain the Sentinel-2 L1C bands for each patch for acquisition dates within the 2017 calendar year. However, Sentinel-2 L2A products or additional imaging sources (e.g. Landsat-8, Sentinel-1) could similarly be added to the processing pipeline. In fact, using L2A products might improve the classification results, but we decided to use L1C products to make the process globally applicable. This was executed using sentinelhub-py, a Python package that acts as a wrapper for the Sentinel-Hub OGC web services. Sentinel-Hub services are subscription-based, but free accounts for research institutes and start-ups are available.
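A sketch of the download step is shown below. The task and parameter names follow recent eolearn.io releases (older versions used tasks such as S2L1CWCSInput), so treat it as an outline rather than the exact code from our pipeline:

```python
import datetime
from eolearn.core import FeatureType
from eolearn.io import SentinelHubInputTask
from sentinelhub import DataCollection

# request the Sentinel-2 L1C bands at 10 m resolution for one patch
input_task = SentinelHubInputTask(
    data_collection=DataCollection.SENTINEL2_L1C,
    bands_feature=(FeatureType.DATA, 'BANDS'),
    resolution=10,
    maxcc=0.8,                                    # skip scenes with >80% cloud coverage
    time_difference=datetime.timedelta(hours=2),  # merge acquisitions closer than 2 h
)

# download all acquisitions from 2017 for the first patch
eopatch = input_task.execute(
    bbox=bbox_list[0],
    time_interval=('2017-01-01', '2017-12-31'),
)
```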
In addition to the Sentinel data, eo-learn makes it possible to seamlessly access cloud masks and cloud probabilities, generated with the open-source s2cloudless Python package. This package provides automated cloud detection in Sentinel-2 L1C imagery and is based on a single-scene pixel-based classification. It is described in detail in this blog.
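The cloud detector can also be used directly on a downloaded band stack. A minimal sketch with s2cloudless, assuming band_data is a (time, height, width, bands) array of the ten L1C bands the classifier expects, with reflectances scaled to [0, 1]; the threshold, averaging, and dilation settings are illustrative and should be tuned to your resolution:

```python
from s2cloudless import S2PixelCloudDetector

# band_data: (time, height, width, 10) array of L1C reflectances (assumed to exist)
cloud_detector = S2PixelCloudDetector(threshold=0.4, average_over=4, dilation_size=2)

cloud_probs = cloud_detector.get_cloud_probability_maps(band_data)  # per-pixel probabilities
cloud_masks = cloud_detector.get_cloud_masks(band_data)             # binary cloud masks
```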
Adding the Reference Data
Supervised classification methods require a reference map, or ground truth. The latter term should not be taken literally, as the reference map is merely an approximation of what actually lies on the ground. Unfortunately, classifier performance greatly depends on the quality of the reference map, as is the case for most machine-learning problems (see the garbage in, garbage out principle). Reference maps are most commonly available as vector data in a shapefile (e.g. provided by the government or open-source communities). eo-learn already provides the functionality to burn such vector data into a patch as a raster mask.
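A hedged sketch of that rasterisation step, using the VectorToRaster task from eolearn.geometry. Argument names vary a bit between eo-learn versions, and the shapefile and its lulc_id column are hypothetical:

```python
import numpy as np
import geopandas as gpd
from eolearn.core import FeatureType
from eolearn.geometry import VectorToRaster

# reference polygons with a numeric land-cover label per feature (hypothetical file/column)
reference = gpd.read_file('lulc_reference_2017.shp')

rasterization_task = VectorToRaster(
    reference,
    (FeatureType.MASK_TIMELESS, 'LULC'),         # output raster feature on the patch
    values_column='lulc_id',                     # polygon attribute burned into the raster
    raster_shape=(FeatureType.MASK, 'IS_DATA'),  # reuse the spatial shape of an existing mask
    raster_dtype=np.uint8,
)
eopatch = rasterization_task.execute(eopatch)
```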
Put it All Together
All of these tasks behave as building blocks and can be put together into a nifty workflow, which is then executed for each patch. Due to the potentially large number of such patches, automation of the processing pipeline is absolutely crucial.
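Conceptually, chaining the tasks looks something like the sketch below. It follows the older LinearWorkflow API from the time of this post (recent eo-learn releases use EOWorkflow and EONode instead), and add_clouds_task and save_task are hypothetical placeholders alongside the tasks sketched above:

```python
from eolearn.core import LinearWorkflow

# chain the building blocks into one pipeline (task objects as sketched above)
workflow = LinearWorkflow(input_task, add_clouds_task, rasterization_task, save_task)

# run the whole chain for a single patch
workflow.execute({
    input_task: {'bbox': bbox_list[0], 'time_interval': ('2017-01-01', '2017-12-31')},
    save_task: {'eopatch_folder': 'eopatch_0'},
})
```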
Getting familiar with the data at hand is one of the first steps a data scientist should take. By utilising the cloud masks together with the Sentinel-2 image data, one can, for example, determine the number of valid observations for each pixel, or the average cloud probability over an area. This gives a deeper insight into the data, which comes in handy when you’re tackling the inevitable problems later in the pipeline.
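Counting cloud-free observations per pixel, for instance, boils down to a couple of numpy reductions over the masks stored in a patch. The feature names IS_DATA, CLM, and CLP below are the ones used in our examples; yours may differ:

```python
import numpy as np

is_data = eopatch.mask['IS_DATA'].astype(bool)   # valid-data mask, shape (t, h, w, 1)
clouds = eopatch.mask['CLM'].astype(bool)        # s2cloudless cloud mask

valid = is_data & ~clouds
valid_counts = valid.sum(axis=0).squeeze()                    # valid observations per pixel
mean_cloud_prob = eopatch.data['CLP'].mean(axis=0).squeeze()  # average cloud probability
```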
One might even be interested in the mean NDVI over an area after filtering out the clouds. This can easily be done by applying the cloud masks over the region and calculating the mean of any feature only for the valid pixels. By applying the cloud masks we clean the features, making them more informative in the classification step.
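A sketch of such a cloud-filtered mean NDVI, assuming the 13 L1C bands are stored in the order B01…B12, so that red (B04) is index 3 and NIR (B08) is index 7:

```python
import numpy as np

bands = eopatch.data['BANDS']            # shape (t, h, w, 13)
red, nir = bands[..., 3], bands[..., 7]
ndvi = (nir - red) / (nir + red + 1e-8)  # small epsilon avoids division by zero

# average NDVI per acquisition, using only valid, cloud-free pixels
valid = (eopatch.mask['IS_DATA'].astype(bool) & ~eopatch.mask['CLM'].astype(bool)).squeeze(-1)
mean_ndvi = np.array([ndvi[t][valid[t]].mean() for t in range(ndvi.shape[0])])
```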
“But, Will it Scale?”
Once the single-patch setup is complete, the only thing left to do is to let eo-learn run the same process for all patches automatically and, if resources allow, in parallel, while you relax with a cup of coffee and think about how impressed the big boss will be by all the work you’ve done. When the work is finished and your machine can take a breath, you can export the data of interest into GeoTIFF images. The gdal_merge.py script then takes these images, rules them all, brings them together, and in the darkness binds them into a single countrywide image.
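A hedged sketch of this export-and-merge step: the ExportToTiffTask name follows recent eo-learn releases (older versions call it ExportToTiff), the VALID_COUNT feature is hypothetical, and gdal_merge.py is assumed to be available on the command line via GDAL:

```python
import glob
import subprocess
from eolearn.core import FeatureType
from eolearn.io import ExportToTiffTask

# write a timeless feature (e.g. per-pixel valid counts) of each patch to GeoTIFF
export_task = ExportToTiffTask((FeatureType.MASK_TIMELESS, 'VALID_COUNT'), folder='tiffs')

# ... run export_task for every patch, then merge the tiles into one image
tiffs = sorted(glob.glob('tiffs/*.tiff'))
subprocess.run(['gdal_merge.py', '-o', 'valid_count_slovenia_2017.tiff', *tiffs], check=True)
```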
In the image above, we see that there are twice as many valid pixels over the course of the year at the left/right edges of the AOI, due to the overlapping swaths of Sentinel-2A and Sentinel-2B. This makes our input heterogeneous, meaning that we need to take steps to unify it, such as performing interpolation along the temporal dimension.
Executing the presented pipeline sequentially takes about 140 seconds per patch, amounting to ~12 hours for the whole AOI of about 300 patches, which corresponds to an area of 20,000 km². Most of this time is spent downloading the Sentinel-2 image data. An average uncompressed patch with the described setup takes about 3 GB of storage, amounting to ~1 TB for the entire AOI. If resources allow, running the process on several CPUs in parallel is also possible, which should reduce the overall run time.
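eo-learn also provides an EOExecutor class for running a workflow over many patches; a plain-Python sketch of the same idea with concurrent.futures, reusing the hypothetical workflow and task names from above, looks roughly like this:

```python
from concurrent.futures import ProcessPoolExecutor

def process_patch(patch_id, bbox):
    """Run the single-patch workflow for one bounding box."""
    workflow.execute({
        input_task: {'bbox': bbox, 'time_interval': ('2017-01-01', '2017-12-31')},
        save_task: {'eopatch_folder': f'eopatch_{patch_id}'},
    })

# the worker count is bounded by memory as much as by CPUs (~3 GB per patch)
with ProcessPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(process_patch, idx, bbox) for idx, bbox in enumerate(bbox_list)]
    for future in futures:
        future.result()   # re-raise any per-patch errors
```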
A Jupyter Notebook Example
In order to more easily immerse yourself in the code of eo-learn, we have prepared an example which covers the material discussed in this post. The example is written as a handy Jupyter Notebook, and you can find it in the examples directory of the eo-learn package. Feel free to contact us if you have feedback, or to create a pull request!
That’s it! Now you know how to start preparing an AOI for your own land cover classification application, or something else entirely. Let us know in what interesting ways you are applying eo-learn! Speaking of applying, we are also hiring new people to help us develop and improve in this era of machine learning in EO, so contact us for an opportunity to work together!
By the time the next part of this blog series is out, your AOI patches will hopefully be full of Sentinel data and ready for the next steps in the land-use and land-cover classification pipeline!
Good luck!
Link to Part 2: https://medium.com/sentinel-hub/land-cover-classification-with-eo-learn-part-2-bd9aa86f8500
Link to Part 3: https://medium.com/sentinel-hub/land-cover-classification-with-eo-learn-part-3-c62ed9ecd405
eo-learn is a by-product of the Perceptive Sentinel European project. The project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No. 776115.