Continental-scale crop type mapping in openEO platform

In 2022, we set out to produce a continental-scale crop type map using openEO Platform. In this post, we describe the experience, lessons learned, and successes. It focuses on the user experience and the production of the map, leaving aside the methodological aspects of producing an accurate crop type map. The most relevant methodological point is that we did not simplify the workflow to make the processing easier. One more disclaimer before we start: the experiment was carried out by the same team that develops the openEO backends and the Python client.

Jeroen Dries
openEO Platform
7 min read · May 15, 2023


The full map: results of the crop type mapping

Problem setting

Generating large-scale, high-resolution crop type maps from earth observation data requires a team with a broad skill set. We see this as a major impediment to fully realizing the potential of the large archives of free and open earth observation data available today.

Taking a closer look, we identify the following challenges:

  • Gathering training data at a large scale
  • Mastering large-scale compute power, available in cloud or HPC infrastructures
  • Deploying & monitoring systems using DevOps techniques
  • Optimizing algorithms for parallel processing
  • Following FAIR principles (rich metadata & provenance)
  • Connecting to data catalogs
  • Getting the processing pipeline right

On top of all that, you may want to adhere to open science principles by making your work reproducible.

Read on to find out how we address this with openEO!

The workflow

The crop type mapping workflow we implemented follows a few fairly common steps:

The first part, referred to as ‘the preprocessing’, takes inputs covering an 8-month period:

WorldCover data is read and converted into a binary mask, retaining pixels marked as agriculture. This mask ensures that only relevant pixels are processed further.

Sentinel-2 L2A data is read, cloud-masked, and converted into 10-daily median composites. Linear interpolation fills remaining gaps due to clouds. Sentinel-1 GRD data is converted into sigma0 backscatter, from which 10-daily median composites are also derived.

Copernicus 30 m DEM data is added as well. All of this data is aligned to the same grid and combined into a single datacube.

In the second, ‘inference’ part, the preprocessed datacube is fed to a machine learning model that takes a per-pixel timeseries as input. Our models did not take spatial context into account, as the combination of multiple bands and the temporal profile was sufficient in this case.

The image below shows example output of this workflow.

Mapping into openEO

To run workflows on openEO, we always need a translation into technology-agnostic openEO workflows, called ‘process graphs’. In this case, we worked in Python, so the resulting code still looks like any other algorithm. We do point out that the openEO version is very short in terms of lines of Python code, barely longer than the short summary above. This is possible because all of the complexity (finding data, efficient reading, handling overlap, backscatter computation, compositing, …) is handled on the backend side. This is a very clear advantage, as it reduces the code volume that data scientists have to maintain, and that reviewers have to understand to analyze the methodology or reproduce results, a key element of open science.
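
To give a feeling for what this looks like, below is a minimal sketch of the preprocessing with the openEO Python client. It is not the production code: the collection IDs, band names and masking rules are illustrative and differ per backend.

```python
import openeo

# Hypothetical area of interest and 8-month season
spatial_extent = {"west": 4.5, "south": 50.7, "east": 4.7, "north": 50.9}
temporal_extent = ["2021-10-01", "2022-06-01"]

connection = openeo.connect("openeo.cloud").authenticate_oidc()

# WorldCover-based mask: True where pixels should be dropped (class 40 = cropland)
worldcover = connection.load_collection(
    "ESA_WORLDCOVER_10M_2021_V2", spatial_extent=spatial_extent, bands=["MAP"]
)
not_cropland = worldcover.band("MAP") != 40

# Sentinel-2 L2A: cloud masking, 10-daily median composites, gap filling
s2 = connection.load_collection(
    "SENTINEL2_L2A", spatial_extent=spatial_extent,
    temporal_extent=temporal_extent, bands=["B02", "B03", "B04", "B08", "SCL"],
)
scl = s2.band("SCL")
clouds = (scl == 3) | (scl == 8) | (scl == 9)  # shadow + medium/high cloud probability
s2 = s2.filter_bands(["B02", "B03", "B04", "B08"]).mask(clouds)
s2 = s2.aggregate_temporal_period(period="dekad", reducer="median")
s2 = s2.apply_dimension(dimension="t", process="array_interpolate_linear")

# Sentinel-1 GRD: sigma0 backscatter, 10-daily median composites
s1 = connection.load_collection(
    "SENTINEL1_GRD", spatial_extent=spatial_extent,
    temporal_extent=temporal_extent, bands=["VV", "VH"],
)
s1 = s1.sar_backscatter(coefficient="sigma0-ellipsoid")
s1 = s1.aggregate_temporal_period(period="dekad", reducer="median")

# Copernicus 30 m DEM
dem = connection.load_collection("COPERNICUS_30", spatial_extent=spatial_extent)

# Align everything to the Sentinel-2 grid, combine, and apply the cropland mask
cube = s2.merge_cubes(s1.resample_cube_spatial(s2))
cube = cube.merge_cubes(dem.resample_cube_spatial(s2))
cube = cube.mask(not_cropland.resample_cube_spatial(s2))
```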

Applying custom Python code to an openEO datacube

To run the machine learning model, which is based on PyTorch, we used openEO ‘user-defined functions’ (UDFs), which allow running arbitrary pieces of Python code that transform XArray data structures as part of an openEO workflow. This was needed because openEO predefined processes do not yet exist for running arbitrary deep learning models. UDFs are a key feature that drastically increases the number of use cases that can be translated into openEO, and they are also convenient when starting from an existing code base, where a full translation into openEO predefined processes would be costly.
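
As an illustration, such a UDF could look roughly like the sketch below. The model file, band layout and the way the UDF is attached to the datacube are assumptions for the example, not the production setup.

```python
# crop_classifier_udf.py -- illustrative UDF, not the production code
import torch
import xarray as xr
from openeo.udf import XarrayDataCube


def apply_datacube(cube: XarrayDataCube, context: dict) -> XarrayDataCube:
    # The incoming chunk is assumed to have dimensions (t, bands, y, x)
    array = cube.get_array().transpose("t", "bands", "y", "x")
    t, b, h, w = array.shape

    # Flatten each pixel timeseries into a (t * bands) feature vector
    features = array.values.reshape(t * b, h * w).T.astype("float32")

    model = torch.jit.load("/data/models/croptype_model.pt")  # hypothetical path
    with torch.no_grad():
        labels = model(torch.from_numpy(features)).argmax(dim=1).numpy()

    # Return a single 'croptype' band per pixel
    result = xr.DataArray(
        labels.reshape(1, h, w).astype("float32"),
        dims=["bands", "y", "x"],
        coords={"bands": ["croptype"], "y": array.y, "x": array.x},
    )
    return XarrayDataCube(result)
```

Attaching it to the workflow then takes only a couple of lines; the exact process used to plug it in (reduce_dimension, apply_neighborhood, …) depends on the model and the backend:

```python
import openeo

# 'cube' is the preprocessed datacube from the earlier sketch
udf = openeo.UDF.from_file("crop_classifier_udf.py")
predictions = cube.reduce_dimension(dimension="t", reducer=udf)
```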

Visual representation of the openEO ‘process graph’

Training the model

To train a machine learning model, we need to feed it a lot of preprocessed input data. We did this by sampling data at reference locations obtained from datasets like EuroCrops and LUCAS. Sampling data at points or polygons is directly supported by the backends. In this case, we mostly sampled points by running ‘aggregate_spatial’ for up to 1000 points in a single call.
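
A hedged sketch of such a sampling call is shown below; the reference points and output handling are placeholders, not the actual campaign code.

```python
# A handful of hypothetical reference points (in practice: EuroCrops / LUCAS
# locations, batched up to ~1000 points per call)
points = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"sample_id": 1},
         "geometry": {"type": "Point", "coordinates": [4.61, 50.81]}},
        {"type": "Feature", "properties": {"sample_id": 2},
         "geometry": {"type": "Point", "coordinates": [4.72, 50.85]}},
    ],
}

# 'cube' is the preprocessed datacube from the earlier sketch
samples = cube.aggregate_spatial(geometries=points, reducer="mean")
job = samples.execute_batch(out_format="CSV")
job.get_results().download_files("training_samples/")
```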

Sampling points over large geographic regions proved to be difficult, because this would involve multiple UTM zones, which was not yet supported by the backend. We therefore first had to cluster our reference data into geographically close areas.
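
One simple way to do such clustering is to group reference points by UTM zone; the snippet below illustrates the idea, and is not necessarily the exact clustering we used.

```python
import math
from collections import defaultdict

def utm_zone(lon: float, lat: float) -> str:
    """Rough UTM zone identifier (ignoring the Norway/Svalbard exceptions)."""
    zone = int(math.floor((lon + 180) / 6)) + 1
    return f"{zone}{'N' if lat >= 0 else 'S'}"

# Hypothetical reference points as (lon, lat) tuples
reference_points = [(4.61, 50.81), (14.2, 48.9), (-3.5, 40.2)]

clusters = defaultdict(list)
for lon, lat in reference_points:
    clusters[utm_zone(lon, lat)].append((lon, lat))

# Each cluster can then be sampled with a separate aggregate_spatial call
for zone, pts in clusters.items():
    print(zone, len(pts))
```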

Preparing for large scale processing

To kick off processing of our continental map, we had to run some tests, do some tuning, and make some choices. These are the headlines:

  • We chose to process directly in EPSG:3035, instead of UTM, so that all output would be in the same projection.
  • In the output projection, a 20 km grid was constructed. This is relatively fine-grained, but it also avoids processing too much area outside of the EU27, and it keeps the cost of reprocessing a specific tile (e.g. due to artifacts) low.
  • Per tile, we determined the amount of cropland based on WorldCover. This allowed us to process areas with lots of crops first.
  • The optimal settings per tile were determined. Processing could only start once the projected total cost was under a certain threshold. In this phase we halved the memory per executor, mainly through optimizations in the user-defined function.
  • We determined that each single-core CPU (worker) would get 3.8 GB of memory. This number is mainly based on the virtual machine configurations available at the main cloud provider (CreoDIAS).
  • We wanted to use different openEO backends and data providers, so we ran the same tile on each one. This revealed some discrepancies, which we could fix by choosing the right parameters.

All this preparation resulted in a CSV file containing all the tiles that needed to be generated. This was the basis of the processing run.
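
As an illustration of what that looks like, the sketch below builds a 20 km tile grid in EPSG:3035 over a made-up bounding box and writes it to a CSV; the real grid covered the EU27 and included WorldCover-based cropland statistics per tile.

```python
import csv

TILE = 20_000  # tile size in metres (EPSG:3035 uses metres)
# Hypothetical extent; the real grid covered the EU27
west, south, east, north = 3_900_000, 2_700_000, 4_100_000, 2_900_000

rows = []
x = west
while x < east:
    y = south
    while y < north:
        rows.append({
            "tile_id": f"E{x // 1000}N{y // 1000}",
            "west": x, "south": y, "east": x + TILE, "north": y + TILE,
            "epsg": 3035,
            "cropland_fraction": None,  # to be filled from WorldCover statistics
        })
        y += TILE
    x += TILE

with open("tiles.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```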

More technical details can be found on this page.

Running processing

With everything prepared, we kicked off our processing. On openEO Platform, we were able to run on both Terrascope and CreoDIAS infrastructure in parallel, giving us a reasonable amount of bandwidth and ensuring that processing kept going even when an individual backend was having issues.

Because all of the heavy lifting is done on the backend, the coordination and follow-up could even be done on a regular laptop. We would not recommend this as general practice, but it shows that you do not need much beyond the openEO service itself. It did mean that we occasionally had to restart processing when, for instance, the network was temporarily interrupted.

The processing script continuously submitted jobs to the backends and retrieved status updates, which did not pose a problem. We chose to run at a fairly slow but constant pace, mainly because it allowed us to inspect output as it was being produced and to intervene and tune when needed; the run also happened over the Christmas period, with little time pressure.
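
The current openEO Python client ships a helper that captures exactly this pattern (MultiBackendJobManager); the sketch below shows the idea, with backend URLs and the cube-building function as placeholders rather than the actual production script.

```python
import openeo
import pandas as pd
from openeo.extra.job_management import MultiBackendJobManager

tiles = pd.read_csv("tiles.csv")

manager = MultiBackendJobManager()
# Backend URLs are illustrative
manager.add_backend(
    "terrascope",
    connection=openeo.connect("openeo.vito.be").authenticate_oidc(),
    parallel_jobs=2,
)
manager.add_backend(
    "creodias",
    connection=openeo.connect("openeo.creo.vito.be").authenticate_oidc(),
    parallel_jobs=2,
)

def start_job(row, connection, **kwargs):
    # build_croptype_cube is a hypothetical helper that rebuilds the workflow
    # for one tile (bounding box taken from the CSV row)
    cube = build_croptype_cube(connection, row)
    return cube.create_job(title=f"croptype_{row['tile_id']}", out_format="GTiff")

# Keeps a constant number of jobs running per backend and tracks status in a CSV
manager.run_jobs(df=tiles, start_job=start_job, output_file="job_tracking.csv")
```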

Artifacts, artifacts, artifacts…

Missing pixels, bad classification and gaps

As mentioned, we still had to improve the backend during processing, mainly because a large-scale map tends to reveal issues that never surfaced when working on a few test areas. These can be due to unexpected discrepancies in the data archives, or to very specific edge cases in mosaicking and combining input data that lead to errors or incorrect output.

All these cases were eventually tracked down and resolved, but this clearly shows the difference between writing a prototype script and running an operational production. 80% of our map was produced without issue or effort, but getting the remaining 20% right cost us most of the time. Luckily, this led to improvements in the processing backend which are now available to any user of openEO Platform.

Final result

The final result consists of over 11,000 cloud-optimized GeoTIFF files, together with STAC metadata generated by openEO. The STAC metadata also contains links to the input products, to ensure provenance. We do note that such a volume remains a hassle to manage and inspect, so we are looking into automatic publishing into a STAC catalog and setting up viewing services for inspecting the results at large scale.

The final map is made available as an experimental collection in openEO Platform, and can also easily be viewed in the openEO editor. It still has known flaws, so it is for demonstration purposes only; it is expected to be replaced by the end of 2023 with a more accurate map!
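
For the curious, accessing the experimental collection from the Python client could look roughly like this; the collection ID is not hard-coded here, since it may change, so the sketch searches for it instead.

```python
import openeo

connection = openeo.connect("openeo.cloud").authenticate_oidc()

# Look up the crop type collection rather than hard-coding its ID
candidates = [c["id"] for c in connection.list_collections()
              if "CROP" in c["id"].upper()]
print(candidates)

# Download a small sample area of the first candidate (adjust as needed)
cube = connection.load_collection(
    candidates[0],
    spatial_extent={"west": 4.5, "south": 50.7, "east": 4.7, "north": 50.9},
)
cube.download("croptype_sample.tif")
```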

editor.openeo.cloud showing the croptype collection
