Sinergise is a proud partner of the New IACS Vision in Action (NIVA) H2020 project, which perfectly aligns with our continuous effort in advancing the Integrated Administration and Control System (IACS) within the EU’s Common Agricultural Policy (CAP). NIVA aims at improving IACS by exploiting the potential of digital technologies, eventually reducing administrative burden for farmers, paying agencies and other stakeholders.
Specifically, Sinergise’s role within NIVA is to develop an automatic system for parcel boundary delineation based on Sentinel-2 imagery. The delineated boundaries can aid the farmers speed up the declaration process and the paying agencies to better monitor changes in agricultural use. This system also provides complementary information to our area-monitoring service. In this blog post, we describe the steps of our workflow and show some of the achieved results.
In a nutshell, the aim of this system is to estimate the extent of agricultural parcels given cloud-free remote sensing imagery. Sentinel-2 satellites provide high revisit frequency and high spatial resolution (i.e. 10–20–60 meter spatial resolutions) imagery, making it an ideal candidate for such an application, in particular given that it is freely available. As several paying agencies are part of NIVA, access to geo-spatial aid application (GSAA) declarations was available for training of a supervised deep learning model, and that is the route we took.
In this post we’ll focus on the entire Lithuania, using GSAA data collected in 2019 and applied to the year 2020. Here is the summary workflow:
- download Sentinel-2 L1C imagery for 2019 and 2020 (March-September);
- prepare 2019 GSAA data for training of the ML models;
- train and evaluate the ML models on 2019 GSAA and Sentinel-2 imagery;
- predict and post-process agricultural parcel boundaries on Sentinel-2 2020 imagery.
Let’s dive-in into more detail on each of these steps and share some of the things we learnt.
Downloading a year’s worth of countrywide data in half an hour
Recently, the batch processor API has been created to efficiently deal with the download and processing of images over large areas. The improvement in costs and speed over using manual splitting and downloading of the data is significant, as show-cased in this blog-post. The speed and convenience of batch API was very apparent when we succeeded to download half a year of data for the whole country in less than half an hour — a process that would have previously taken close to a day — all with significantly less costs.
Sentinel-2 L1C acquisitions for the 10m bands (i.e. B02, B03, B04 and B08 which correspond to B-G-R-NIR spectral bands) and cloud probabilities CLP for the periods March 2019-September 2019 and March 2020-September 2020 were downloaded, and the resulting .tiff tiles were converted to EOPatches using eo-learn tasks. In a couple of hours we had the Sentinel-2 images we needed to start the processing.
Add reference data
The declared GSAA vector data for 2019 and 2020 were provided by the Lithuanian paying agency. These data are a collection of agricultural parcels declared by farmers and monitored by the paying agency. For a given year, the majority of the GSAA parcels correctly match the agricultural land cover, with the exception of a minority of incorrect parcels resulting from outdated information or incorrect applications. The aim of the automatic parcel delineation is to find and correct incorrect occurrences, as early in the application process as possible. The GSAA vectors for 2019 were used for training of the ML model, while the 2020 GSAA vectors were used for evaluation only.
The vectors were added to each existing EOPatch by importing the shapefile into a local PostgreSQL database with PostGIS extension and by creating a spatial index. An EOTask was then created to query the database to retrieve the geometries intersecting the EOPatch bounding box. This process allows very fast and parallelisable processing. The vectors in each EOPatch were then rasterised into an extent, boundary and distance images (i.e. distance transform of the extent mask).
The Sentinel-2 images were stored in ~800 1000 × 1000 EOPatches covering the entire country. Given that working directly on such large files is unpractical, sampling smaller patchlets from the original EOPatches was required for further steps. Using sampling, a constraint can be made on the minimal area covered by the reference GSAA in each sample, allowing to control the amount of positive and negative examples the model sees. For these reasons, samples of size 256 x 256 were extracted, making the whole process more memory friendly and easier to work with.
An EOTask was created to execute sampling of the EOPatches. For each EOPatch, a maximum of 10 patchlets of 256x256 size are sampled such that at least 40% of the area is covered by GSAA and the Sentinel-2 image has a cloud coverage less than 5% and no invalid data. The cloud coverage is determined by smoothing and masking the CLP layer.
Patchlets to .npz files
Because of memory limitations and the speed of loading the data during training, the patchlets need to be further split into smaller chunks of manageable size (~2GB) that can be efficiently loaded into the available RAM during model training. Here we rely heavily on the implementation in our open source eo-flow tool that provides various utilities for combining TensorFlow and geospatial data. Each .npz file contains keys for features, labels, timestamps and the name of the patchlet from which it was taken. All in all, ~40 .npz files with appropriate keys were created. Each timestamp of the sampled observation is it’s own row in one of the the .npz files. To make further processing of the .npz files easier, we created a pandas dataframe with all the relevant information about the .npz files and their content at each index.
Intermezzo: Single-scene or multi-temporal?
Reviewing the state-of-the-art in semantic segmentation of temporal images, two main approaches can be considered:
- apply semantic segmentation on each single-scene separately, and combine predictions temporally at a later stage. A model trained this way should learn to be invariant to the time-period of interest;
- apply semantic segmentation to a temporal stack of images, letting the model extract relevant spatio-temporal features for the task at hand. This approach tends to generate larger and slower models, as the input images contain temporal as well as spatial information (and spectral of course), but implicitly takes into account temporal dependencies.
The aim of the parcel delineation in CAP practices is generally to monitor agricultural land cover throughout the growing season, but the beginning of the season is of particular interest as is typically the time the farmers fill in their applications. A model that can generalise to different time periods seemed therefore useful in this perspective, and that justifies our choice of training a single-scene model and combining temporally the predictions in a second stage. The recent and excellent paper by Waldner et al. (Deep learning on edge: extracting field boundaries from satellite images with a convolutional neural network) represents the state-of-the-art for this approach, and is what we aimed for.
As we want our model to perform well over timestamps taken over the whole year, it’s very important how the data is normalized. For each individual patchlet timestamp, statistics that could be used for normalization are calculated for each band (B01, B02, B03, B04).
As well as computing overall statistics, we evaluated these temporally to establish whether a single or multiple normalization factors for the time period of interest were needed. We looked at mean, standard deviation and quantiles for each month. The data showed quite significant time-dependence, which lead us to normalise our data per-month. We decided to normalise our data with respect to the mean and standard deviation values (a common choice in DL applications), but later realised this is not optimal, when we computed the same statistics for the 2020 images, and saw significant differences in the mean values between 2019 and 2020 data. This was due to very large outlier values in the 2019 data, which negatively affected the mean values (see caption of figure above). Normalising w.r.t. median values is, in this case, a better choice.
Train / Validation / Test split
For a reliable estimate of the model performance and its generalisation, choosing appropriate train, test and validation sets is paramount. Since small patchlets can be sampled from the same EOPatches and can overlap, split of the data has to be performed at the EOPatch level. Furthermore, all timestamps of a patchlet need to belong to the same split. Following these criteria, we split the 2019 data into a test dataset (i.e. 11 3x3 EOPatch regions spread across Lithuania), a validation dataset (i.e. approximately 10% of the EOPatches not contained in the test set) and a training set (i.e. all the remaining data).
Model training / evaluation
Different models were tested, building on top of a vanilla U-net architecture, with each version adding components building up to the architecture proposed in Waldner et al.. The versions implemented and validated are as follows:
- V1: Vanilla U-net implementation with 3 independent multi-task output.
- V2: The conditioned multi-task output was implemented, and
MaxPoolinglayers replaced with
stride=2. From this version on, the
TanimotoLosswas used as in the paper.
- V3: Residual convolutional blocks with dilation were added to V2. Residual layers should improve gradient flow in very deep networks (since the input features are passed directly to the output), while dilations effectively increase the receptive field of convolutions.
- V4: The
PyramidPoolingModulelayer was added to V3. The aim of this layer is to provide global context, since it combines features pooled at very different scales, e.g. 1,2,4,8 resulting feature size.
- V5: Corresponding to a deeper version of V4, with same convolutions reported in original paper (4 parallel residual layers, and skip connections between first and last layer).
Different components (e.g. residual convolutions, pyramid pooling) were implemented and tested separately to understand their influence on the results. The components and the final architecture can be found in our open-source repository for deep learning eo-flow. All models were trained for 30 epochs using the Adam optimiser with a fixed learning rate.
The trained models were optimised on the cross-validation dataset, and the following metrics were recorded:
- Tanimoto loss (introduced in the ResUNet-a paper) for extent, boundary and distance transform (and sum of the three). These are used during training to update the values of the parameters. The lower the better.
- Accuracy for extent, boundary and distance transform. The accuracy measures the ratio of True Positives (TP) and True Negatives (TN) over all pixels. This metric alone is misleading when classes are imbalanced (as in our case, where 0s and 1s are very imbalanced in both extent and boundary masks). Between 0 and 1, the higher the better.
- Matthew’s correlation coefficient (MCC) for extent, boundary and distance transform. This metric provides a better performance estimate in case of imbalanced classes, as it explicitly considers also False Positives (FP) and False Negatives (FN) in its calculation. Ranges between -1 and 1, the higher the better;
- Intersection over Union for extent, boundary and distance transform. This metric measures the degree of overlap between predicted and ground-truth objects. It ranges between 0 and 1, the higher the better.
In general, V5 showed the best overall performance for losses, accuracies and MCC values for extent and boundary masks.
Evaluation results on 2019 test dataset
║ Model ║ Tanimoto loss ║ Extent MCC ║ Boundary MCC ║
║ V1 ║ 1.9692 ║ 0.7055 ║ 0.4770 ║
║ V2 ║ 1.8253 ║ 0.7433 ║ 0.5315 ║
║ V3 ║ 1.8399 ║ 0.7273 ║ 0.5352 ║
║ V4 ║ 1.8066 ║ 0.7518 ║ 0.5427 ║
║ V5 ║ 1.7786 ║ 0.7631 ║ 0.5716 ║ ╚═══════════════╩═══════════════╩═══════════════╩═══════════════╝
For the boundary mask in particular, V3, V4 and V5 produced better results than V2, while for the extent mask and distance transform the results were somewhat comparable. The largest improvement was gained already in V2, with the introduction of the conditioned output, which seems to have improved largely the prediction of the boundary mask and distance transform. It appears that residual layers introduce over-fitting to the training dataset (i.e. scores on training dataset are larger than on the independent cross-validation set) compared to V2.
Generalisation to 2020
Once we developed and evaluated the models on 2019 data, we wanted to investigate how the same model would generalise to 2020 data, and whether fine-tuning to the new year would be necessary. After realising the differences in bands statistics, and applying the correct normalisation factors, we run the trained models on the normalised images, and obtained the following scores.
Evaluation results on 2020 test dataset
║ Model ║ Tanimoto loss ║ Extent MCC ║ Boundary MCC ║
║ V1 ║ 1.9492 ║ 0.7056 ║ 0.5011 ║
║ V2 ║ 1.8013 ║ 0.7461 ║ 0.5553 ║
║ V3 ║ 1.8439 ║ 0.7101 ║ 0.5536 ║
║ V4 ║ 1.7855 ║ 0.7526 ║ 0.5651 ║
║ V5 ║ 1.7494 ║ 0.7654 ║ 0.6009 ║ ╚═══════════════╩═══════════════╩═══════════════╩═══════════════╝
Looking in particular at the extent scores, the model generalises nicely to 2020 data without the need for fine-tuning or re-training. This means that the distribution of DN values across years is very similar, as is the appearance and delineation style of reference GSAA data. Given these results, we applied the 2019 model on all the EOPatches with 2020 data.
Postprocessing (merging / vectorization)
The postprocessing of the single-scene model predictions (i.e. output of the final
softmax layer or pseudo-probabilities) is split into two main parts:
- Temporal merging of predictions
- Vectorization of predictions and merging of vectors across EOPatches
After applying the model on each available Sentinel-2 image, temporal aggregation of the predictions was necessary. In order to do so effectively, we investigated how the quality of the results changed as a function of the number of single-scene images that are temporally aggregated, and the percentile used for the aggregation (for both the boundary and extent pseudo-probabilities). We ran the MCC scores for 8 pairs of percentiles and looked at how the mean and standard deviation scores changed depending on the number of single-scene images merged together. From this analysis we could observe that good ol’ median seems to work best, and that merging predictions on multiple single-scenes does have a beneficial effect on performance, even though slightly.
From the step above, given a time interval, we can aggregate predictions and obtain a single pseudo-probability image for extent, boundary and distance. We now combine these and obtain a vector layer for the entire country. To obtain smoother vectors, the pseudo-probabilites are combined into a single image as
1 + pseudo_prob_extent - pseudo_prob_boundary (we didn’t use the distance masks in this iteration). This resulting image has continuous values in the (0, 2) range, and can be treated as a level set functional that can be sectioned to obtain nice and smooth contours. To obtain contours from the raster image, we used the GDAL contour utility. Different thresholds in regards to the -i parameter in the GDAL contour utility were evaluated and the one giving better overlaps with the GSAA vectors was chosen.
Another very useful GDAL feature that we used is the Virtual format, which allowed us to build a virtual raster containing the merged functionals of all EOPatches, so that predictions from neighbouring EOPatches could be blent into a consensus functional. The contouring could then be run in smaller, parallel, and overlapping areas, generating vector shapes that are matching over the overlapping area. To obtain a single vector layer, the overlapping geometries were merged performing a geometrical union.
If you have endured through the length of the post, you deserve to have a look at the results! We’ve put together a simple UI that allows you to interact with the results obtained for June 2020 for the entire Lithuania. We derived predictions for different phases of the growing season, and found the best agreement with GSAA data for May/June. However, the information derived from the other months can be used to pinpoint entries of the GSAA dataset that would require manual checks from the paying agency. An example of such application is shown in the following figure, where the overlap between the predicted parcels and the declarations is color coded (red/green) to facilitate pin-pointing of possible incorrect declarations.
Once we set-up the entire workflow, we were able to execute it for 2020 in less than a couple of days.
To explore the results obtained for the entire Lithuania for June 2020, visit parcelio.sentinel-hub.com and enjoy!
Scrolling through the results, you will notice few peculiar false positive delineations over water, wetland and similar land cover. We believe these issues are due to a lack of negative examples seen by the model at training time, as patchlets were specifically chosen to have at least 40% of GSAA reference data. Better dataset preparation should mitigate these issues. Another interesting feature to look out for are greenhouses, often declared in GSAA dataset, but with clear visual differences from the other parcels.
If you have found other interesting artefacts, if you want to tell us how good/bad the predicted parcels look or if you are interested in more details about our parcel delineation, get in touch at firstname.lastname@example.org.
Try it out yourself!
We are trying to be rather open about what we do, and how we do it. So we have published a minimal set of notebooks that should get you going. Give it a try, if you’d like to see if the approach works in your area of interest and with your ground truth data.
This post is one of the series of blogs related to our work in Area Monitoring. We have decided to openly share our knowledge on this subject as we believe that discussion and comparison of approaches are required among all the groups involved in it. We would welcome any kind of feedback, ideas and lessons learned. For those willing to do it publicly, we are happy to host them at this place.
- High-Level Concept
- Data Handling
- Outlier detection
- Similarity Score
- Bare Soil Marker
- Mowing Marker
- Crop Type Marker
- Homogeneity Marker
- Parcel Boundary Detection (this post)
- Land Cover Classification (still to come)
- Minimum Agriculture Activity (still to come)
- Combining the Markers into Decisions
- Traffic Light System
- Expert Judgement Application