The value of super resolution — real world use case
Parcel boundary detection with super-resolved satellite imagery
There is a longstanding debate within the Earth Observation community (and beyond) about the value and real-life applicability of super resolution (SR) imagery. Some see it as a gimmick unsuitable for any application that goes beyond visualization, while others view it as an essential component in bridging the gap between the cost of very high resolution imagery and the availability of medium resolution imagery such as Sentinel-2. One thing is clear: since satellite imagery only has value when someone is using it, SR imagery should likewise be assessed on a particular use case.
One example that should play to the strengths of super resolution imagery is parcel boundary delineation. To see what (if anything) can be gained by using SR products for parcel boundary delineation, we evaluated the difference between automatic parcel delineation on super-resolved (to one meter) and Sentinel-2 imagery by utilizing a manually-annotated ground truth dataset.
For those not willing to read to the very end, the answer is “not that much, if at all”, at least for this specific example, this algorithm, and this specific SR data. There seem to be improvements in the quality of the detected boundaries, but the “seem” part is problematic, as there is no way to know to what extent the results can be trusted. The inaccuracies of ML algorithms are a well-known issue, and we still make use of their results. However, in most cases there is a way, albeit a laborious one, to validate the results against available data. This is not the case here.
Our team has been working with Sentinel data for many years and we’ve often wished it had better resolution. We’ve tried a few approaches to improve it ourselves, cooperated with others and took part in related challenges. But every time, after trying to systematically measure the benefits, we were left empty-handed. So it might be that we are a bit biased. We are firm believers that there needs to be a scientifically demonstrated correlation between the data and the remote sensing results, even when using machine learning technologies. It seems we are not alone; our colleagues from CESBIO, for example, share this view.
Background on super resolution
A satellite sensor, together with the post-processing steps, provides data at a specific resolution. For Sentinel-2 this is limited to 10 meters per pixel. There are many approaches to improving this, among the most popular being:
- Using a multi-temporal stack of data and converting it into a single image with better resolution. The principle behind the idea is that the pixels are not exactly aligned, so it is possible to extract this extra juice and get a more detailed image. The main downside of this principle is that one loses the high-cadence monitoring component of the Sentinel mission, which is highly valuable. Furthermore, since a multi-temporal stack is used when predicting, the model is less sensitive to changes: a change might need to be visible on multiple images before it is predicted.
- Training an ML model to add bits and pieces based on how things should look. In some cases such methods use only one sensor, in others more, e.g. a combination of VHR and Sentinel-2 imagery. There are two problems with this approach. Our world is large and very complex, and training a computer to know it all is challenging. Even more problematic, one of the main objectives of today’s remote sensing is to identify things that are new or different, e.g. demolished buildings or an abandoned agricultural field full of overgrowth. These will rarely be picked up by a computer trained on the “normal”.
- Then there is a well-tried and thus somewhat boring approach: interpolation. With bicubic interpolation we simply fill in the missing details using polynomial functions. The result often looks very nice, but we should not fool ourselves into believing the new details are real. There are not many discontinuities in our world, so the interpolated information will mostly be right. Except when it is not.
In short, where there is no detailed information available from the sensor, it is challenging to extract it.
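To make the interpolation baseline concrete, here is a minimal sketch of bicubic (order-3 spline) upsampling with SciPy. The array is synthetic and the zoom factor of 10 mimics the 10 m to 1 m step; this is an illustration of the technique, not the pipeline used for the SR product discussed below.

```python
# Minimal sketch of the interpolation baseline: bicubic (order-3 spline)
# upsampling of a synthetic "10 m" raster by a factor of 10. No new
# information is created; values between samples are polynomial guesses.
import numpy as np
from scipy.ndimage import zoom

coarse = np.arange(16, dtype=float).reshape(4, 4)  # pretend 10 m pixels
fine = zoom(coarse, 10, order=3)                   # "super-resolve" to 1 m
print(coarse.shape, fine.shape)  # (4, 4) (40, 40)
```

The output has 100 times as many pixels, but every new value is derived from the same 16 measurements.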
Super resolved data for this exercise
We did not produce the super-resolved data ourselves this time. We received it as part of the NIVA project from the National Paying Agency of Lithuania. The one-metre “resolution” data was produced by DigiFarm over an area of interest in Lithuania for the second part of March 2022. We are not familiar with the principles behind the production of these deep-resolved images, but it seems to use a generative model conditioned on VHR imagery (maybe Mapbox/ESRI?), in combination with a Sentinel-2 L2A cloudless mosaic (the second bullet above).
It certainly looks good.
For a fair assessment of the effect that the super resolution product has on the quality of the field delineation results, two models were fine-tuned. Both started from the field-delineation model developed during the NIVA project and originally trained on 2019 Lithuanian GSAA parcels (see the blog post linked below for more details). One model was fine-tuned using super-resolved imagery, while the other was fine-tuned using the corresponding Sentinel-2 imagery over the same areas. The reference used for training was the 2021 GSAA provided by the NPA.
Parcel boundary detection for CAP
Trying to teach machines how to cluster agricultural pixels according to spatial, spectral and temporal properties.
Fine-tuning accomplishes two things: for Sentinel-2, it adapts the model to imagery from the relevant time period; for super resolution, it adapts the model to the features peculiar to the super-resolved imagery. This allows for a fair comparison between the results obtained with the two.
It is worth noting at this point that increasing the resolution by a factor of 10 does come with a significant increase in computational complexity, which results in increased processing costs and time due to 10² = 100 times more data.
Qualitative analysis of the super-resolved imagery
It’s very hard (if not impossible) to validate the super resolution imagery using Sentinel-2 alone. The use of a VHR layer seems to generate features that are not discernible in Sentinel-2 imagery, making validation impossible without a reference (real) VHR image from the same period. The spatial resolution does appear to be improved, but at the cost of contextual clarity: it is not clear whether the data comes from recent imagery (Sentinel-2) or from historic imagery without a specific time frame. An example of this can be seen below. The arrows on the images point to boundaries that cannot be seen on Sentinel-2 imagery but are visible on both the super-resolved and Mapbox imagery. The question remains: was the boundary there when the Sentinel-2 image was taken or not?
Luckily, we do have access to VHR data, kindly provided by Airbus, which allowed us to compare the SR imagery with Pleiades imagery taken within a few days of the Sentinel-2 acquisition: the ultimate test of whether the generated imagery represents reality or not.
In the majority of cases the SR imagery aligns well with what is seen on the Pleiades imagery, either a big nod to the SR process or simply a sign that the world is not changing that fast, so old(er) VHR data will often still be relevant. There are, however, (many) instances where this is not the case and the SR imagery differs from the underlying ground truth. These can be seen as missing or fictional boundaries on the super-resolved imagery.
This may be acceptable in certain use cases, but such occurrences make it hard to trust the imagery. One of the points of contention is that the imagery is conditioned on basemaps of unknown date. This begs the question of whether skipping the super resolution step and performing the delineation directly on orthophoto imagery with a known date of origin would be better. In that case, by knowing where the information is missing or outdated, we could come up with mitigation measures and alternative approaches to handle the lack of information.
Another noticeable feature of the SR imagery is the exaggerated textures, especially problematic for the boundary delineation use case as they lead to confusion between actual boundaries and side effects of the imagery generation process. This is reflected in the lower confidence scores of the model, as it is harder to separate the signal from the noise. We also observe artifacts (horizontal or vertical lines) that are probably a consequence of the image stitching process.
Quantitative analysis of the field delineation results
Assessing the results only visually can lead to wrong conclusions, so it is important to compare the results quantitatively and get a clearer picture of the strengths and shortcomings of both approaches.
Usually no ground truth data is available for such an analysis. This was also the case this time, so we decided to create our own and manually validate labels across sample areas in Lithuania. The GSAA polygons across these areas were manually reviewed and redrawn to reflect the state on the ground in the second part of March 2022 (note that we used Sentinel-2 for verification, which might skew the ground truth a bit, but the GSAA polygons are generally quite accurate).
Around 9000 polygons are present in these areas and were used for computing the metrics. To ensure the fairness of the validation, these areas were excluded from the dataset used for fine-tuning the models.
Having reliable ground truth allows us to compute and compare different metrics to quantify the differences between the two results.
Intersection with ground truth
The first thing we are interested in is the distribution of “how many predictions intersect each ground truth polygon”. We want this number to be as close to 1 as possible, which would mean a perfect delineation.
Predictions on Sentinel-2 perform slightly better on this metric, indicating somewhat better delineation power for the Sentinel-2 imagery.
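This count can be sketched per reference polygon with shapely. The rectangles below are toy examples, not real parcel data:

```python
# Sketch of the first metric: the number of predicted polygons that
# intersect each ground-truth parcel (1 would mean a perfect delineation).
# The rectangles are toy examples, not real parcel data.
from shapely.geometry import box

ground_truth = [box(0, 0, 10, 10), box(20, 0, 30, 10)]
predictions = [box(0, 0, 5, 10), box(5, 0, 10, 10), box(19, 1, 31, 9)]

def intersection_count(gt_polygon, predictions):
    """Count predictions with a non-degenerate overlap with the parcel."""
    return sum(1 for p in predictions if p.intersection(gt_polygon).area > 0)

counts = [intersection_count(gt, predictions) for gt in ground_truth]
print(counts)  # [2, 1]: the first parcel is split in two, the second matches one prediction
```

For real datasets with thousands of polygons, a spatial index (e.g. shapely's STRtree) would avoid the quadratic pairwise check.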
Next, we compute the oversegmentation and undersegmentation geometric metrics based on the paper A novel protocol for accuracy assessment in classification of very high resolution images by C. Persello and L. Bruzzone.
Two things to keep in mind about these metrics:
- False positive fields are not taken into account (i.e. if something is predicted but doesn’t intersect any ground truth polygon, it is ignored).
- Only the prediction with the largest intersection area is taken into account.
Undersegmentation is defined as:
1 - intersection_area / prediction_area
The metric is 0 if the predicted polygon is fully enclosed inside the reference polygon. It goes towards 1 when the area of the predicted polygon is larger than the reference (i.e. the polygon was not delineated enough).
Sentinel-2 performs better when looking at undersegmentation. This means that the predictions are better enclosed inside the reference polygons (and might in practice mean that the predictions are often smaller than the reference).
It indicates that the SR results are less delineated (i.e. the final output contains fewer fields). This somewhat contradicts intuition (better resolution should uncover more/smaller parcels), as well as the fact that the total number of polygons inside the labeled areas is significantly higher for the SR results (12k (S2) vs. 27k (SR)). Looking at the data, however, we see many more SR predictions over non-agricultural areas (remember that false positives are not taken into account). The relevant number is the total number of predicted polygons that have an underlying ground truth polygon, and that number is slightly higher for S-2.
Oversegmentation is defined as:
1 - intersection_area / ref_area
The metric is 0 if the reference polygon is fully covered by the prediction. It goes towards 1 if the area of the predicted polygon is smaller than the reference (i.e. the polygon was delineated too much).
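Keeping only the prediction with the largest intersection, as stated above, both metrics can be sketched as follows. The geometries are toy examples:

```python
# Sketch of the undersegmentation and oversegmentation metrics as defined
# above: for each reference polygon only the prediction with the largest
# intersection is considered; false-positive predictions never enter.
from shapely.geometry import box

def segmentation_errors(reference, predictions):
    """Return (undersegmentation, oversegmentation) for one reference
    polygon, or None if no prediction intersects it."""
    best = max(predictions, key=lambda p: p.intersection(reference).area)
    inter = best.intersection(reference).area
    if inter == 0:
        return None
    under = 1 - inter / best.area       # how much the prediction spills outside the reference
    over = 1 - inter / reference.area   # how much of the reference is left uncovered
    return under, over

reference = box(0, 0, 10, 10)        # 100-unit parcel
predictions = [box(2, 0, 14, 10)]    # shifted prediction, partly outside
under, over = segmentation_errors(reference, predictions)
print(round(under, 2), round(over, 2))  # 0.33 0.2
```

A prediction fully inside the reference would score 0 on undersegmentation, and one fully covering the reference would score 0 on oversegmentation, matching the definitions above.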
For oversegmentation, the situation is clearer. The SR results are noticeably better than the Sentinel-2 results. A lot of this seems to be simply due to better spatial resolution, i.e. the predictions are better aligned with the reference.
Does this depend on polygon size? We can look at the histogram below, which displays the relationship between the size of the reference polygon and the difference in oversegmentation score between the Sentinel-2 and SR predictions. It indicates two things: first, the SR results have a better oversegmentation score regardless of size; second, the advantage of SR is bigger for smaller polygons.
For each reference polygon we are interested in some core information:
- What is the relative area of the reference polygon (FOI) covered by the prediction?
- What is the relative area of the predicted polygons covered by that reference polygon?
- How many different predicted polygons does the reference intersect?
Plotting these parameters as a scatter plot helps us define 4 distinct quadrants that group FOIs with similar behavior together.
Quadrant 1 (Q1):
These are the reference polygons that are well covered by the predictions. If the number of intersecting polygons is 1, the parcel is approximately the same; if it is higher than 1, the parcel has been split. On the scatterplot the number of intersections is shown as the color of the dot.
Quadrant 2 (Q2):
If the number of intersecting polygons is 1, it means that the predicted polygon is larger than the reference and that the reference polygon is mostly contained in the predicted one. If the number of intersecting polygons is higher than 1, it indicates that we have multiple predictions that cover the reference polygon.
Quadrant 3 (Q3):
If the number of intersecting polygons is 1, it means that the predicted polygon is smaller than the reference, with the predicted polygon mostly contained in the reference one. If the number of intersecting polygons is larger than 1, it means that the reference polygon has been delineated into smaller parts which do not fully cover it.
Quadrant 4 (Q4):
Denotes polygons with poor overlap, where a large part of the reference polygon is not covered by the predictions and vice versa.
We can compare the field delineation results based on the proportion of parcels in each of the quadrants.
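The three quantities and the quadrant assignment could be sketched as follows. The 0.5 cut-off and the helper names are our assumptions for illustration; the post does not state the exact thresholds used:

```python
# Sketch of the quadrant assignment for one reference polygon. The 0.5
# threshold is an assumption; the geometries are toy examples.
from shapely.geometry import box
from shapely.ops import unary_union

def quadrant(reference, predictions, threshold=0.5):
    hits = [p for p in predictions if p.intersection(reference).area > 0]
    if not hits:
        return None, 0
    merged = unary_union(hits)
    inter = merged.intersection(reference).area
    ref_covered = inter / reference.area   # share of reference under predictions
    pred_covered = inter / merged.area     # share of predictions inside reference
    if ref_covered >= threshold and pred_covered >= threshold:
        q = "Q1"  # good mutual overlap
    elif ref_covered >= threshold:
        q = "Q2"  # predictions much larger than the reference
    elif pred_covered >= threshold:
        q = "Q3"  # predictions smaller, enclosed in the reference
    else:
        q = "Q4"  # poor overlap both ways
    return q, len(hits)

ref = box(0, 0, 10, 10)
print(quadrant(ref, [box(-5, -5, 15, 15)]))  # ('Q2', 1): one oversized prediction
```

The returned count distinguishes the single-intersection and split/merged cases discussed for each quadrant above.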
The biggest difference we observe is the larger proportion of polygons in Q2 for the super resolution results compared to the Sentinel-2 results. The predictions are bigger than the reference, indicating that the super-resolved fields are not delineated enough. The decrease in the other quadrants is proportional.
We can further split the quadrants based on the number of intersections:
The biggest jump in the second quadrant seems to come from reference polygons with only one intersection. For easier understanding, an example of such a polygon is shown below.
This shows the effect that the decreased model confidence has on the final results.
Intersection over union
Finally, we can compare the two predictions using the intersection over union metric. We calculate it as follows:
For each ground truth polygon:
union: find all the predictions intersecting the polygon, take their union together with the ground truth polygon, and calculate the total area.
intersection: Calculate the area of ground truth polygon that is covered by the predictions.
iou = intersection / union
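The recipe above translates directly to code; the geometries below are toy examples:

```python
# Sketch of the per-ground-truth IoU described above: the union is taken
# over the ground truth and all intersecting predictions, the intersection
# is the part of the ground truth covered by predictions.
from shapely.geometry import box
from shapely.ops import unary_union

def ground_truth_iou(gt, predictions):
    hits = [p for p in predictions if p.intersects(gt)]
    if not hits:
        return 0.0  # reference polygon without any prediction
    merged = unary_union(hits)
    union_area = unary_union([gt, merged]).area
    intersection_area = merged.intersection(gt).area
    return intersection_area / union_area

gt = box(0, 0, 10, 10)
preds = [box(0, 0, 10, 5), box(0, 5, 10, 12)]  # together they overshoot the parcel
print(round(ground_truth_iou(gt, preds), 3))  # 0.833
```

A score of 0 corresponds to a reference polygon without any prediction, which is exactly the peak discussed next.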
First, we notice a higher peak at 0 for the S-2 data (reference polygons without any prediction), a design limit of our delineation ML model. These are mostly cases like the one shown in the example below (marked in green): smaller polygons which the model has learned to ignore due to their proximity to built-up areas, probably because green surfaces inside cities (e.g. parks) are not included in the training dataset.
We also get reference polygons without predictions in the super-resolved results, but for a different reason. Because the super resolution model predicts fields over everything (it has not learned to distinguish between arable and non-arable land the way the Sentinel-2 model did), we get predictions over all land types. This results in very large polygons over non-arable land (such as forests), which can cause issues in the post-processing steps.
What we also notice in the histogram is that the IOU scores for the polygons that match best (IOU close to 1) are higher for the SR results. This is aligned with the hypothesis that the increased spatial resolution results in better alignment with the shape of the reference.
For each of the reference polygons, we calculate the intersection over union both with the predictions made on Sentinel-2 imagery and with the predictions made on super-resolved imagery, then compute the difference between the two scores. A positive difference indicates that the prediction on Sentinel-2 is better; a negative one means that the prediction on super-resolved imagery is better.
We can then compare the overall scores by plotting the histograms of the differences.
On the left histogram, there is a peak around zero indicating that for the majority of reference polygons the IOU scores for both SR and Sentinel-2 results are similar. The cumulative histogram (right) shows that there is no clear advantage for predictions on either Sentinel-2 or on super-resolved. The value at zero is almost exactly 0.5 indicating that half of the polygons have a positive difference (score on Sentinel-2 is better) and half have a negative difference (score on super-resolved is better).
One last analysis we can do is to look at the difference in relation to parcel size. It does seem that for smaller polygons the SR results are better (negative score), and this changes as the parcel size increases.
Using super-resolved imagery for field delineation does improve the spatial alignment of the boundaries of predicted fields with the boundaries of the reference data, but it also decreases the interpretability of the results and makes it harder to know whether they reflect what was actually on the ground during the selected time period.
Having no visibility into the source of a specific feature (Sentinel-2 or VHR), it would probably be better to make one delineation using VHR/orthophoto alone and another using the Sentinel-2 stack, and then compare the two at the vector level, i.e. fit the (up-to-date but less accurate) Sentinel-2 vectors to the (outdated but detailed) VHR/orthophoto ones using simple business rules that take into account the differences and the “age” of the VHR data. Such a process would produce (at least) the same level of spatial accuracy/detail, but would also have a full audit trail, so one could evaluate individual features. It should also be investigated whether the improved spatial alignment of the results could be achieved by adapting the model architecture to do on-the-fly upscaling, using only Sentinel-2 imagery and reference vectors rasterized to a higher resolution, to try and get the best of both worlds.
The exaggerated textures of the SR imagery make it harder for the model to separate the signal from the noise. The outcome is that the predicted fields are less delineated when using super-resolved imagery compared to Sentinel-2. These negative effects could be due to the nature of the SR data, but they could also be due to insufficient fine-tuning of our field delineation model on this imagery. Even if the latter is the case, the additional computational complexity of working with data at one meter has to be considered.
It might also be that our field delineation model is simply not good enough for the SR data. The company producing the SR imagery offers parcel boundaries as well; it might be worthwhile to add their polygons to the mix to check whether it changes the assessment. For next time…
Big thanks to DigiFarm and the National Paying Agency of Lithuania for providing the data that allowed us to do this experiment.
Thanks for reading. Get in touch with us at email@example.com with any questions or comments about our analysis.
This post is one in a series of blogs related to our work on Area Monitoring. We have decided to openly share our knowledge on this subject as we believe that discussion and comparison of approaches are required among all the groups involved. We welcome any kind of feedback, ideas and lessons learned. For those willing to share them publicly, we are happy to host them here.
- High-Level Concept
- Data Handling
- Outlier detection
- Identifying built-up areas
- Similarity Score
- Bare Soil Marker
- Mowing Marker
- Pixel-level Mowing Marker
- Crop Type Marker
- Homogeneity Marker
- Parcel Boundary Detection
- Land Cover Classification (still to come)
- Minimum Agriculture Activity (still to come)
- Combining the Markers into Decisions
- The Challenge of Small Parcels
- The value of super resolution — real world use case (this post)
- Traffic Light System
- Expert Judgement Application
- Agricultural Activity Throughout the Year