Multi-temporal Super-Resolution on Sentinel-2 Imagery

Deep dive into enhancing the spatial resolution using deep learning, including tips, tricks and technical details.

EO Research
Sentinel Hub Blog
14 min read · May 18, 2021


Written by Nejc Vesel, Jernej Puc, Matej Batič and Devis Peressutti

Sinergise is a partner in the Dione project, one of whose missions is to use novel techniques to improve the capabilities of satellite technology while integrating various data sources, such as very high resolution imagery, for example to enable monitoring of smaller agricultural parcels through the use of super-resolution models. The idea is to combine the advantages of Sentinel imagery (price, temporal resolution) with the increased resolution of commercially available very high resolution imagery. Complementing our work on Area monitoring, this blog post summarizes our efforts in this direction and the lessons we have learned along the way.

Intro

Gaining insights from Sentinel-2 imagery mostly works out great for us. The temporal resolution allows us to focus on the time-series aspect of the data, while its spatial resolution is sufficient for the majority of use-cases. Sometimes, though, 10 meters just isn’t enough. One can resort to commercially available very high resolution (VHR) imagery, but monitoring large areas over longer time intervals with such imagery gets expensive rather fast, if it is feasible at all (the coverage capacity of VHR sensors is, for obvious reasons, much lower than at medium resolution).

This is where super-resolution (SR) techniques can help. Although VHR imagery is still needed to train the model, the amount required is significantly reduced. While the resulting images are not as good as native high resolution images, they can still provide an approximation that can bridge the gap between cost and quality.

In this exercise we tried to super-resolve Sentinel-2 bands from their native 10 m to a 2.5 m ground sample distance (GSD), i.e. a 4x resolution enhancement.

Data preprocessing

In our experience, the best way to improve the performance of the model is to focus on the underlying data that is used to train it. Oftentimes, tweaks in model architecture have a marginal effect on the real-world performance, while working on the data itself tends to improve all models, regardless of the architecture and hyper-parameters used. A lot of our efforts thus focused on transforming the data into a suitable format and making it as “clean” as possible.

Download

As part of the Dione project, we obtained approximately 50 Deimos-2 tiles covering 12,000 square kilometres across 4 different timestamps (July to October) in areas over Lithuania and Cyprus.

For ease of access (and not having to worry about things like interpolation, stitching, reprojection, etc.) we have ingested the imagery to Sentinel Hub as Bring Your Own Data.

The areas over which the data is available were divided into overlapping (2.4km × 2.4km) EOPatches containing:

  • Sentinel-2 bands (R, G, B, NIR) at 10m resolution for the period between May and October.
  • Deimos-2 bands (R, G, B, NIR) between July and October (period of availability) pansharpened and resampled at 2.5m spatial resolution.
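To make the tiling concrete, here is a minimal sketch of how an area of interest (in a metric CRS such as UTM) could be split into overlapping 2.4 km × 2.4 km patches. The function name, the 200 m overlap value and the example coordinates are illustrative assumptions, not the exact setup used in our pipeline.

    # Hypothetical sketch: tile an extent into overlapping 2.4 km x 2.4 km patches.
    def overlapping_patches(min_x, min_y, max_x, max_y, size=2400.0, overlap=200.0):
        """Yield (x_min, y_min, x_max, y_max) tuples covering the extent."""
        step = size - overlap
        y = min_y
        while y < max_y:
            x = min_x
            while x < max_x:
                yield (x, y, x + size, y + size)
                x += step
            y += step

    patches = list(overlapping_patches(500000, 6050000, 530000, 6080000))
    print(f"{len(patches)} patches of 2.4 km x 2.4 km")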

Normalization

The Deimos imagery can exhibit significant differences in brightness and colors between tiles. Some of these are inherently due to the underlying terrain, but other factors, such as atmospheric conditions and viewing angles, also contribute to the tiles looking different.

Mosaic of different Deimos tiles showing issues with clouds and different tile normalization.

For training the model, we want to normalize the data to achieve consistency of samples across different Deimos tiles.

The metadata available for each Deimos tile gives us the standard deviation and mean statistics per band. However, in our previous projects, we have often seen that using median instead of mean for normalization helps the model be more robust to outlier values, so we calculate median values per tile.
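A minimal sketch of how such a per-tile, per-band normalization might look, assuming a (value - median) / std scheme; the array shapes, function name and nodata handling are illustrative assumptions.

    import numpy as np

    def normalize_tile(bands, nodata_mask=None):
        """bands: float array of shape (height, width, n_bands)."""
        normalized = np.empty_like(bands, dtype=np.float32)
        for b in range(bands.shape[-1]):
            values = bands[..., b]
            valid = values[~nodata_mask] if nodata_mask is not None else values
            # Median is more robust to outliers (e.g. clouds) than the mean.
            median = np.median(valid)
            std = np.std(valid)
            normalized[..., b] = (values - median) / (std + 1e-8)
        return normalized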

When comparing tile statistics, we found large differences between tiles, even after normalization. We realized that the issue is caused by, who would have guessed, clouds. Tiles where a large percentage of the area is covered by clouds have high mean, median and standard deviation values. This seems natural — clouds are bright and there is also a large contrast between cloudy and non-cloudy areas.

Blow the clouds away

As anyone who works in earth observation knows, clouds are always trying to spoil the party. Normally, we work with Sentinel-2 imagery and in those cases we resort to our s2cloudless algorithm for cloud detection (and removal). In this case, however, we had to get creative. By analysing the band histograms of random samples, we settled on a threshold of 100 on the blue radiance band. We then generated cloud masks and calculated cloud coverage, focusing further analysis on timestamps with a maximum cloud cover of ~0.1.

Cloud filtering on Deimos imagery is done by simple thresholding of the blue band.
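A rough sketch of this screening step, assuming pixels brighter than the threshold on the blue band are flagged as cloud; variable names and the absence of any mask dilation are simplifying assumptions.

    import numpy as np

    BLUE_THRESHOLD = 100     # threshold on the blue radiance band, as stated above
    MAX_CLOUD_COVER = 0.1    # keep timestamps with at most ~10% cloud coverage

    def cloud_coverage(blue_band):
        """blue_band: 2D array of blue radiances for a single timestamp."""
        cloud_mask = blue_band > BLUE_THRESHOLD
        return cloud_mask.mean()

    def cloud_free_timestamps(timestamps, blue_stack):
        """blue_stack: array of shape (time, height, width)."""
        return [ts for ts, blue in zip(timestamps, blue_stack)
                if cloud_coverage(blue) <= MAX_CLOUD_COVER]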

Removing the clouds resulted in significantly smaller differences between the normalization factors of individual tiles. However, this only affects the labels; we still need to normalize the features (Sentinel-2 imagery). Since the Sentinel-2 data are much more consistent across different acquisitions, we calculated cloud-free normalization factors per country.

To better fit the support of the activation functions, the bands should also be rescaled to the range [0, 1]. The 1st and 99th percentiles of each band were calculated and the values stored. This was done for both the labels and the features.
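A short sketch of this percentile-based rescaling: the 1st and 99th percentiles are computed once per band and then reused to clip and scale both features and labels. Function names are illustrative.

    import numpy as np

    def compute_percentiles(bands, low=1, high=99):
        """bands: array of shape (..., n_bands); returns (p_low, p_high) per band."""
        flat = bands.reshape(-1, bands.shape[-1])
        return np.percentile(flat, low, axis=0), np.percentile(flat, high, axis=0)

    def rescale(bands, p_low, p_high):
        """Scale each band to [0, 1] using the stored percentiles."""
        scaled = (bands - p_low) / (p_high - p_low)
        return np.clip(scaled, 0.0, 1.0)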

Training set curation

Even after cloud filtering the data, several problematic samples remain that can negatively impact the training. When looking at our samples, we noticed a large number of the following problems:

  • cloud shadows in Deimos imagery
  • orthorectification on Deimos images can cause some parts of the image to match well with the corresponding Sentinel-2 image and others not so well
  • cloud shadows in Sentinel-2 imagery

Shadow filtering

Cloud shadows that cover ground truth data are quite prominent, especially in Lithuania. They introduce noise that hinders model learning. They also make it harder to evaluate the performance of the model, as the available metrics do not make sense when comparing a cloudless super-resolved image to a cloudy or shaded ground truth label. We tried thresholding and peak detection, but neither proved very good. Since our simple cloud detector for Deimos imagery performs reasonably well, we decided to base shadow detection on the fact that shadows occur near clouds.

To obtain samples containing shadows, we search for EOPatches containing observations with more than 10% cloud coverage, and mark the neighbouring EOPatches’ observations from the same time (indicating they come from the same tile) as “containing shadows”. This approach is geared towards finding as many true positives (samples containing shadows) as possible. The cost is an increased number of false positives, but our dataset is large enough that we can afford this.
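A simplified sketch of this heuristic. Each observation is keyed by (patch_id, timestamp) and a separate mapping gives the spatial neighbours of every patch; both data structures are assumptions about how the bookkeeping could be organised, not the original implementation.

    CLOUD_COVER_TRIGGER = 0.10  # 10% cloud coverage triggers the shadow flag

    def flag_shadow_candidates(cloud_cover, neighbours):
        """cloud_cover: dict {(patch_id, timestamp): fraction in [0, 1]}.
        neighbours: dict {patch_id: list of adjacent patch_ids}."""
        flagged = set()
        for (patch_id, timestamp), cover in cloud_cover.items():
            if cover > CLOUD_COVER_TRIGGER:
                # Shadows tend to occur near clouds: flag the same timestamp
                # in neighbouring patches (they come from the same tile).
                for neighbour in neighbours.get(patch_id, []):
                    flagged.add((neighbour, timestamp))
        return flagged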

We remove approximately 5,000 samples out of 23,000, which is ~25% of all samples. Looking at the results manually, we estimate around 60% true positive rate and only about 2% false positive rate.

Cleaning remaining noise

Noisy samples remained even after cleaning up the shadows. To identify them, we computed the mean square error (MSE), structural similarity (SSIM) and peak signal-to-noise ratio (PSNR) metrics between the Deimos images (labels) and the bicubically upsampled Sentinel-2 images.

By plotting SSIM against PSNR (MSE and PSNR are essentially the same metric on a logarithmic scale), we could see a bivariate distribution with some positive correlation, meaning that higher structural similarities were associated with higher PSNR values.

By plotting SSIM and PSNR values, we can identify problematic samples.

In the scatter plot we can distinguish three clusters of points, particularly for the Cyprus data. The samples in the lower-left cluster contain many clouds, cloud shadows and non-rigid deformations; the middle cluster contains mostly normal-looking samples; the upper-right cluster contains many samples with no structure, only texture (e.g. a sample over a very textured forest area). Identifying these issues allowed us to filter out noisy samples before training by setting thresholds on the SSIM and PSNR values.
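A sketch of this screening step using scikit-image metrics (assumes a recent scikit-image with the channel_axis argument). The threshold values below are placeholders for illustration, not the ones used in the project.

    from skimage.metrics import peak_signal_noise_ratio, structural_similarity

    def sample_scores(deimos, s2_upsampled, data_range=1.0):
        """Both inputs: float arrays of shape (height, width, n_bands) in [0, 1]."""
        ssim = structural_similarity(deimos, s2_upsampled,
                                     channel_axis=-1, data_range=data_range)
        psnr = peak_signal_noise_ratio(deimos, s2_upsampled, data_range=data_range)
        return ssim, psnr

    def keep_sample(ssim, psnr, ssim_range=(0.2, 0.9), psnr_range=(15.0, 40.0)):
        """Keep samples that fall inside the 'normal-looking' cluster."""
        return (ssim_range[0] <= ssim <= ssim_range[1]
                and psnr_range[0] <= psnr <= psnr_range[1])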

Model, finally

Image SR is a tough cookie, as it is an inherently ill-posed problem, which means that multiple “correct” solutions may exist. The main concern was therefore making up unrealistic high-resolution information that does not faithfully represent the underlying ground truth. This is particularly important for the monitoring use case.

In exploring the available options, we chose to focus on multitemporal SR, as opposed to single-image SR, to allow the model to learn co-registration errors between the temporal acquisitions and thus to extract sub-pixel information from the temporal stack of images.

HighRes-Net architecture (source)

We started with the HighRes-Net architecture and code, which we modified to fit our data preparation pipeline. We chose the model because it performed well in ESA’s Kelvin competition for multi-frame super-resolution on Proba-V imagery. It attempts to solve the multi-frame SR (MFSR) problem through an end-to-end approach to the main sub-tasks involved in the process: (i) co-registration, (ii) fusion, (iii) up-sampling, and (iv) registration-at-the-loss.

The architecture can be divided into three main parts: encoding, decoding and registration. In general, the reconstructed super-resolution image will be shifted with respect to the ground truth. To solve this, the ShiftNet architecture (inspired by HomographyNet) is used to estimate the (∆x, ∆y) shift and thus improve the loss. Lanczos resampling is used since it allows for sub-pixel shifts.
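To illustrate the registration-at-the-loss idea, here is a toy sketch: shift the super-resolved image by candidate sub-pixel offsets before comparing it to the ground truth and keep the lowest error. HighRes-Net estimates the offset with ShiftNet and resamples with a Lanczos kernel; the brute-force search and the spline interpolation from scipy below are simpler stand-ins, used purely for illustration.

    import numpy as np
    from scipy.ndimage import shift as subpixel_shift

    def registered_mse(sr_image, ground_truth, max_shift=2.0, step=0.5):
        """Both inputs: 2D arrays; returns the lowest MSE over candidate shifts."""
        offsets = np.arange(-max_shift, max_shift + step, step)
        best = np.inf
        for dy in offsets:
            for dx in offsets:
                shifted = subpixel_shift(sr_image, (dy, dx), order=3, mode="nearest")
                best = min(best, float(np.mean((shifted - ground_truth) ** 2)))
        return best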

An important caveat to note here is that while we frame the problem at hand as a multi-frame SR problem, in reality many of the underlying assumptions in MFSR do not apply in our case, namely:

  • low resolution and high resolution images were acquired using different sensors, i.e. Sentinel-2 and Deimos-2, therefore the underlying acquisition parameters are different, e.g. spectral bands, point-spread functions;
  • the underlying objects being imaged, i.e. land cover, change between consecutive low-resolution frames and high-resolution frames. This is due to the revisit frequency of the Sentinel-2 images, the cloud presence and the overlap with Deimos-2 acquisition times.

All of these peculiarities add to the complexity of the SR task. Read on to see how it was tackled.

Training

We iteratively increased the complexity of the model and the problem at hand to understand its behaviour, starting from super-resolving a single-channel image and then moving to four-channel images. Our ablation studies to characterise the model can be broadly divided into the following:

  • architecture optimisation, e.g. number of Sentinel-2 images, number of encoding layers, number of convolutional filters, pixel-shuffle or deconvolution upscaling layer;
  • band distribution of super-resolved images. We mentioned above that we are dealing with different input and target imaging sensors, i.e. Sentinel-2 and Deimos-2. Typically the output of the model would try to match the distribution of the target image, i.e. Deimos-2, but what we actually want is an image that matches Sentinel-2, so that it could be directly used in existing downstream applications;
  • loss function. This choice is paramount for the success of the SR task, but we soon realised how unsuitable the common pixel-wise losses (e.g. MAE, MSE, SSIM) are to truly quantify the quality of the super-resolved image. Furthermore, these losses can be completely unrelated to the downstream application that will make use of the super-resolved images.

For all our ablation studies we used WANDB to allow model comparison and characterisation. Below we describe the methodology regarding the introduction of perceptual loss and histogram matching, both of which proved beneficial in the ablation studies.

Histogram matching

One way to make the target Deimos-2 imagery look like Sentinel-2 while maintaining the high-frequency components we want to learn is to use histogram matching. This technique is a common pre-processing step in many computer vision tasks and has also been used in super-resolution for remote sensing in Galar et al. In practice, we histogram-match the Deimos ground truth target image to the bicubically upsampled Sentinel-2 image closest in time.

Green band distribution at the output of the model. The super-resolved distribution seeks to match Sentinel-2 as opposed to Deimos-2.

This works well if the two images are very similar to each other; however, artefacts such as clouds, haze, shadows or very bright objects skew the histogram distribution and therefore the matching. To compensate for this imperfect matching and really convince the model to match the Sentinel-2 band distributions, we added a simple loss term that seeks to match the per-band mean values of the super-resolved image and the temporally closest Sentinel-2 image.
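A sketch of both ingredients: histogram-matching the Deimos label to the upsampled Sentinel-2 image (assumes a recent scikit-image), and a simple per-band mean-matching term. The function names and the exact form of the term (squared difference of per-band means) are assumptions for illustration.

    import numpy as np
    from skimage.exposure import match_histograms

    def match_target_to_s2(deimos_label, s2_upsampled):
        """Both inputs: arrays of shape (height, width, n_bands)."""
        return match_histograms(deimos_label, s2_upsampled, channel_axis=-1)

    def mean_matching_term(sr_output, s2_closest):
        """Per-band squared difference of means, averaged over bands."""
        sr_means = sr_output.reshape(-1, sr_output.shape[-1]).mean(axis=0)
        s2_means = s2_closest.reshape(-1, s2_closest.shape[-1]).mean(axis=0)
        return float(np.mean((sr_means - s2_means) ** 2))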

Perceptual loss

Despite being very popular and widely used in super-resolution tasks, the MSE, PSNR and SSIM functions do not reflect how well the high-resolution information has been learnt by the model, and better scores are often achieved by blurrier images. To counteract this effect, a perceptual term can be added to the loss, as proposed in Johnson et al. This is usually done by feeding the results of the SR model to another pre-trained network and comparing the activations of its internal layers between the (super-resolution) model output and the ground truth. When working with natural images, this pre-trained network is typically one of the standard pre-trained image classification networks (such as VGG, ResNet, etc.). Using these models is not suitable for our remote sensing application, both conceptually and practically, as we deal with at least 4-channel images.

Example internal layer activations of the field-delineation model used for perceptual loss.

A better choice for our use-case is the field-delineation model, since it represents high-order features tailored for a downstream application. We retrained the field-delineation model on Deimos imagery at 2.5m spatial resolution in order to learn the associated high-resolution features, hoping to push HighRes-Net to output imagery with more high-frequency detail. The perceptual loss term, which calculates the MSE between the ground truth activations and the super-resolved activations, was added to the training loss. Adding this term increased the sharpness of the output noticeably.
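A hedged PyTorch sketch of such a perceptual term: both the super-resolved image and the (histogram-matched) ground truth are passed through a frozen pre-trained network and the MSE between internal activations is penalised. The `feature_extractor` below is a stand-in for the retrained field-delineation model, and the weighting in the comment is an assumption.

    import torch
    import torch.nn.functional as F

    def perceptual_loss(sr_image, ground_truth, feature_extractor):
        """sr_image, ground_truth: tensors of shape (batch, channels, H, W)."""
        with torch.no_grad():
            target_features = feature_extractor(ground_truth)
        sr_features = feature_extractor(sr_image)
        return F.mse_loss(sr_features, target_features)

    # Combined objective, with an assumed weighting:
    # loss = pixel_loss + lambda_perc * perceptual_loss(sr, gt, delineation_net)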

Comparison between Sentinel-2 and super resolved imagery.

Quantitative results

Quantitatively evaluating the results of our model has proven to be a significant challenge in itself. The main reason is the lack of appropriate ground truth data with which to compare. While we have access to Deimos imagery that we use to train the model, there are some problems when it comes to using it for evaluation:

  • Our model is trained to match Sentinel-2 imagery, thus the bands have different distributions than Deimos (penalizing the validation scores).
  • Temporal changes over the observed area due to differences in acquisition times. The gap between the closest Sentinel-2 acquisition and the Deimos acquisition is usually on the order of a few days, during which the vegetation over the area can change (natural growth, agricultural activity, etc.). The model matches the features of the latest Sentinel-2 image, but if these do not represent the state of the field present in the Deimos imagery, the computed scores do not reflect the information we want them to.

The first issue can be reduced by histogram matching the ground truth Deimos image to the bicubically upsampled closest Sentinel-2 timestamp. While this is not perfect, Sentinel data at 2.5m resolution simply does not exist, so it is a compromise we need to accept. Similarly, we cannot do much about the second problem; it is just something we need to keep in mind when evaluating these results.

Calculating pixel-wise and perceptual scores, we see that the SR model outperforms bicubic upsampling for the majority of samples.

Comparison of super-resolution method (HRN) with the baseline (bicubic upsampling). Higher score is better.
Comparison of super-resolution method (SR HRN) with the baseline bicubic upsampling (S2 BIC) using two perceptual losses (feature and style). Lower score is better.

From a qualitative point-of-view, by visually inspecting the generated results we can observe the following:

  • the sharpness increase can be particularly noticed for high contrast features like roads, but less so for low-contrast features like agricultural land;
  • the model is able to learn texture, e.g. of trees in a forest. Unfortunately, it also learns to predict shadows due to low sun elevation angles, which are more prominent in Deimos-2 since the images were acquired on average 2 hours before the Sentinel-2 images. This can be noticed when looking at taller structures like buildings or trees. This is not a desirable feature, but it is tied to the differences in acquisition parameters between Sentinel and Deimos. Ideally, for this task, one would want to minimize such differences;
  • the model takes as input a temporal sequence of cloudless Sentinel-2 images (e.g. 8 frames) that can span multiple weeks. However, the model predicts super-resolved images that contain features from the latest Sentinel-2 image. This means that we can predict a super-resolved image for each cloudless Sentinel-2 image in a rolling-window fashion, as sketched below.
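An illustrative sketch of this rolling-window inference: for each cloudless Sentinel-2 acquisition, the most recent window of cloudless frames is fed to the model and the prediction is attributed to the latest frame. The `model` callable and the window size constant are placeholders, not project code.

    WINDOW = 8  # number of cloudless Sentinel-2 frames per prediction

    def rolling_super_resolution(model, cloudless_frames, timestamps):
        """cloudless_frames: list of (H, W, C) arrays ordered by time."""
        predictions = {}
        for end in range(WINDOW, len(cloudless_frames) + 1):
            stack = cloudless_frames[end - WINDOW:end]
            # The super-resolved output mostly reflects the latest frame in the stack.
            predictions[timestamps[end - 1]] = model(stack)
        return predictions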

Impact on downstream applications

Making super-resolution models is fun, but is there any noticeable benefit to the downstream applications from this activity?

To try to answer this question, we devised an experiment using our field-delineation model. The idea was to train two new field-delineation models, one on bicubically up-sampled Sentinel-2 imagery and one on super-resolved Sentinel-2 imagery. Comparing the scores obtained by the two models when predicting on the same imagery should tell us whether the super-resolving process adds any value.

On the raster level, the scores have noticeably increased, especially for the metrics computed on the boundary.

Field delineation Matthews Correlation Coefficient for extent, boundary and distance between the model trained on bicubically interpolated images (S2 BIC) and the one trained on super resolved imagery (S2 HRN)

On the vector level, we have computed three geometric metrics (over-segmentation, under-segmentation, border error, lower is better) over a test area. We see a noticeable improvement in over-segmentation and a slight worsening of the scores in regard to the under-segmentation and border error, which points towards the model being slightly better when recognizing smaller parcels.

Geometric metrics computed on the post-processed vectorized predictions for field delineation trained on bicubically up-sampled imagery (S2 Bicubic 4x) and on super-resolved imagery (SR HRN).

When trying to interpret these results, one has to keep in mind that, to obtain the vectors and the associated metrics, the model outputs are post-processed in various ways (temporal merging, contouring, vectorization), and all of these parts are configurable and affect the metrics. Outputs from both models were post-processed with the same configuration, which is not necessarily optimal for either specific model.

Do it yourself

We are always trying to make our work accessible and usable by others, so we have published the code we used. We encourage you to try it on your own data and area of interest. While our processing is tailored to Deimos imagery, which is not publicly available, any other VHR imagery can be used with some adjustments to the workflow.

Our next goal is to test how super-resolved imagery works together with our markers in the area-monitoring context. This is an equally challenging task in itself and will take some time, so keep an eye on our blog for updates. To be continued…

Meanwhile, if you have found this interesting, want to discuss this with us or just get in touch with any questions you might have, please write us at eoresearch@sinergise.com.

The project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement 870378.
