# How To Normalize Satellite Images for Deep Learning

**Tackling long-tailed satellite imagery data in deep learning applications**

Written by *Nika Oman Kadunc*. Work performed by **Devis Peressutti**, **Nejc Vesel**, **Matej Batič**, Sara Verbič, Žiga Lukšič, **Matej Aleksandrov** and Nika Oman Kadunc.

*Normalization of input data for deep learning (DL) applications is an important step that impacts network convergence and final results. In the case of long-tailed satellite signals, proper normalization can be quite a challenge. We were tired of trying to understand why the models we trained on one location didn't always translate to another location as well as we thought they should, so we set out to explore which normalization schemes are best suited for the task.*

**Introduction**

Deep-learning-based automatic field delineation from satellite images is becoming an important tool in large-scale evaluations and monitoring of land cover and crop production. One of the steps in the workflow is normalization of the band values, which impacts network performance and quality of the results.

The aim of this study is to investigate and quantify the effects of several normalization methods on the performance of our existing field delineation algorithm. In addition, we want to assess the feasibility of using a single set of normalization factors for large-scale applications, one that performs well under various types of variability in band distributions. For this purpose, the training dataset must include imagery from a large geographical region, to capture the reflectance variability, and from a long time period (a whole year), to capture the seasonal variability.

Proper normalization of the images is an often underestimated step, although it is essential to DL algorithms. Typically, the input images are normalized so that the mean is centered at 0 and the standard deviation at 1 [1][2]. This assumption holds when the distributions are close to normal, but it is less suited to reflectances or digital numbers (DN), as they are 0-bounded and long-tailed. In addition, saturated DN values represent outliers which can greatly affect the computed statistics. In this study, we investigate different normalization methods that are better suited to the properties of satellite imagery data and that allow us to center the distributions and reduce the impact of outliers.

We present our investigations of satellite image histograms, different normalization methods and their effect on the results of field delineation across the whole region of Europe. First, we present the dataset of satellite imagery obtained for the purpose of this study. Next, we investigate the effects of image histogram variability according to land type, geographical location and time period. As we are interested in field delineation on agricultural land, we focused on the variability of cropland according to geographical location. We then present three methods of histogram normalization and compare the results for automatic field delineation.

Although we focus on our algorithm for field delineation, the findings presented here could be applicable to different large-scale applications based on machine learning, such as land cover, crop classification and super-resolution.

**The dataset**

The dataset was designed to capture the variability of different geographical locations across Europe and of different time periods. Small patches were selected to give better spatial sampling for a given total sampled area. We obtained Sentinel-2 L1C bands for 10 000 randomly distributed patches of size 256 × 256 pixels at 10 m resolution, covering Europe and some neighboring regions over a whole year. The patches correspond to 0.66 % of the given European AOI and are shown in Fig. 1 (top). Bands B2, B3, B4 and B8 were selected for analysis. The images were filtered to remove snowy and cloudy acquisitions using s2cloudless and the snow masking available in eo-learn. After filtering, a total of 180M pixels constituted the dataset. ESA World Cover data was also downloaded for each of the patches to obtain information on the land type (i.e. tree cover, cropland, water, etc.). An example of a patch and the corresponding land cover data is shown in Fig. 1 (bottom).

The entire workflow was implemented using the functionalities of Sentinel Hub, eo-learn and eo-grow.

**A look into the dataset**

We explored the properties and the variability of the histograms of considered bands in terms of different parameters: land cover, geographical location and time period.

**Land cover exploration**

Firstly, the distribution of sampled pixels in terms of land cover was investigated and is presented in Fig. 2. The ESA Land Cover does not provide intra-year temporal data, so we have taken a single time frame of the Sentinel-2 data into account for this exploration.

We see in Fig. 2 that tree cover is the most represented class of land cover in the sample dataset, followed by grassland and cropland. Bear in mind that these distributions are subject to the classification error of World Cover, so the actual values might slightly differ.

Next, we explore the contributions of different land cover classes to the whole band DN values histograms of pixels (Fig. 3). Remember that for Sentinel-2, reflectances are obtained from DN values by dividing them by 10 000. The logarithmic scale of the data is added for better visibility and easier comparison. We see from Fig. 3 that water dominates the left part of the histograms (bands B3, B4, B8). Cropland class, which is of most interest in this analysis, lies in the mid part of the histogram, strongly overlapping with grassland in bands B2, B3 and B4. The information about relative position of specific land cover classes within the whole histogram can be important when choosing the appropriate normalization method that can affect different parts of the histograms.

**Geographical exploration**

For the purpose of the geographical variability analysis of the band DN values, the Europe AOI was partitioned into different regions (depicted in Fig. 4). Five distinct regions defined by the partition grid were selected to compare the variation of the histograms, shown in Fig. 5.

We see in Fig. 5 that the variability in the DN values is much more correlated with the latitude of the region than with the longitude, as the histograms from regions with similar latitudes show the most similarity. The difference is most apparent in the band B4 values.

To analyze the geographical (locational) variability of the agricultural land across the Europe region, a comparison of only the cropland land cover class was made and is shown in Fig. 6. Again, differences according to region latitude can be observed in the cropland histograms. For instance, the blue (Spain) and purple (Turkey) histograms are very similar, and greatly differ from the red (Baltic countries) and orange (United Kingdom) histograms, which are however similar between themselves. The green histogram (Hungary) lies between the two pairs mentioned above. These differences mean that the normalization factors computed for one region might not be equally suitable for normalization of a region with a different latitude.

**Temporal exploration**

Finally, an analysis of temporal differences in band values was made. The dataset was divided into monthly time periods and the distribution of samples with regards to months is shown in Fig. 7. We can see that when filtering the region of Europe with the snow and cloud mask, acquisitions in winter months get filtered out more, resulting in a distribution with a peak in July.

Fig. 8 shows the temporal variability of the band DN histograms. In all bands, the discrepancy is the most apparent between the months of January and July, due to the difference in the presence of vegetation. These differences are the most discernible in the B4 and B8 bands, which strongly reflect changes in vegetation.

The temporal variability of satellite imagery is very important when choosing the time period for DL purposes, especially for field delineation. The fields change significantly with the seasons and are subject not only to natural changes but also cultivation activities.

**Histogram normalization**

Normalization of the input data is very important because the network training convergence depends on the input values; it converges faster if its inputs are transformed to have zero means and unit variances [1]. This is called input histogram normalization or standardization. It allows the network to operate in a good range, since it is usually initialized with random weights with 0 mean. Normalizing images with regards to standard deviation prevents the gradients from exploding, which could happen if values of the computed features are too large, making the convergence of the network more difficult.
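The zero-mean, unit-variance standardization described above can be sketched in a few lines of NumPy. The function name is ours; in a large-scale setting, the mean and standard deviation would be precomputed over the whole training dataset rather than per image:

```python
import numpy as np

def standardize(band, mean=None, std=None):
    """Zero-mean, unit-variance standardization of a band array.

    If mean/std are not given, they are computed from the array itself;
    in practice they would be precomputed over the whole training set.
    """
    mean = band.mean() if mean is None else mean
    std = band.std() if std is None else std
    return (band - mean) / std
```

Applied to a long-tailed DN distribution, this centers and rescales the values, but the tail shape (and the influence of saturated outliers on the computed statistics) remains.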

While histogram standardization may be the best choice in many cases where the input data follows a nearly normal distribution, in the case of DNs, where the band distributions are long-tailed and 0-bounded, applying standardization does not give the data the desired properties for network operation. This issue has already been identified and addressed by applying a different normalization function to the band data [3]. In this study, we test three different normalization methods that aim to give better-balanced data for the network in field delineation.

**Linear normalization**

The linear normalization re-maps the given range of input values to a different range, more appropriate for the task at hand. It is performed by applying a linear scaling function to the band DN values:

*x′* = *a* + (*x* − *c*) · (*b* − *a*) / (*d* − *c*)

where *a* and *b* are the lower and upper limits of the resulting range and *c* and *d* are the lower and upper limits of the input range. The resulting range of values was chosen to be between 0 and 1, and the scaling can be performed choosing different values for *c* and *d*. One option is to simply take the *min* and *max* values of the input data. The problem, especially with long-tailed signals, is that a single outlier can greatly affect the value of *c* or *d*, which can result in a very unrepresentative scaling. A more robust approach is to select *c* and *d* at the 1st and 99th percentiles of the value histogram, which reduces the impact a few outliers can have on the scaling.

We tested the linear normalization with three different sets of parameters:

- *c, d* as *min*, *max*
- *c, d* as 1st and 99th percentile
- *c, d* as 1st and 99th percentile with the range bounded between 0 and 1
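A minimal NumPy sketch of this linear normalization, covering the three parameter sets above (the function name and keyword arguments are ours, for illustration):

```python
import numpy as np

def linear_normalize(band, c=None, d=None, percentiles=None, bound=False):
    """Linearly remap band values from the input range [c, d] to [0, 1].

    c and d default to min/max of the data; if `percentiles` is given
    (e.g. (1, 99)), c and d are taken at those percentiles instead.
    With bound=True the result is clipped to [0, 1], condensing the
    outliers into the extreme values.
    """
    if percentiles is not None:
        c, d = np.percentile(band, percentiles)
    else:
        c = band.min() if c is None else c
        d = band.max() if d is None else d
    out = (band - c) / (d - c)
    return np.clip(out, 0.0, 1.0) if bound else out
```

The three tested configurations then correspond to `linear_normalize(band)`, `linear_normalize(band, percentiles=(1, 99))` and `linear_normalize(band, percentiles=(1, 99), bound=True)`.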

Some examples of band histogram transformations with the linear normalization are presented in Fig. 9.

We see in Fig. 9 that with *c* and *d* as *min*, *max* (top right), the range of the histogram values is transformed to the interval [0, 1], but the long-tail shape of the histogram stays the same. This means that the effective range is reduced to a smaller interval, for instance, for band B2, between 0.02 and 0.08. On the other hand, with *c, d* as the 1st and 99th percentiles with no bounding, the mid-part of the histogram is mapped to [0, 1] and the lower and upper 1 % of values extend beyond this interval, retaining the long-tail shape of the distribution. With the use of bounds, the whole range of the histogram values lies within the interval [0, 1], where the lower and the upper percent of values are squeezed (condensed) into the extreme (first or last) histogram bins. Bounding can therefore introduce some information loss. Since these transformations are linear, the shape of the original distribution is maintained.

**Dynamic World normalization scheme**

We tested the normalization scheme introduced in [3]. First, the log-transform is used on the original signal in order to deal with the imbalanced long-tailed values. Next, the 30th and 70th percentiles of the log-transformed signals are remapped to points on a sigmoid function. This bounds the resulting histogram range to the interval [0,1] and squeezes (condenses) the extreme values to a smaller range [3]. To experiment with the effect of the log-transform on the normalization, we additionally tested the scheme without the log-transform and also with different percentile values for the remapping. The four parameter sets used were:

- 30th / 70th percentile with log transformation
- 30th / 50th percentile with log transformation
- 30th / 95th percentile without log transformation
- 20th / 95th percentile without log transformation
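The scheme above can be sketched as follows. Note that this is our own reading of the idea in [3], not the published implementation: the choice of anchoring the low/high percentiles at −1 and +1 on the sigmoid input axis is an assumption of this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dw_style_normalize(band, p_low=30, p_high=70, use_log=True):
    """Sketch of a Dynamic-World-style normalization [3].

    Optionally log-transforms the signal, then linearly rescales it so
    that the chosen low/high percentiles land at fixed points on a
    sigmoid, bounding the output to (0, 1) and condensing the tails.
    The anchor points (-1 and +1 on the sigmoid input axis) are an
    assumption of this sketch, not the published constants.
    """
    x = np.log(band + 1.0) if use_log else band.astype(float)
    lo, hi = np.percentile(x, [p_low, p_high])
    # map lo -> -1 and hi -> +1 on the sigmoid input axis
    scaled = 2.0 * (x - lo) / (hi - lo) - 1.0
    return sigmoid(scaled)
```

With `use_log=True` the tail is flattened before the sigmoid squashing; with `use_log=False` the sigmoid alone condenses the long tail, similar to the bounding in the linear normalization.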

Examples of transformed histograms are presented in Fig. 10.

Comparing the log and no-log normalizations in Fig. 10, we see that log has the effect of retaining the flat tails of the histogram and the no-log normalization squeezes (condenses) the values of the long histogram tail, similar to the bounding in linear normalization. Using different values for the mapped percentiles gives slightly different shapes of the resulting normalized histograms. In addition, these non-linear transformations change the original distribution of the band values.

**Histogram equalization**

Unlike linear normalization, histogram equalization is a type of histogram modeling technique that applies a non-linear mapping between the input and the resulting signal and offers a way to obtain any desired histogram shape. Histogram equalization defines a mapping based on the cumulative histogram and re-maps the input (in our case long-tailed DN histogram) to a uniform distribution. It increases the contrast by spreading out band values to the entire output range. We used 40 000 bins for each band to construct the cumulative distribution of the Europe dataset to obtain the mapping function. Fig. 11 shows the resulting uniform distributions obtained for each of the bands after histogram equalization of the dataset histograms.
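The cumulative-histogram mapping can be sketched with NumPy. In the study the mapping was built once from the whole-Europe dataset with 40 000 bins per band; in this sketch (function name ours) it is fit on the input array itself for illustration:

```python
import numpy as np

def equalize(band, n_bins=40_000):
    """Histogram equalization: remap values to [0, 1] through the
    empirical cumulative distribution, flattening the histogram."""
    hist, edges = np.histogram(band, bins=n_bins)
    cdf = np.cumsum(hist).astype(float)
    cdf /= cdf[-1]                              # normalize CDF to [0, 1]
    centers = 0.5 * (edges[:-1] + edges[1:])    # bin centers as sample points
    return np.interp(band, centers, cdf)        # map each value to its quantile
```

Each value is mapped to (approximately) its quantile in the dataset, which is why the output histogram is close to uniform on [0, 1].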

**Visualization of the normalization methods**

For comparison, we can visually present the effect of the three normalization methods. Fig. 12 shows some examples of Sentinel L1C RGB images (bands B4, B3, B2) and false color images (bands B8, B4, B3) under different normalization transformations. An advantage of using an output range between 0 and 1 is that we can visually assess and interpret the effect of normalization, which would be more challenging if we used, for instance, a range of [–1,1].

As we can see in Fig. 12, the linear normalization with *min* / *max* as *c* and *d* has no effect on the image appearance: the values get shifted, but retain all the properties of the original band histograms. With the use of the 1st and 99th percentiles as *c* and *d*, the contrast of the image is improved, both with and without bounding. The Dynamic World log and no-log transformations have a more dramatic effect on the images, since they change the shape of the band histograms. We see that the differences between vegetated and non-vegetated land are additionally enhanced, which is most apparent in the false color images showing the B8 band in red. This effect could be beneficial for many image processing or DL applications aiming to distinguish vegetated from non-vegetated regions. Histogram equalization has the most dramatic visual effect on the images, which is not surprising, as the histograms get re-shaped from a narrow-peaked, long-tailed histogram to a uniform distribution.

Although these transformations are interesting to a human eye, the effect on the network performance is not always predictable and straightforward. So we further investigated which of these transformations is the most appropriate for our field delineation application.

**Field delineation experiments**

To explore the effect of different normalization methods on the training of the DL architecture, we set up a set of experiments with our existing field delineation algorithm. We used a subset of the ai4boundaries dataset to train the model, with a U-Net with randomly initialized weights as the base model. The model was trained for 4 epochs for each of the normalization methods.

The results of the experiments are first compared in terms of convergence of the network through evaluation of the losses for the training and the validation set. These are presented in Fig. 13.

We see in Fig. 13 that the linear normalization with *c, d* as 1st and 99th percentile gives the best results for training and validation loss and is closely followed by its bounded version. The worst result is obtained using the linear normalization with *c, d* as *min* / *max*, although faster convergence can be observed compared to other methods. Histogram equalization and the non-linear Dynamic World normalization scheme in all its tested forms yield comparable results in this test run.

While the loss for the three linear normalization methods is comparable between training and validation datasets, this is not the case for the non-linear methods. For these, the validation loss is not monotonically decreasing, which might indicate less stable convergence.

The performance metrics in terms of intersection over union (IoU), accuracy and Matthews correlation coefficient (MCC) were also computed and are presented in Fig. 14. They show similar behavior and ranking of the normalization methods, with the linear normalization with *c, d* as 1st and 99th percentile giving the best results and the linear normalization with *c, d* as *min* / *max* performing the worst.
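For reference, all three metrics can be derived from the confusion counts of binary masks. A minimal sketch (our own helper, not the project's evaluation code):

```python
import numpy as np

def binary_metrics(pred, target):
    """IoU, accuracy and MCC for binary masks of the same shape,
    e.g. predicted and reference field masks. Assumes at least one
    positive and one negative in pred and target (no degenerate masks)."""
    tp = np.sum(pred & target)    # true positives
    tn = np.sum(~pred & ~target)  # true negatives
    fp = np.sum(pred & ~target)   # false positives
    fn = np.sum(~pred & target)   # false negatives
    iou = tp / (tp + fp + fn)
    acc = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        float(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return iou, acc, mcc
```

Unlike accuracy, IoU ignores true negatives and MCC balances all four counts, which makes both more informative than accuracy for sparse boundary masks.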

It’s worth noting that 4 epochs may not be enough to draw conclusions on the final convergence state and possibly all methods might reach the same performance given enough time. However, our analysis already shows that some methods lead to a sharper and more stable convergence rate than others.

We see that the choice of the appropriate normalization method can affect both the convergence in the training and validation phases and the final results. The choice is not straightforward, though: similar methods can give quite different results, as we see in the case of the different types of linear normalization, and conversely, substantially different normalizations can produce rather similar results in terms of network convergence and final performance. The observed results of our experiments indicate that mapping the main part of the histogram data into the interval [0, 1] while moving outlier values out of this interval (by the use of the 1st and 99th percentiles in the linear normalization) has a large positive effect on the network convergence and performance.

**Conclusions**

We explored locational and temporal variability of satellite imagery band data and found that latitude is the most important locational parameter affecting the DN band histograms in our study area, probably because of its effect on climate and vegetation. Also, vegetation changes throughout the seasons contribute the most to the temporal variability. Both effects are also reflected in the cropland histogram, which is especially important for field delineation purposes.

We chose three different types of histogram normalization with different parameters and investigated their impact on the resulting DN band histograms. Although the visual effects can be quite dramatic in the case of non-linear band histogram transformations, convergence and performance of the network are not affected by these changes. Rather, the results of our field delineation experiments showed that even small modifications to the same method (e.g. how we transform outlier values) can have a much larger impact on the convergence and the performance of the model.

**References**

[1] Ioffe, Sergey, and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift.” *International conference on machine learning*. PMLR, 2015.

[2] Wiesler, Simon, and Hermann Ney. “A convergence analysis of log-linear training.” *Advances in Neural Information Processing Systems* 24 (2011).

[3] Brown, Christopher F., et al. “Dynamic World, Near real-time global 10 m land use land cover mapping.” *Scientific Data* 9.1 (2022): 1–17. https://doi.org/10.1038/s41597-022-01307-4

The project has received funding from the European Union's Horizon 2020 Research and Innovation Programme under Grant Agreement 101004112, the Global Earth Monitor project.