# Area Monitoring — Similarity Score

*Learning from peers*

*This post is one of the series of blogs related to our work in Area Monitoring. We have decided to openly share our knowledge on this subject as we believe that discussion and comparison of approaches are required among all the groups involved in it. We would welcome any kind of feedback, ideas and lessons learned. For those willing to do it publicly, we are happy to host them at this place.*

*The content:*

- High-Level Concept
- Data Handling
- Outlier detection
- Similarity Score (this post)
- Bare Soil Marker
- Mowing Marker
- Crop Type Marker
- Homogeneity Marker
- Parcel Boundary Detection
- Land Cover Classification (still to come)
- Minimum Agriculture Activity (still to come)
- Combining the Markers into Decisions
- Traffic Light System
- Expert Judgement Application

The similarity score is based on the assumption that all claims for a crop of a certain group in the local neighbourhood (say, within 20 km and at a similar altitude) should produce a similar signal (e.g. the behaviour of a vegetation index over a time-series). All non-border pixels in all parcels of that crop type can be extracted and compared, and any deviation from similarity may be attributed to, for example, a wrong claim, a different farming practice or the quality of the soil (water availability).

Hence, the similarity score evaluates how similar a *Feature Of Interest* (FOI) is to other FOIs from its neighbourhood having the same (or a different) claim. For example, how similar is a cornfield to other cornfields in its vicinity? The similarity score is defined by the following equation:

$$
S_i^{crop} = \frac{1}{N_i} \sum_{k=1}^{N_i} \left( \frac{P_i^k(VI) - \mu_{crop}^k(VI)}{\sigma_{crop}^k(VI)} \right)^2
$$

where $P_i^k(VI)$ represents the *k*-th observation of the Vegetation Index (*VI*) of an FOI with index *i*, $\mu_{crop}^k(VI)$ is the mean of the same vegetation index estimated on the date of the *k*-th observation for the *n-nearest-neighbours* of FOI *i* with the *crop* type claimed, and $\sigma_{crop}^k(VI)$ is their standard deviation. The sum over *k* runs over all $N_i$ valid (i.e. cloudless) observations.

The similarity score is, in principle, a reduced χ² statistic, related to likelihoods and null hypotheses. A low value of the similarity score for an FOI indicates that this FOI is similar to its neighbours with the same claim, and a high value indicates that it is not, perhaps due to a wrong claim, a different farming practice, the quality of the soil (water availability), etc. The similarity score could be used for:

- cleaning training datasets (reducing label noise by eliminating wrong claims),
- early notification of farmers and others about potential mistakes in the claim,
- providing additional information in the Expert Judgement Application (where experts can make their decisions interactively).

Similarity scores can easily be calculated automatically for all FOIs in the dataset, without any model training or parameter tuning. However, being based on only a single vegetation index, the score is fairly crude.
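The calculation can be sketched in a few lines of NumPy. This is a minimal illustration rather than our production code: it assumes the score is the mean squared z-score of the target against its neighbours over the valid dates, that all time-series are already sampled on the same dates, and that invalid (e.g. cloudy) observations are marked as NaN. The function name `similarity_score` and the array layout are hypothetical.

```python
import numpy as np


def similarity_score(target_vi, neighbour_vi):
    """Mean squared z-score of a target FOI against its neighbours.

    target_vi:    shape (K,), vegetation index of the target FOI per date,
                  NaN where the observation is invalid (assumption).
    neighbour_vi: shape (n, K), the same index for n neighbouring FOIs of
                  the hypothesised crop type, NaN where invalid.
    """
    target_vi = np.asarray(target_vi, dtype=float)
    neighbour_vi = np.asarray(neighbour_vi, dtype=float)

    mu = np.nanmean(neighbour_vi, axis=0)    # neighbourhood mean per date
    sigma = np.nanstd(neighbour_vi, axis=0)  # neighbourhood spread per date

    # Keep only dates where target, mean and a non-zero spread are available.
    valid = np.isfinite(target_vi) & np.isfinite(mu) & (sigma > 0)
    if not valid.any():
        return float("nan")

    z = (target_vi[valid] - mu[valid]) / sigma[valid]
    return float(np.mean(z ** 2))


neighbours = np.array([[0.2, 0.5, 0.8],
                       [0.4, 0.7, 0.6]])
score = similarity_score([0.3, 0.7, 0.5], neighbours)
```

Calling the same function with the neighbours of a different crop type gives the score for that other hypothesis.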

The figure below shows the NDVI profile of a target FOI claimed as corn and the average NDVI profile of up to 500 cornfields within approximately 10 kilometres, at a similar altitude. In this case, the similarity score is 0.49, which indicates that the target FOI is similar to others with the same claim from the neighbourhood. This can also be confirmed visually by comparing the two NDVI profiles (the target FOI as a green dashed line with orange nodes, the blue line representing the average neighbourhood corn value and the light blue area indicating the standard deviation).

A more interesting example is shown in the figure below. In this case, a target FOI claimed as permanent meadow is found to be not at all similar to other meadows from the neighbourhood (the similarity score is 8.07). 99.8% of all FOIs with a meadow claim in the dataset have a similarity score under the meadow hypothesis of less than 8.07. The same target FOI has its smallest similarity score when compared with cornfields in its neighbourhood. In this case, the similarity score is 0.48. Only 0.1% of FOIs with a meadow claim have a similarity score under the corn hypothesis of less than 0.48. Distributions of similarity scores for the meadow and corn hypotheses, for FOIs with meadow and corn claims, are shown in the figure below.

A plot of the NDVI profiles also visually confirms the dissimilarity of the target FOI with respect to meadows and its similarity with respect to cornfields.
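Statements such as "99.8% of meadow claims score below 8.07" are simple empirical fractions over the score distribution, and the most plausible hypothesis is the one with the smallest score. A minimal sketch, using hypothetical helper names and made-up numbers purely for illustration:

```python
import math


def fraction_below(population_scores, value):
    """Share of finite scores in the population that are strictly below value."""
    finite = [s for s in population_scores if math.isfinite(s)]
    if not finite:
        return float("nan")
    return sum(s < value for s in finite) / len(finite)


def best_hypothesis(scores_by_hypothesis):
    """Crop hypothesis with the smallest finite similarity score."""
    finite = {h: s for h, s in scores_by_hypothesis.items() if math.isfinite(s)}
    return min(finite, key=finite.get) if finite else None


# Illustrative values only, not taken from our dataset.
meadow_scores = [0.5, 1.0, 2.0, 9.0]
share = fraction_below(meadow_scores, 8.07)
guess = best_hypothesis({"meadow": 8.07, "corn": 0.48})
```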

The above plots of similarity score distributions for the meadow and corn hypotheses show that meadows can be well separated from cornfields using either of these two similarity scores. This conclusion might be trivial for meadows versus corn, but it is less so when, for example, meadow is compared with summer or winter wheat. Receiver operating characteristic (ROC) curves and the area under the ROC curve (ROC AUC) quantify how well a score discriminates between two classes. The figure below shows ROC curves and ROC AUC values for meadows, using similarity scores calculated under the meadow and the summer wheat, corn or winter wheat hypotheses.

Such a comparison can be extended to all possible crop type pairs. Doing this for the whole dataset yields the matrix of ROC AUC values shown in the figure below. An element of the ROC AUC matrix (row A, column B) gives the ROC AUC value calculated over all FOIs with labels A and B that have a non-NaN (i.e. computable) similarity score for hypothesis B. An FOI with label A has a non-NaN similarity score for hypothesis B if there are at least 20 FOIs with label B within a radius of around 10 kilometres. True A cases should have a large similarity score for hypothesis B, while true B cases should have a smaller one. A ROC AUC value close to 1 indicates that the similarity score separates the two classes well, while a value close to 0.5 indicates that it cannot separate them.
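One element of such a matrix can be sketched without any library, using the fact that ROC AUC equals the probability that a randomly chosen positive case scores higher than a randomly chosen negative one (ties counting half). The sketch below treats label-A FOIs as positives and label-B FOIs as negatives, both scored under hypothesis B. The function name is hypothetical, and the quadratic loop is chosen for readability, not speed.

```python
import math


def roc_auc(scores_a, scores_b):
    """ROC AUC between label-A and label-B FOIs, both scored under hypothesis B.

    Equals the probability that a random A case has a larger score than a
    random B case; NaN (non-computable) scores are ignored.
    """
    a = [s for s in scores_a if not math.isnan(s)]
    b = [s for s in scores_b if not math.isnan(s)]
    if not a or not b:
        return float("nan")

    wins = 0.0
    for sa in a:
        for sb in b:
            if sa > sb:
                wins += 1.0
            elif sa == sb:
                wins += 0.5
    return wins / (len(a) * len(b))
```

Repeating this for every ordered pair of crop labels fills the matrix row by row.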

Many of the elements in the matrix have high values, but there are blocks in the matrix with lower ROC AUC values. One such block is, for example, at the lower right corner of the matrix, corresponding to crop types with code 80X: winter cereals, which are of course very similar to each other.

## Other distance measures

The similarity score is closely related to the Euclidean distance (ED) between two time-series **x** and **y** of length *m*, defined as:

$$
ED(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}
$$

where *i* denotes the *i*-th element of the time-series. The ED assumes that the samples of both series are taken at exactly the same (temporal) locations. When they are not, the source time-series can first be resampled (e.g. by linear or nearest-neighbour interpolation) onto the time stamps of the target time-series.
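As a minimal sketch of the resampling step, the source series can be linearly interpolated onto the target's time stamps with `numpy.interp` before the distance is taken. The function name and the use of plain numbers (e.g. day of year) as time stamps are assumptions made for this illustration.

```python
import numpy as np


def euclidean_distance(t_x, x, t_y, y):
    """ED between x and y after linearly resampling x onto y's time stamps.

    t_x must be increasing; values of t_y outside the range of t_x are
    clamped to the first or last value of x by numpy.interp.
    """
    x_on_y = np.interp(t_y, t_x, x)
    return float(np.sqrt(np.sum((x_on_y - np.asarray(y, dtype=float)) ** 2)))


# Source observed on days 0, 10, 20; target observed on days 5 and 15.
d = euclidean_distance([0, 10, 20], [0.0, 1.0, 2.0], [5, 15], [0.5, 4.5])
```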

The prior temporal alignment of the compared FOI time-series through resampling can be avoided by using the Dynamic Time Warping (DTW) (dis)similarity measure. Here, the time-series are optimally aligned (or *warped*) in the temporal domain so that the accumulated cost of this alignment is minimal. In the canonical form, this accumulated cost can be obtained by dynamic programming, recursively applying

$$
D(i, j) = f(x_i, y_j) + \min \{ D(i-1, j),\; D(i, j-1),\; D(i-1, j-1) \}
$$

for $i = 1, \dots, M$ and $j = 1, \dots, N$, where *M* and *N* are the lengths of **x** and **y**. The local cost function $f()$ depends on the task at hand. In the case of uni-dimensional time-series, the square of the difference between $x_i$ and $y_j$ is usually taken. For multi-dimensional time-series, the Euclidean distance is often used. The final DTW measure corresponds to the total accumulated cost over *M* and *N*, i.e. $D(M, N)$.
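The canonical recursion, in which each cell adds the local cost to the cheapest of its three predecessors, can be sketched in plain Python. The boundary conditions (zero cost at the origin, infinite cost elsewhere on the border) are the usual textbook choice and are stated here as an assumption. The function name and the default squared-difference cost are illustrative.

```python
import math


def dtw_distance(x, y, cost=lambda a, b: (a - b) ** 2):
    """Accumulated DTW cost between sequences x (length M) and y (length N)."""
    m, n = len(x), len(y)
    # D[i][j] holds the minimal accumulated cost of aligning x[:i] with y[:j].
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = cost(x[i - 1], y[j - 1]) + min(
                D[i - 1][j],      # x advances, y stays
                D[i][j - 1],      # y advances, x stays
                D[i - 1][j - 1],  # both advance
            )
    return D[m][n]


# The same shape observed with a repeated sample warps onto itself at no cost.
d = dtw_distance([0.0, 1.0, 2.0], [0.0, 1.0, 1.0, 2.0])
```

This direct form needs O(MN) time and memory. Practical implementations often restrict the warping window, which is not shown here.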

Use cases for the Euclidean and Dynamic Time Warping distance measures include, for example:

- Identification of the FOIs in the neighbourhood of a target FOI that have the smallest distances to it. The Expert Judgement Application could, for example, display the FOIs that are closest to the target FOI in the feature space according to any of these distance metrics. The figures below show the NDVI time-series of a target FOI together with the 5 FOIs with the smallest distance to it.

- By looking at distances between different crop types, a developer of CAP markers and CAP systems can gain better insight into which crop types (or groups) are (dis)similar to each other. The figures below show violin plots of how the distance measure for FOIs claimed as crop type A differs when they are compared with FOIs claimed to grow something else.
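The first use case, picking the closest FOIs to a target, can be sketched as follows. The sketch assumes that all time-series are already aligned on the same dates and uses the Euclidean distance. The function name and array layout are hypothetical.

```python
import numpy as np


def nearest_fois(target, candidates, k=5):
    """Indices and distances of the k candidates closest to the target.

    target:     shape (K,), time-series of the target FOI.
    candidates: shape (n, K), time-series of neighbouring FOIs on the same dates.
    """
    target = np.asarray(target, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    dists = np.sqrt(np.sum((candidates - target) ** 2, axis=1))
    order = np.argsort(dists)[:k]  # smallest distances first
    return order, dists[order]


candidates = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [0.1, 0.0]]
idx, dist = nearest_fois([0.0, 0.0], candidates, k=2)
```

Swapping the distance line for a DTW computation gives the same ranking logic for series that are not temporally aligned.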

Above we showed two example FOIs: one wrongly claimed to be a meadow and the other claimed to be a cornfield. For the same two FOIs, the figures below show what we can learn from the Euclidean distances between each of them and its 500 neighbours. Both FOIs turn out to be most similar to corn or maize for silage. The fact that the wrongly claimed meadow FOI is an outlier among the other meadow FOIs in its neighbourhood is also indicated by a high (1.00) intra-class rank value. This value represents the normalised rank of the average time-series distance between the target FOI and FOIs of the same declared crop type, with respect to all the pairwise distances between FOIs of the same crop type. The rank value can be used to identify outliers in a way similar to the similarity score.
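One possible reading of the intra-class rank is sketched below: the average distance from the target to the FOIs of its declared crop type is ranked among all pairwise distances within that crop type, and the rank is normalised to the range 0 to 1. This is an interpretation of the description above, not necessarily the exact computation behind the figures. It assumes aligned time-series, the Euclidean distance and at least two same-class FOIs, and the function name is hypothetical.

```python
import numpy as np


def intra_class_rank(target, same_class):
    """Normalised rank of the target's mean distance among intra-class distances.

    target:     shape (K,), time-series of the target FOI.
    same_class: shape (n, K), neighbouring FOIs with the same declared crop
                type, excluding the target itself (n >= 2 assumed).
    """
    target = np.asarray(target, dtype=float)
    same_class = np.asarray(same_class, dtype=float)

    # Average distance between the target and each same-class FOI.
    target_mean = np.mean(np.linalg.norm(same_class - target, axis=1))

    # All pairwise distances between the same-class FOIs (upper triangle only).
    diffs = same_class[:, None, :] - same_class[None, :, :]
    pairwise = np.linalg.norm(diffs, axis=2)[np.triu_indices(len(same_class), k=1)]

    # Fraction of pairwise distances not exceeding the target's mean distance.
    return float(np.mean(pairwise <= target_mean))


meadows = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outlier_rank = intra_class_rank([10.0, 10.0], meadows)
typical_rank = intra_class_rank([0.3, 0.3], meadows)
```

A value near 1 means the target is further from its class than almost any two members of that class are from each other.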

*Our research in this field is kindly supported, in grants and knowhow, by our cooperation in Horizon 2020 (**Perceptive Sentinel**, **NIVA**, **Dione**) and ESA projects (**Sen4CAP**).*