Getting to the truth of your labels: how to be confident in the absence of ground truth data

Pascal Philipp
UK Hydrographic Office
Apr 6, 2020
Image provided by UK Hydrographic Office © Crown copyright 2020 UK Hydrographic Office

The Data Science team here at the UKHO creates machine learning models that are trained to solve global geospatial problems, such as detecting certain features in satellite images.

To gather the labelled data required for training and testing, we can’t always rely on ‘on the ground’ observations (ground truth), but often work with labels that are based on the interpretation of imagery and other data. Variation in human interpretation, low resolution image data, and potential ambiguity of the task introduce uncertainty that can make it difficult to train our models and to test and compare them in meaningful ways.

We were faced with such uncertainty during work on the mangrove project, which was covered from a Data Science and a Data Engineering perspective in recent blog posts. Because of the large volume of data required, only part of the labelling for the training and testing of the mangrove model was carried out by remote sensing experts; the remaining labels were assigned by non-expert analysts. Later, when we reviewed the performance of the model, some of its classifications appeared to be more accurate than the labels themselves. Hence the model may actually be performing better than its quantitative assessment against possibly imperfect labels suggests. This makes it difficult to quantify and describe the skill of the model, to inform decisions during training, and to report to the customer the level of confidence we may have in the model’s classifications. We have therefore started work on dealing with uncertain labels and on deriving confidence statements for situations where no ground truth data exists.

When assessing the performance of an ML model, a number of different performance metrics are often considered: balanced accuracy, precision, recall, the kappa statistic, etc. These measures are usually obtained by comparing the model’s classifications to labels assigned by expert analysts or possibly non-experts. However, what if some of these labels are ambiguous as a result of variation in the interpretation of imagery? In order to provide meaningful confidence statements for our models, we need to take this possibility into account.

So, if our ML model classified a data point as the object of interest, what is the probability that it really is the object of interest?

In the following, we’ll be using kelp as the object of interest, as this is one of our current projects (an image segmentation task on satellite images), but the thoughts presented can be applied to most classification or segmentation tasks.

The distinction between labels and the truth can be quite important for some tasks. For others it is less significant, but even in those cases, the techniques below provide useful insight. For example, while there are no global kelp data sets, there are some local data sets (created by academic studies or local authorities) that we can compare to subsets of our global classifications. Say we do this in the Falklands (lots of kelp) and then in the Gulf region (a relatively small amount of kelp). The Gulf comparison is likely to show a much better rate of agreement simply because there is far less kelp over which discrepancies can arise. This doesn’t mean that our classifications in the Gulf are better, right? Some of the thoughts below show how to carry out such comparisons in a more robust way.

As with any mathematical modelling, there are preconditions that should be met. Here, the main assumption is that the labels are assigned by experts, e.g. by members of the UKHO Remote Sensing team. Since we can’t compare labels to ground truth data, the idea for assessing their quality is to compare them to labels assigned by an independent labeller. If the labellers are experts, any discrepancies will be due to ambiguity and ill-definedness of the task (e.g. borderline cases, differences in the way the boundaries of kelp patches are drawn, sub-optimal resolution of the images, etc.), and the amount of discrepancy will lead to an understanding of the amount of uncertainty in the task. Labelling carried out by non-experts such as crowdworkers (e.g. on Amazon Mechanical Turk) would require additional modelling, since genuine mistakes and possibly systematic misinterpretations would then also need to be taken into account.

The path to confidence statements

So, if our ML model has classified a pixel as kelp, what is the probability that this pixel really is kelp? For this we first note that the distinction ‘kelp or not kelp’ exists on three levels:

1. Does the pixel contain kelp or not? (Truth)

2. Has the pixel been labelled as kelp? (Label)

3. Has the pixel been classified as kelp by the ML model? (Classification)

The different options are displayed in the following diagram.

Diagram generated by the author using Excel

A plus sign stands for ‘kelp’ and a minus sign for ‘not kelp’. For example, among all the pixels in all the scenes considered, there will be some pixels that:

1. truly contain kelp (T+)

2. have been labelled as kelp (L+)

3. and have been classified as kelp by the ML model (C+).

We can denote such pixels [T+, L+, C+] — kelp that was labelled correctly and classified correctly. A pixel that does not contain kelp, was labelled as not kelp, but classified as kelp would be denoted [T-, L-, C+], and so on. In total, there are eight possibilities, corresponding to the eight different paths from Pixel to the C’s on the right of the diagram. Each of the line segments has an associated probability (a value between 0 and 1; not displayed) and multiplying these probabilities along a path gives the rate at which the corresponding type of pixel occurs. For the case [T+, L+, C+], we multiply the three probabilities along the path Pixel — T+ — L+ — C+. By combining several such computations, we can answer the question above and hence quantify our confidence in the classifications made by the ML model.
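As a concrete illustration, here is a minimal sketch of this bookkeeping in Python. The edge probabilities below are made-up placeholders rather than values from our projects; the point is simply that multiplying the three probabilities along a path gives the rate at which that type of pixel occurs, and that the eight rates add up to one.

```python
# A minimal sketch of the eight-path bookkeeping described above.
# All probabilities are made-up placeholders, purely for illustration.
from itertools import product

p_true = {"T+": 0.01, "T-": 0.99}                    # level 1: prevalence
p_label = {("T+", "L+"): 0.90, ("T+", "L-"): 0.10,   # level 2: labelling ambiguity
           ("T-", "L+"): 0.005, ("T-", "L-"): 0.995}
p_class = {("L+", "C+"): 0.91, ("L+", "C-"): 0.09,   # level 3: classifier behaviour
           ("L-", "C+"): 0.01, ("L-", "C-"): 0.99}

# Multiply the probabilities along each of the eight paths Pixel - T - L - C.
path_prob = {}
for t, l, c in product(["T+", "T-"], ["L+", "L-"], ["C+", "C-"]):
    path_prob[(t, l, c)] = p_true[t] * p_label[(t, l)] * p_class[(l, c)]

print(path_prob[("T+", "L+", "C+")])   # kelp, labelled correctly, classified correctly
assert abs(sum(path_prob.values()) - 1.0) < 1e-9   # the eight paths cover all pixels
```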

Here are intuitive descriptions of the probabilities needed in the diagram:

1. In the first level, Pixel — T, the probabilities are given by the prevalence of kelp. For example, if the region where the classification algorithm is deployed contains 1% kelp on average, then we put the probabilities 0.01 and 0.99 on the line segments leading from Pixel to T+ and to T-, respectively.

2. The probabilities in the second level, T — L, depend on the amount of ambiguity in the labelling task. The formula for those probabilities depends primarily on the rate of agreement between the labels assigned by two independent remote sensing experts.

3. Level 3, L — C, is where the ML model operates. The probabilities here are derived from the confusion matrix that the trained model produces on a test set (a hold-out subset of the labelled data).
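Putting the three levels together, a hedged sketch of the calculation might look as follows. The level 2 probabilities are passed in directly as assumed inputs here; in practice they would be derived from the expert rate of agreement discussed in the next section.

```python
# A sketch of the confidence computation: given that the model says kelp (C+),
# how likely is the pixel truly kelp (T+)? This simply sums the relevant paths
# of the diagram; all names and example values are illustrative.

def confidence_in_positive(prevalence, p_lpos_tpos, p_lpos_tneg, tpr, fpr):
    """P(T+ | C+) for the three-level tree Pixel - T - L - C.

    prevalence  : P(T+)        (level 1, prevalence at deployment)
    p_lpos_tpos : P(L+ | T+)   (level 2, labelling ambiguity)
    p_lpos_tneg : P(L+ | T-)   (level 2)
    tpr         : P(C+ | L+)   (level 3, model vs. labels on the test set)
    fpr         : P(C+ | L-)   (level 3, equal to 1 - specificity)
    """
    # P(T+ and C+): the two paths T+ - L+ - C+ and T+ - L- - C+
    p_tpos_cpos = prevalence * (p_lpos_tpos * tpr + (1 - p_lpos_tpos) * fpr)
    # P(T- and C+): the two paths T- - L+ - C+ and T- - L- - C+
    p_tneg_cpos = (1 - prevalence) * (p_lpos_tneg * tpr + (1 - p_lpos_tneg) * fpr)
    return p_tpos_cpos / (p_tpos_cpos + p_tneg_cpos)

# Illustrative call with the same placeholder values as in the previous sketch:
print(confidence_in_positive(0.01, 0.90, 0.005, 0.91, 0.01))
```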

Obtaining the rate of agreement

We now discuss how to find the rate of agreement that is required for the probabilities in the second level of the diagram. For this, let us introduce Alice and Bob — two remote sensing experts tasked with labelling a large set of satellite images; say, 25 Sentinel scenes. The Data Science team will then train and test an ML model on that labelled data.

As mentioned in the previous sections, the rate of agreement between independent experts is needed to assess the well-definedness or ambiguity of the task. Hence some images need to be labelled twice — once by Alice and once by Bob. Now, labelling satellite images is a laborious task and it would be inefficient to have all 25 scenes labelled twice. The size of the overlap should really be kept as small as possible. So, how about Alice and Bob label 13 scenes each, i.e. only one scene will be labelled by both? This causes some complications though. Consider the following figure:

Figure generated by the author using LaTeX

The left image contains one patch of kelp and the right image contains four patches. Alice’s labels are drawn in blue and Bob’s labels are drawn in red. On the left, the mismatch rate (that is, 1 minus the rate of agreement) is 2/100 = 2% and on the right, it is 8/100 = 8%. We have a significant difference here, even though the two analysts interpret the boundaries of kelp patches differently in exactly the same, consistent way in both images! That is, had we happened to choose the right-hand image for the comparison experiment, we would have obtained a mismatch rate four times that of the left-hand image! To deal with this issue, we would have to choose an image with an average amount of kelp in it, but this may not be practical or possible.

The solution is to focus on the coloured cells only (i.e. the pixels that were marked as kelp by at least one image analyst) and to extrapolate the obtained information on the amount of agreement to the full scenes. The main ingredient for doing this is the Sørensen-Dice coefficient (also known as the F1 score), and this approach leads to a more robust way of finding the rate of agreement between Alice and Bob.
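For illustration, here is a minimal sketch of that comparison, assuming Alice’s and Bob’s labels for a scene are available as boolean masks. Because the Sørensen-Dice coefficient only involves pixels marked as kelp by at least one analyst, the result does not change if the scene happens to contain a large amount of empty background.

```python
# A minimal sketch of the Sørensen-Dice coefficient between two binary label masks
# (Alice's and Bob's labels for the same scene). Names and toy values are illustrative.
import numpy as np

def dice_coefficient(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """2 * |A ∩ B| / (|A| + |B|) for boolean masks of the same shape."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * intersection / (mask_a.sum() + mask_b.sum())

# Toy example: Bob's kelp patch is slightly larger than Alice's.
alice = np.zeros((10, 10), dtype=bool); alice[2:5, 2:5] = True   # 9 kelp pixels
bob   = np.zeros((10, 10), dtype=bool); bob[2:5, 2:6] = True     # 12 kelp pixels
print(dice_coefficient(alice, bob))   # 2*9 / (9+12) ≈ 0.857, regardless of scene size
```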

This technique has allowed us to keep our labelling requests of the Remote Sensing team as low as possible and still obtain a stable measure of well-definedness for the classification task. The ideas described can also be applied to other situations such as comparing our own labels to local data sets. For example, we may compare our labels to a local data set in the Falklands and to another local data set in the Gulf region — the prevalence of kelp is quite different in those two regions and the above approach will automatically take this difference into account.

Example: clear-cut vs fuzzy

To illustrate the impact of the rate of agreement on the confidence statement, consider the following scenario.

1. The prevalence of an object of interest (OOI) in the region where a classification model is to be deployed is 10%.

2. Alice and Bob have labelled a data set that contains about 20% of the object of interest (i.e. the images that were labelled for training and testing the ML model contained twice the usual amount of the OOI).

3. The Data Science team has trained a model and found it to have a true positive rate of 0.91 (aka sensitivity: 91% of the pixels labelled as OOI were classified as OOI) and a true negative rate of 0.99 (aka specificity: 99% of the pixels labelled as not OOI were classified as not OOI).

Now suppose a pixel has been classified as the object of interest. What is the probability that it really is the object of interest?

If Alice and Bob’s labels agreed perfectly (100% agreement), the answer would be 91%.

If the rate of agreement between Alice and Bob was only 82%, the answer would be 56%!
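As a sanity check, the 91% figure can be reproduced with a few lines of arithmetic, under the assumption that perfect agreement lets us treat the labels as the truth, so the calculation collapses to an ordinary positive predictive value. Reproducing the 56% figure would additionally require the level 2 probabilities derived from the 82% agreement rate, which are not spelled out here.

```python
# Clear-cut case: perfect agreement, so labels can be treated as truth and the
# confidence is just the positive predictive value at the deployment prevalence.
prevalence = 0.10    # prevalence of the OOI where the model is deployed (point 1)
sensitivity = 0.91   # true positive rate against the labels (point 3)
specificity = 0.99   # true negative rate against the labels (point 3)

ppv = (prevalence * sensitivity) / (
    prevalence * sensitivity + (1 - prevalence) * (1 - specificity)
)
print(round(ppv, 2))   # 0.91, i.e. the 91% quoted above
```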

Benefits and outlook

The above ideas are rather technical, but they address business needs as well as analytical questions:

1. We can report the confidence we have in the ML model’s classifications to customers or stakeholders.

2. The mathematical formulas for the confidence can show the Data Science team how to optimise the ML model.

3. The methods from the section Obtaining the rate of agreement allow us to keep the number of scenes to be labelled by the Remote Sensing team as low as possible.

4. These methods further make it easier to compare to existing local data.

5. The impact that the prevalence at deployment (level Pixel — T in the diagram) has on confidence can be quantified more precisely now, and this informs the Data Engineering team’s choice of filtering rules when deploying the trained model.

In conclusion, the work presented here adds new value to our image segmentation products. Collaboration with the Remote Sensing team has been a great learning experience and there are interesting new endeavours on the horizon: we have recently been exploring different approaches to AI-assisted labelling, and the thoughts above may turn out to be quite relevant for developing this further. Overall, there’s some interesting work happening at the UKHO…

This is work in progress — please join the discussion and leave some feedback in the comments section below 😄
