Error analysis for object detection models

Bernat Puig Camps
Data Science at Microsoft
16 min read · Jun 28, 2022

A systematic, data-driven approach to understanding what hinders a model’s performance

Object detection example image containing several items on a desk labeled with bounding boxes.
Figure 1: Object detection example. Source: https://en.wikipedia.org/wiki/Object_detection.

As part of the Data & AI Service Line, my colleagues and I work with some of the largest Microsoft customers to solve some of their toughest problems. Among these projects, object detection engagements are particularly challenging, for the following reasons:

  • Lack of data is usually the limiting factor. People often spend a lot of time deliberating over design decisions such as selecting model architectures and tuning hyperparameters. However, we find again and again that, when solving problems in the real world, the easiest way to improve performance is to take a data-driven approach: identify the areas where the model struggles and collect additional data for them.
  • Dataset labels are often inconsistent. It is extremely hard to build (or find) high-quality object-detection datasets. Consider Figure 1 above: If you were to show this image to two different people and ask them to label the objects present (i.e., add boxes and labels), the results would almost certainly differ. There would likely be some differences in the placement and size of the bounding boxes and, depending on how challenging the image and the use case are, the annotators may introduce different boxes. Moreover, this labeling process is tedious, which makes it even easier to introduce errors or inconsistencies during long sessions.
  • Standard metrics are hard to interpret. Mean Average Precision (mAP), the go-to metric for assessing an object detector’s performance, is not intuitive and does not give the clear sense of how the model is performing that accuracy, precision, or recall give for classification problems. As such, it is of little help for detecting areas where the model performs poorly, much less for designing strategies to improve the situation.

In summary, we often have less than ideal datasets, a metric that is hard to interpret, and a lack of tools for identifying problems in the dataset. All these factors combined make it difficult to build an intuition about the problem at hand, and often make it unclear how to follow a systematic, iterative approach to improving model performance.

When looking for tools to address the issue, I came across the paper TIDE: A General Toolbox for Identifying Object Detection Errors, which introduces a methodology that I really like. The main idea is that, first, every bounding box predicted by the model is assigned to an error category (or considered correct). After that, the negative impact that each of these error categories has on mAP is calculated. This provides a measure of importance for the different error types that can help in focusing on the errors that hinder performance the most.

While the paper offers an implementation on GitHub, I find that it has two major drawbacks:

  1. It offers only the final mAP impact results, with no easy way to get the error classification for every single prediction. In my opinion, it is extremely valuable to be able to inspect the different categories and understand the kind of error each prediction represents. This information can help build a good intuition about the problem.
  2. There are some important details in the implementation that are not completely clear from the paper, and I found it tricky to understand them from the available code without intense debugging sessions.

In this article, I share my approach to error analysis for object detection — demonstrating how to leverage the predictions of a trained model to understand where the problems lie — by using and explaining my reimplementation of the TIDE paper. This implementation exposes the error classification of every single prediction and can be used to calculate the error impact on any metric, not just mAP; this is valuable when the business is interested in additional metrics beyond mAP, which is often the case.

IMPORTANT: All code shared in connection with this article is offered under the Apache 2.0 license.

What is error analysis?

Before going any further, it is important to clarify that error analysis and model evaluation are not the same. While evaluation consists of computing a single metric that summarizes whether a model is doing generally well, error analysis is the Machine Learning equivalent of debugging: inspecting the outputs of the model and comparing them to the ground truth to build an intuition about the problem. It requires you to go deep and understand both your data and your model. Many times, this involves looking at samples and predictions one by one.

Moreover, even if your model is performing well, there might be samples for which it consistently struggles — e.g., bad predictions affecting a minority scarcely present in your training set — and, for a real-world system, it may be important to understand whether these might be a problem once the model is deployed. Error analysis is the process that will help you understand this.


Now, let’s use an example to illustrate the systematic approach I usually follow when working on object detection problems.

The data

The first thing we need is data to use as an example. For that, we are going to use the MS COCO 2017 validation set, one of the most popular benchmark datasets for object detection. We first download it to our working folder, and then we load it and format it into convenient pandas DataFrames.

# Download images and annotations
!curl http://images.cocodataset.org/zips/val2017.zip --output coco_valid.zip
!curl http://images.cocodataset.org/annotations/annotations_trainval2017.zip --output coco_valid_anns.zip
# Unzip images into coco_val2017/images
!mkdir coco_val2017/
!unzip -q coco_valid.zip -d coco_val2017/
!mv -f coco_val2017/val2017 coco_val2017/images
# Unzip and keep only valid annotations as coco_val2017/annotations.json
!unzip -q coco_valid_anns.zip -d coco_val2017
!mv -f coco_val2017/annotations/instances_val2017.json coco_val2017/annotations.json
!rm -rf coco_val2017/annotations
# Remove zip files downloaded
!rm -f coco_valid.zip
!rm -f coco_valid_anns.zip
images_df, targets_df = load_dataset()
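
The load_dataset helper called above lives in the accompanying notebook. As a rough idea of what it does, here is a minimal sketch that parses the COCO annotation file into the two DataFrames; the column names (image_id, file_name, label, xmin, ymin, xmax, ymax) are my own assumption and are the ones used by the other sketches in this article.

import json
from pathlib import Path

import pandas as pd


def load_dataset(data_path="coco_val2017"):
    """Load the COCO annotation file into an images DataFrame and a targets DataFrame."""
    with open(Path(data_path) / "annotations.json") as f:
        coco = json.load(f)

    # Map category ids to human-readable names
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}

    images_df = pd.DataFrame(coco["images"])[["id", "file_name", "width", "height"]]
    images_df = images_df.rename(columns={"id": "image_id"})

    targets_df = pd.DataFrame(coco["annotations"])[["image_id", "category_id", "bbox"]]
    # COCO boxes are [x, y, width, height]; convert them to [xmin, ymin, xmax, ymax]
    boxes = pd.DataFrame(targets_df.pop("bbox").tolist(), columns=["xmin", "ymin", "w", "h"])
    targets_df[["xmin", "ymin"]] = boxes[["xmin", "ymin"]]
    targets_df["xmax"] = boxes["xmin"] + boxes["w"]
    targets_df["ymax"] = boxes["ymin"] + boxes["h"]
    targets_df = targets_df.rename(columns={"category_id": "label"})
    targets_df["label_name"] = targets_df["label"].map(id_to_name)
    return images_df, targets_df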

Now, let’s look at a couple of samples from the dataset to get an idea of what they look like.

Example images with their annotations of the MS COCO dataset.
Figure 2: Examples of samples in the dataset. Source: MS COCO dataset (https://cocodataset.org/).

The model

As mentioned before, we want to leverage the predictions of a trained model to understand its shortcomings. For convenience and simplicity, we will use a model pretrained on the COCO dataset. That way, we can skip training altogether (which is not the point of this article) and the model will simply work out of the box.

While there are many architectures out there, we will use a Faster-RCNN with a ResNet50 backbone. The main reasons for this choice are that this architecture tends to perform reasonably well and that a PyTorch implementation is readily available.

To do any error analysis, we first need the model’s predictions on our dataset. Additionally, we will also save the model’s loss for each sample (we will see why further on in the article). Let’s define a function that does this for us and saves the results in a pandas DataFrame very much like the one we created for the targets.

The PyTorch implementation of Faster-RCNN has a few idiosyncrasies that make it impossible to obtain losses and predictions at the exact same time. While the code is provided, the details are out of scope. The way that you acquire losses and predictions will vary depending on the model used.
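
The full get_predictions implementation is in the accompanying notebook. The sketch below shows the general idea under a few assumptions of mine: it reuses the column names from the earlier sketches, adds an image_loss column that the loss inspection below relies on, and works around the idiosyncrasy just mentioned by doing two forward passes per image (train mode to get the loss dict, eval mode to get the predictions).

import pandas as pd
import torch
import torchvision
from PIL import Image
from torchvision.transforms.functional import to_tensor


def get_predictions(images_path, images_df, targets_df, device="cuda"):
    """Run a COCO-pretrained Faster-RCNN over every image and collect its predictions
    (plus a per-image loss) into a DataFrame. Simplified sketch."""
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True).to(device)

    rows = []
    for _, image_row in images_df.iterrows():
        image_targets = targets_df[targets_df["image_id"] == image_row.image_id]
        if image_targets.empty:
            continue  # keep the sketch simple: skip the few images without annotations
        image = to_tensor(
            Image.open(f"{images_path}/{image_row.file_name}").convert("RGB")
        ).to(device)
        target = {
            "boxes": torch.as_tensor(
                image_targets[["xmin", "ymin", "xmax", "ymax"]].values, dtype=torch.float32
            ).to(device),
            "labels": torch.as_tensor(image_targets["label"].values).to(device),
        }

        with torch.no_grad():
            model.train()  # train mode: the model returns its loss dict
            loss = sum(model([image], [target]).values()).item()
            model.eval()   # eval mode: the model returns predictions
            pred = model([image])[0]

        for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
            xmin, ymin, xmax, ymax = box.tolist()
            rows.append({
                "image_id": image_row.image_id, "label": label.item(), "score": score.item(),
                "xmin": xmin, "ymin": ymin, "xmax": xmax, "ymax": ymax, "image_loss": loss,
            })
    return pd.DataFrame(rows)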

IMPORTANT: Running the following code on a CPU would take hours (if it even finishes); thus, I recommend you do it on a GPU machine, which will take a few minutes.

preds_df = get_predictions(images_path, images_df, targets_df)

Losses inspection

Before we delve deeper into the TIDE analysis, there is another tool that can also prove helpful for error analysis: model losses. Losses aim to measure how good or bad a prediction is. Thus, the highest losses point to the images the model struggles with the most, and we can visualize those images to try to understand what is going on. In fact, this approach is not unique to object detection: it works for any model that outputs a per-sample loss.

Before visualizing any images, it is valuable to inspect the distribution of losses. In general, we expect most of the images to have a relatively low loss and a few of them to present higher values. If that is not the case and all samples present roughly the same value, looking at the highest values will not be meaningful, as they will just be a result of small variations around the average.
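
A plot like the one in Figure 3 can be produced with a few lines of matplotlib, assuming the predictions DataFrame carries the per-image loss in an image_loss column (as in the get_predictions sketch above):

import matplotlib.pyplot as plt

# One loss value per image (the per-image loss is repeated on every prediction row)
image_losses = preds_df.groupby("image_id")["image_loss"].first()

plt.hist(image_losses, bins=50)
plt.xlabel("Loss")
plt.ylabel("Number of images")
plt.title("Loss distribution across images processed")
plt.show()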

Histogram of losses.
Figure 3: Loss distribution across images processed.

From the plot, we can confirm that indeed most images have a loss below 1 and the distribution is skewed to the right, with some samples having losses almost as high as 6 (hardly visible in the plot). Additionally, we see a big peak in the histogram close to 0 and a substantial number of samples with losses between 0.5 and 1. Therefore, let’s visualize the image with the highest loss, one with an average loss, and one sample from the peak near zero, to appreciate the differences. Note that I manually selected images from a similar domain to simplify the reasoning, but the dataset contains all kinds of domains.

Figure 4: Image with the highest loss. Source: MS COCO dataset (https://cocodataset.org).
Figure 5: Image with an average loss. Source: MS COCO dataset (https://cocodataset.org).
Figure 6: Image with a low loss. Source: MS COCO dataset (https://cocodataset.org).

For the image with the highest loss, we see two main problems:

  • There are a lot of birds, probably flamingos, far away in the picture. The targets for them are inconsistent: a tiny fraction of them are labeled individually, while all of them are also covered by one big box. The model fails to find any of them. In fact, a few of the images with very high losses present a similar situation: small birds far away that the model fails to find.
  • There is a kind of animal present that does not have a bounding box. This probably means that COCO does not have a category for it and, therefore, it is not labeled. Nevertheless, our model does identify it as an animal, and not having a better alternative, labels it with words like zebra, horse, or cow.

If our domain were identifying animals in the savannah, this picture may suggest that making sure birds are labeled consistently and that we have boxes for all possible animals could lead to improvements. Nevertheless, this is a hypothesis that should be further explored with more samples.

For the image with an average loss, we see that there are correct predictions for most or all targets. The problem comes from the extra boxes that should not be there. This is not specific to the example picked, but is generally true for most images with losses in the 0.5 to 1 range. For images like this one, a potential approach to improve performance would be to remove the extra boxes by dropping low-scoring predictions, performing box fusion, or applying Non-Max Suppression (NMS), as sketched below.
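
As a quick illustration of that last point, here is one way the filtering could look with torchvision, assuming the prediction columns used earlier. The thresholds are arbitrary, and note that torchvision’s pretrained Faster-RCNN already applies NMS internally, so in this particular example the score threshold is the part that would actually change anything.

import torch
from torchvision.ops import batched_nms


def postprocess(image_preds, score_threshold=0.5, iou_threshold=0.5):
    """Drop low-scoring boxes and apply class-wise NMS to one image's predictions."""
    kept = image_preds[image_preds["score"] >= score_threshold]

    boxes = torch.as_tensor(kept[["xmin", "ymin", "xmax", "ymax"]].values, dtype=torch.float32)
    scores = torch.as_tensor(kept["score"].values, dtype=torch.float32)
    labels = torch.as_tensor(kept["label"].values)

    keep_idx = batched_nms(boxes, scores, labels, iou_threshold)
    return kept.iloc[keep_idx.numpy()]


# Applied to the whole predictions DataFrame, one image at a time:
# filtered_preds_df = preds_df.groupby("image_id", group_keys=False).apply(postprocess)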

Finally, we see that for the low-loss image, the model did an almost perfect job. In fact, it detected an extra bird that was not labeled in the ground truth but is there!

While this is just a limited example, it is easy to see that the losses are already a great tool to formulate a hypothesis and direct our efforts toward the most problematic samples. On top of that, they usually provide valuable information about the problem, the model, and the dataset.

Error classification

Now, let’s finally look at how TIDE works and how we can leverage it for error analysis. While I strongly recommend that you read the paper for a deeper understanding, I aim to provide enough context here so that you can successfully leverage the tool in your project.

As mentioned above, TIDE either assigns each predicted bounding box to an error category or considers it correct. To do so, it needs a mechanism to try to match each prediction to the target (bounding box) it might be trying to predict. It attempts that matching by means of Intersection over Union (IoU). See the figure below for an illustration of how IoU is obtained for any two bounding boxes.

Illustration of Intersection over Union measure.
Figure 7: Intersection over Union. Source: https://en.wikipedia.org/wiki/Jaccard_index
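
In code, IoU for two axis-aligned boxes in [xmin, ymin, xmax, ymax] format is just a few lines; the helper below is reused by the error-classification sketch later on.

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as [xmin, ymin, xmax, ymax]."""
    # Intersection rectangle
    ixmin, iymin = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ixmax, iymax = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ixmax - ixmin) * max(0.0, iymax - iymin)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)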

Leveraging two IoU thresholds, the foreground threshold (Tf) and the background threshold (Tb), we can define the following error types (with a more detailed explanation in section 2.2 of the TIDE paper):

  1. Classification error (CLS): IoU >= Tf for target of the incorrect class (i.e., localized correctly but classified incorrectly).
  2. Localization error (LOC): Tb <= IoU < Tf for target of the correct class (i.e., classified correctly but localized incorrectly).
  3. Both Cls and Loc error (CLS & LOC): Tb <= IoU < Tf for target of the incorrect class (i.e., classified and localized incorrectly).
  4. Duplicate detection error (DUP): IoU >= Tf for target of the correct class but another higher-scoring detection already matched the target (i.e., would be correct if not for a higher scoring detection).
  5. Background error (BKG): IoU < Tb for all targets (i.e., detected background as foreground).
  6. Missed target error (MISS): All undetected targets (false negatives) not already covered by classification or localization error.

The following image (extracted from the paper) illustrates the different kinds of errors:

Figure 8: Error type definitions. We define six error types, illustrated in the top row, where box colors are defined as: red = false positive detection, yellow = ground truth, green = true positive detection. The IoU with ground truth for each error type is indicated by an orange highlight and shown in the bottom row. The duplicate error example contains all colors: green, yellow, and red, from top to bottom boxes in that order. Source: TIDE: A General Toolbox for Identifying Object Detection Errors.

There is one detail that I found is not clear from reading the paper: The algorithm assumes that the model is doing the right thing as much as possible. To that end, it tries to match predictions to targets with the same label first. This means that, for instance, a prediction will be matched to a same-class target as a localization error before it is matched to a different-class target as a classification error, even if its IoU with the LOC target is lower.

Let’s define some functions to help us classify these error types, based on the predictions and annotations in our DataFrames.
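
The full classify_predictions_errors implementation (with its unit tests) is in the accompanying notebook. The sketch below captures the main logic with thresholds Tb = 0.1 and Tf = 0.5, but it deliberately glosses over some corner cases (for example, it always matches a prediction against its highest-IoU target rather than searching for the best target that is still unmatched). It relies on the iou helper defined above, and the category names are my own convention.

import pandas as pd

# Error categories (the names are this sketch's own convention)
OK, CLS, LOC, CLS_LOC, DUP, BKG, MISS = (
    "correct", "classification", "localization",
    "classification & localization", "duplicate", "background", "missed",
)


def classify_predictions_errors(targets_df, preds_df, iou_background=0.1, iou_foreground=0.5):
    """Assign every prediction to an error category (or OK) and every unmatched target to MISS."""
    records = []
    image_ids = set(targets_df["image_id"]) | set(preds_df["image_id"])

    for image_id in image_ids:
        image_preds = preds_df[preds_df["image_id"] == image_id]
        image_targets = targets_df[targets_df["image_id"] == image_id]
        matched = set()  # targets already claimed by a correct (OK) prediction
        covered = set()  # targets involved in an OK, CLS or LOC match (exempt from MISS)

        # Highest-scoring predictions get to claim targets first
        for pred_id, pred in image_preds.sort_values("score", ascending=False).iterrows():
            pred_box = pred[["xmin", "ymin", "xmax", "ymax"]].values
            error, target_id = BKG, None

            if not image_targets.empty:
                ious = image_targets.apply(
                    lambda t: iou(pred_box, t[["xmin", "ymin", "xmax", "ymax"]].values), axis=1
                )
                same = image_targets["label"] == pred["label"]

                # Match same-class targets first: assume the model is doing the right thing
                if (ious[same] >= iou_foreground).any():
                    target_id = ious[same].idxmax()
                    error = OK if target_id not in matched else DUP
                elif (ious[same] >= iou_background).any():
                    target_id, error = ious[same].idxmax(), LOC
                elif (ious[~same] >= iou_foreground).any():
                    target_id, error = ious[~same].idxmax(), CLS
                elif (ious >= iou_background).any():
                    error = CLS_LOC  # too ambiguous to attribute to a single target

            if error == OK:
                matched.add(target_id)
            if error in (OK, CLS, LOC):
                covered.add(target_id)
            records.append(
                {"pred_id": pred_id, "image_id": image_id, "target_id": target_id, "error": error}
            )

        # Targets never involved in an OK, CLS or LOC match count as missed
        for target_id in image_targets.index:
            if target_id not in covered:
                records.append(
                    {"pred_id": None, "image_id": image_id, "target_id": target_id, "error": MISS}
                )

    return pd.DataFrame(records)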

Now, let’s use the classify_predictions_errors function to understand the types of errors that our model is making:

errors_df = classify_predictions_errors(targets_df, preds_df)
Quantity of errors in each category.

While the number of errors of each type can be informative, it does not tell us the full picture. Not all errors affect the metrics we care about in the same way. For some problems, having a lot of background predictions can be irrelevant because false positives are not costly. In others (e.g., identifying tumors in medical imaging), they could be a massive problem. The classification of errors is important because it allows us to inspect the predictions that represent a particular error and try to understand why it is happening. Nevertheless, the quantity of errors per category is usually not sufficient to get an intuition of where the main problems lie for our use case.

Error impact

In any real-world scenario, there is a metric or set of metrics for which we want the model to perform well. Ideally, these metrics are aligned with the project’s goals and are a good summary of how successful the model is in doing the task at hand. In the previous section, we found the absolute count of different types of errors. How each of these types of errors affects our performance assessment will depend heavily on the metric at play. Therefore, we are interested in finding the error type with the greatest impact on our goals so that we can direct our efforts accordingly.

The intuition is the same one that is introduced in the TIDE paper: We can calculate the metric with the predictions of our model. Then, we can fix (i.e., correct for) one type of error at a time and recompute the metric to see what it would have been if the model did not commit that kind of error. Finally, we define the impact of each kind of error as the difference between the metric value after fixing it and the original value. This gives us a quantifiable result on how strongly our metric of interest is penalized by each type of error.

To do that, we need to define what “fixing an error” means for each type. Once again, we simply use the methodology introduced in the TIDE paper (which calls the fixes “oracles”). The explanations below are more detailed than the ones in the paper, with caveats that can only be found deep in the paper’s implementation.

  1. CLS fix: Correct the label of the detection to the correct one. The correction is applied only to predictions that represent a CLS error, are the highest-scoring prediction matching a target (IoU >= Tb), and whose matched target has no correct (OK) prediction. All CLS predictions not fulfilling these conditions are dropped.
  2. LOC fix: Correct the bounding box of the detection to match that of the matched target. The correction is applied only to predictions that represent a LOC error, are the highest-scoring prediction matching a target (IoU >= Tb), and whose matched target has no correct (OK) prediction. All LOC predictions not fulfilling these conditions are dropped.
  3. CLS & LOC fix: Because we cannot be completely sure of what target the detector was attempting to match, we just drop the prediction.
  4. DUP fix: Drop the duplicate detection.
  5. BKG fix: Drop the hallucinated detection.
  6. MISS fix: Drop the missed target.

It is important to note that all the fixes stated above are non-overlapping: they are defined in this particular manner so that the corrections do not conflict, and each prediction can (and will) be corrected in one and only one way. This is, for instance, why for CLS and LOC errors only the highest-scoring predictions are fixed, and only under certain conditions. Consequently, if all the fixes are applied simultaneously, the result is always a perfect metric, because every target is matched once and only once. Despite that, adding up the individual impacts and the original metric value is not guaranteed to yield the perfect score for the metric.

Now we will see an example of how this looks in practice. For this example, we are going to use the mean average precision (mAP) metric, as it is the standard go-to metric for object detection problems. If you are not familiar with mAP, I recommend that you watch this video starting at minute 37. It is one of the best explanations of the metric I have seen.

Let’s define some code to help us compute this metric. For this, we shall use the torchmetrics implementation, with some additional processing to convert our predictions and targets from our DataFrames into the format that torchmetrics requires.
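
The notebook contains the full wrapper; a minimal sketch of the conversion, assuming the DataFrame columns used throughout this article and mAP at a single IoU threshold of 50%, could look like this:

import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision


def to_torchmetrics(df, with_scores):
    """Convert one image's rows from our DataFrames into the dict torchmetrics expects."""
    out = {
        "boxes": torch.as_tensor(df[["xmin", "ymin", "xmax", "ymax"]].values, dtype=torch.float32),
        "labels": torch.as_tensor(df["label"].values),
    }
    if with_scores:
        out["scores"] = torch.as_tensor(df["score"].values, dtype=torch.float32)
    return out


def mean_average_precision(targets_df, preds_df, iou_threshold=0.5):
    """Compute mAP at a single IoU threshold (mAP@50 by default) from our DataFrames."""
    metric = MeanAveragePrecision(iou_thresholds=[iou_threshold])
    image_ids = sorted(set(targets_df["image_id"]) | set(preds_df["image_id"]))
    preds = [to_torchmetrics(preds_df[preds_df["image_id"] == i], with_scores=True) for i in image_ids]
    targets = [to_torchmetrics(targets_df[targets_df["image_id"] == i], with_scores=False) for i in image_ids]
    metric.update(preds, targets)
    return metric.compute()["map"].item()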

Now, let’s define some functions to measure the impact that our errors have on our metric. Because the metric can be any callable, this implementation is not coupled to the MeanAveragePrecision-based implementation defined above.
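
Again as a sketch only: the version below applies the simpler fixes exactly as described (dropping predictions or targets) but approximates the CLS and LOC fixes by copying the label or box from the matched target without checking the extra conditions listed earlier. It assumes the errors_df, the error-type constants, and the metric callable from the sketches above.

def calculate_error_impact(metric, targets_df, preds_df, errors_df):
    """For each error type, apply its 'fix' and report how much the metric improves."""
    baseline = metric(targets_df, preds_df)
    impact = {"baseline": baseline}

    for error_type in (CLS, LOC, CLS_LOC, DUP, BKG, MISS):
        errors = errors_df[errors_df["error"] == error_type]
        fixed_preds, fixed_targets = preds_df.copy(), targets_df.copy()

        if error_type == MISS:
            # MISS fix: drop the targets the model never found
            fixed_targets = fixed_targets.drop(index=errors["target_id"].dropna())
        elif error_type in (CLS, LOC):
            # Approximate CLS/LOC fixes: copy the label (CLS) or the box (LOC) from the target
            cols = ["label"] if error_type == CLS else ["xmin", "ymin", "xmax", "ymax"]
            for _, err in errors.dropna(subset=["target_id"]).iterrows():
                fixed_preds.loc[err.pred_id, cols] = targets_df.loc[err.target_id, cols].values
        else:
            # CLS&LOC, DUP and BKG fixes: drop the offending predictions
            fixed_preds = fixed_preds.drop(index=errors["pred_id"].dropna())

        impact[error_type] = metric(fixed_targets, fixed_preds) - baseline
    return impact

Called as calculate_error_impact(mean_average_precision, targets_df, preds_df, errors_df), it returns the baseline metric value plus one impact number per error type, in the spirit of what Figure 9 shows.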

Now, let’s use the calculate_error_impact function to understand the effect that our errors are having on our mAP.

Bar plot of the impact each error has on the metric.
Figure 9: Mean Average Precision value for IoU of 50% (mAP@50) and the impact of each error type on it.

Unsurprisingly, the base value of the metric is quite high. This is expected because the model was specifically trained to perform well on this validation set. While most error types except duplicates contribute a little to the drop in performance, missed targets and background predictions have the greatest impact.

Recalling the images we inspected in the losses section, these results may reinforce the hypotheses formulated before:

  • There are missing labels that the model correctly detects (e.g., the bird drinking with the giraffes) but are penalized as background errors.
  • There are objects that are not labeled because the dataset has no category for them (e.g., the animal grazing with the zebras) but that look like other objects that do have a category; these are also penalized as background errors.
  • There may be some inconsistent labels for tough detections (e.g., flamingos or other tiny birds) that the model fails to detect and are classified as missed errors.

Now the next step would be to explore these hypotheses further to either confirm or discard them. We could look at images that contain predictions classified as background errors and see whether there are missing labels in the ground truth. If that were the case, we could fix those by adding the missing boxes and re-evaluating. Hopefully, our mAP would increase and the background error contribution would decrease. Note that in this example the problem would lie in the data and not the model. That is, the model was doing a better job than the metric was telling us, but we could not have known that without a more thorough analysis.

As you can see, by means of error analysis we quickly devised a few hypotheses about what might be limiting the performance of our model, and with these hypotheses it is easier to devise potential strategies for improvement.

Conclusion

Here we have explored how to leverage error analysis for object detection problems. It is important to note that this is an iterative process: you start somewhere, fix the most pressing issue, then the next one, and so on until your solution meets the required criteria. In most if not all cases, this kind of approach is a much better recipe for success than endless parameter tweaking. On top of that, it is a great tool to build intuition and understand the problem at hand at a deeper level. This, on its own, is especially important for systems that end up in the real world.

Finally, the underlying ideas are not unique to object detection. In most Machine Learning situations, there are ways to leverage the predictions of a trained model to direct the efforts toward a satisfactory outcome. I encourage you to think in these terms in the next problem you face, and I hope you will find what I have written here useful too!

Notes

  • If you wish to use the code presented, you will also find a set of unit tests for the functions introduced here.
  • An interactive version of this article as a Jupyter notebook can be found here.

Bernat Puig Camps is on LinkedIn.
