How to Improve Object Detection Evaluation

Carlos Uziel Perez Malla
Published in Moonvision
5 min read · Apr 3, 2019

Evaluating statistical models after they have been fitted to some data is crucial to test their generalization capability. A model that generalizes well performs on unseen data as well as it does on the data it was fitted to, which makes its predictions useful in practice.

The usual practice when evaluating a model is to split the available labelled data (pairs of inputs and expected outputs) into three sets: train, eval and test. The train set is used to train (create) the model, the eval set is used to evaluate the model during training, and the test set serves as the final evaluation once the model has finished training. Although the reason for having eval and test sets is essentially the same, a separate test set is useful when the results on the eval set are used to make changes to the model (in order to improve those results). This way, even if the model has been fine-tuned to obtain the best performance possible on the eval set, the test set remains completely independent and therefore provides the final evaluation assessment.
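To make the split concrete, here is a minimal sketch of such a three-way split. The 70/15/15 ratios and the helper name are purely illustrative assumptions, not the proportions we necessarily use.

```python
import random

def split_dataset(samples, train_frac=0.7, eval_frac=0.15, seed=42):
    """Shuffle labelled samples once and slice them into train/eval/test sets."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train = int(n * train_frac)
    n_eval = int(n * eval_frac)
    return (samples[:n_train],                  # used to fit the model
            samples[n_train:n_train + n_eval],  # used to tune the model during training
            samples[n_train + n_eval:])         # used only for the final assessment

train_set, eval_set, test_set = split_dataset(range(100))
```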

This article focuses on evaluating object detection models.

Current object detection evaluation standards

Object detection challenges, such as PASCAL VOC, Google Open Images and COCO, serve as a reference on how to properly evaluate an object detection model, since their participants have access to their evaluation methods. This is especially useful for researchers, who can easily benchmark their new contributions against others’ by using the same evaluation framework. If you want to know more details about these challenges and how they perform evaluation, follow the links or visit this great repository.

In general, all of these evaluation methods offer evaluation metrics (i.e. different ways of measuring performance) at a global level. They return a metric (or a set of them) that describes how well the model did on the entire data set, usually based on Average Precision (AP). While having a global metric makes benchmarking much simpler, it does not provide enough insight into how good the detections were in each image, which is critical in a production environment. A more detailed evaluation would answer questions such as “Does the model perform substantially worse on images where lighting conditions change?”, “Are there incorrectly labelled images?” and so on.

At MoonVision, we developed an extended evaluation method that allows us to get these detailed insights, while at the same time keeping compatibility with VOC 2012 for benchmarking purposes.

Extended evaluation for object detection

To get reliable and useful insights we need to process the detections to know how many True Positives (TP), False Positives (FP) and False Negatives (FN) appear in every image, and for that we need to compute per-image confusion matrices. In a classification task, the confusion matrix is easily built by placing each prediction in the row of the ground truth (the original class) and the column of the predicted class. However, it is not trivial to construct one for the object detection problem. For example, where do we place a prediction that matches the ground truth label but whose overlap, measured as Intersection over Union (IoU), is below the established threshold? If we only looked at the labels it would be considered a TP, yet since the overlap is not enough it has to be considered a FP. We need a way of dealing with this.
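For reference, this is roughly how the IoU check looks. The corner-coordinate box format [x1, y1, x2, y2] and the 0.5 threshold mentioned in the comment are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction with the correct label only counts as a TP if
# iou(prediction_box, ground_truth_box) >= threshold (e.g. 0.5).
```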

We can solve this conflict by using an additional class, namely “nothing” or “nothingness”. This smart trick was, to our knowledge, first introduced by Santiago L. Valdarrama in his fantastic post titled Confusion Matrix in Object Detection with TensorFlow. Here we use the “nothingness” class to account for both FP (last row) and FN (last column). Therefore, predictions that have a correct label but not enough IoU go into the last row, whereas predictions that have an incorrect label go into the corresponding [ground truth, predicted] index of the matrix. Also note that all ground truth objects that were not detected (either because no prediction matched them or because the ones that did had a low IoU) go into the last column of the matrix. Consequently, it is important to understand that while a TP represents a single entry on the diagonal of the matrix, a badly matched prediction involves two entries: a FP for the prediction itself and a FN for the ground truth that was left undetected as a consequence of the bad match. In any case, the sum of TP and FN must always equal the number of ground truth objects present (in a given image or in the entire data set). See an example in Figure 1.
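To illustrate the idea, here is a rough sketch of how such a per-image matrix could be built. The greedy matching loop, the (label_index, box) tuple format and the 0.5 IoU threshold are assumptions for illustration (real implementations typically match predictions in order of confidence); iou() is the helper sketched above.

```python
import numpy as np

def image_confusion_matrix(gts, preds, num_classes, iou_thr=0.5):
    """Extended confusion matrix for one image; the extra last row/column is "nothingness"."""
    cm = np.zeros((num_classes + 1, num_classes + 1), dtype=int)
    matched_gts = set()
    for p_label, p_box in preds:
        # Find the best-overlapping, not-yet-matched ground truth box.
        best_iou, best_idx = 0.0, None
        for i, (g_label, g_box) in enumerate(gts):
            if i in matched_gts:
                continue
            overlap = iou(p_box, g_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, i
        if best_idx is not None and best_iou >= iou_thr:
            g_label = gts[best_idx][0]
            cm[g_label, p_label] += 1        # TP on the diagonal, label confusion off it
            matched_gts.add(best_idx)
        else:
            cm[num_classes, p_label] += 1    # FP: no ground truth matched with enough IoU
    for i, (g_label, _) in enumerate(gts):
        if i not in matched_gts:
            cm[g_label, num_classes] += 1    # FN: ground truth that was never detected
    return cm
```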

Figure 1. Example of confusion matrices per image and the final aggregated confusion matrix using our extended evaluation on the toy data set from this repository. In this case, we only have one class (person), which has TP=7, FN=8 and FP=17 for the entire dataset.

Once we are able to compute the confusion matrices per image, we can obtain the number of TP, FP and FN for each class (and the total over all classes) in those images to gain the insights discussed before. For example, we can sort the images in descending order by their total number of FP or FN and render them on Tensorboard to find out in which images the model performs worst. In fact, in one of our projects this helped us find images that had not been correctly labelled (which was the reason why the objects were never detected). Additionally, having a way of getting the images with the worst detections also gives us potential candidates for active learning. Naturally, all the per-image confusion matrices can be aggregated into a single confusion matrix for the whole dataset.
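As a sketch of how those insights fall out of the per-image matrices (the dict-of-matrices layout and the helper names are assumptions carried over from the previous sketch):

```python
def errors_per_image(cm, num_classes):
    """Count FP (last row) and FN (last column) entries of one extended confusion matrix."""
    fp = int(cm[num_classes, :num_classes].sum())
    fn = int(cm[:num_classes, num_classes].sum())
    return fp + fn

def worst_images(per_image_cms, num_classes):
    """Image ids sorted by descending FP + FN count, e.g. for rendering on Tensorboard."""
    return sorted(per_image_cms,
                  key=lambda img_id: errors_per_image(per_image_cms[img_id], num_classes),
                  reverse=True)

def aggregate(per_image_cms):
    """Element-wise sum of all per-image matrices into one dataset-level matrix."""
    return sum(per_image_cms.values())
```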

On the other hand, as mentioned earlier, we keep compatibility with VOC 2012 by computing the precision-recall curve of all detections and computing the AP per class (using all-points interpolation). As a small addition, we annotate the detection confidence values along the curve, which allows us to choose the right confidence threshold to filter out detections.
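For completeness, here is a sketch of per-class AP with all-points interpolation as defined for VOC 2012. The input format (parallel arrays of detection confidences and TP flags for one class, plus the number of ground truth boxes of that class) is an assumption for illustration.

```python
import numpy as np

def average_precision(scores, tp_flags, num_gt):
    """All-points interpolated AP for one class from its detections."""
    order = np.argsort(scores)[::-1]                 # sort detections by confidence, descending
    tp = np.asarray(tp_flags, dtype=float)[order]
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / num_gt
    precision = tp_cum / (tp_cum + fp_cum)
    # All-points interpolation: make precision monotonically decreasing,
    # then integrate over every recall step.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))
```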

Conclusion

A production environment, where nearly perfect results are mandatory, requires much more robust solutions than an academic one. For that reason, we at MoonVision have developed this extended object detection evaluation, which gives us better insights during our development cycles and thus makes it possible to excel at what we do.

Check out what we do at https://www.moonvision.io/ and try our platform at app.moonvision.io/signup.
