How to evaluate the performance of an object detection neural network using mAP

Aug 27, 2021

By Ridha Moosa, Head of Operations at Enlabeler

This is a two-part article in which we investigate the mAP method of evaluating an object detection model. Part 1 dives into the metrics in question, with some light mathematics and classification evaluation methods. Part 2 will investigate a real-world example of how the Enlabeler team built a testing platform to assess how well our labelers can annotate an image.

Part 1: Introduction to the metrics

Picture this. You spent hours gathering and cleaning your images to be fed into your well-layered neural network; hours and endless cloud credits to train your model to be the next object detection masterpiece that would put autonomous cars to shame. Only to learn that it doesn’t do object detection at the level you expect it to. It might not have a big or tight enough bounding box, or it might be picking up muffins when you’re actually looking for your dog’s face.

There are many ways to evaluate the performance of your object detection model. In this article, we will explore the mAP (mean average precision) metric as implemented in this GitHub repository by Cartucho. First we’ll look at each of the metrics used; in part 2, we will look at a quick implementation and a use case performed to test data labeling engineers at Enlabeler.

Let’s investigate where the metric works best and how it is put together.

Imagine dogecoin becomes a physical currency and you have built a model to detect if the image is a whole dogecoin or picks up on the face of the legendary doge, a Shiba Inu. Your model might perform in three different ways:

The evaluation is built around the Intersection over Union (IoU) of two boxes, which is the ratio of their area of overlap to their area of union. A higher IoU signifies better performance. Ideally you’d want your model’s prediction (red box) to fit the ground truth (green box).

In the image above, we want the red box to match what we recognise as the “ground truth” as closely as possible.

IoU is calculated as follows:

$$\text{IoU} = \frac{\text{area of overlap}}{\text{area of union}}$$
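As a minimal sketch of the overlap-to-union ratio described above, the function below computes IoU for two axis-aligned boxes. The `(x_min, y_min, x_max, y_max)` box format, the function name `iou` and the example coordinates are assumptions made for illustration, not something taken from the repository.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    # Union = sum of both areas minus the part counted twice.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# Hypothetical boxes: the prediction covers the right half of the ground truth
# and extends the same distance beyond it.
ground_truth = (0, 0, 10, 10)
prediction = (5, 0, 15, 10)
print(f"IoU = {iou(ground_truth, prediction):.3f}")
```

A perfect match gives an IoU of 1, and two boxes that do not touch give 0.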

But how would a machine learning engineer know what’s a good bounding box? They will set ground truths and determine their own personal threshold for what’s a tight box. A rule of thumb is as follows:

  • Negative: IoU ≤ Threshold
  • Positive: IoU > Threshold

Here, a positive result means that the object was detected correctly, and a negative result means that it was not.
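The rule of thumb above could be expressed as a small helper. This is only a sketch: the function name `classify_detection` is hypothetical, and the default threshold of 0.5 is an assumption (it is a commonly used value, but you would pick your own).

```python
def classify_detection(iou_value, threshold=0.5):
    """Apply the rule of thumb: positive if IoU > threshold, otherwise negative."""
    return "positive" if iou_value > threshold else "negative"


# Hypothetical IoU values for three predicted boxes.
for value in (0.30, 0.50, 0.85):
    print(value, classify_detection(value))
```

Note that an IoU exactly equal to the threshold counts as negative, matching the "≤" in the rule.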

Great! So we know how to calculate the IoU for an image. To measure how robustly a model detects objects, and hence how well it generalises, a measure of central tendency is needed; in this case, we use the mean. More specifically, we will use the mAP, or mean average precision. In practice, you would probably have a whole bunch of images, sometimes millions, with many variations. In order to calculate and interpret the mAP, we should revisit the concepts of precision and recall used with classification models.

Think of precision as the proportion of the model’s positive predictions that are actually correct. Say we have 10 fraud cases in 10,000 transactions; of all the transactions the model flags as fraud, we would like to know what proportion really are fraud, taking into account that the model also flags some legitimate transactions as fraud (false positives).

Recall, on the other hand, accounts for all the positives correctly identified as well as the ones that are missed. An example where recall is important would be fraud detection, where out of 10,000 transactions you might only have 1 fraud. So although your model may be 99.9% accurate, you are more concerned with finding that 1 case that could potentially have been missed. A model that finds most of the actual positives is considered to have high sensitivity. So the ratio of true positives to all actual positives (true positives plus false negatives) is what matters here.
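To make the two definitions concrete, here is a small sketch using the standard formulas, precision = TP / (TP + FP) and recall = TP / (TP + FN). The counts are hypothetical: they assume 10 real fraud cases, of which the model flags 8, plus 4 legitimate transactions flagged by mistake.

```python
def precision(tp, fp):
    """Share of positive predictions that are correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp, fn):
    """Share of actual positives that were found: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


# Hypothetical fraud example: 10 real fraud cases in the data.
true_positives = 8   # fraud cases the model flagged
false_positives = 4  # legitimate transactions flagged as fraud
false_negatives = 2  # fraud cases the model missed

print(f"precision = {precision(true_positives, false_positives):.2f}")
print(f"recall    = {recall(true_positives, false_negatives):.2f}")
```

In this made-up case the model finds most of the fraud (high recall) but a third of its alerts are false alarms (lower precision).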

So now that we have refreshed our memory of precision and recall, let’s look at AP, or average precision.

AP summarises the entire precision-recall curve into a single number for easy readability. It is calculated separately for each class in our dataset. An easy formula to calculate AP is as follows:
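One common way to turn a precision-recall curve into a single AP value is to take the area under the curve after making precision non-increasing as recall grows (the "all-point interpolation" popularised by the PASCAL VOC challenge). The sketch below assumes that formulation; the exact formula used elsewhere may differ, and the function names and the example detections are hypothetical.

```python
def precision_recall_points(is_true_positive, num_ground_truths):
    """Precision and recall after each detection, with detections sorted by confidence."""
    tp = fp = 0
    recalls, precisions = [], []
    for hit in is_true_positive:
        if hit:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truths)
    return recalls, precisions


def average_precision(recalls, precisions):
    """Area under the interpolated precision-recall curve (assumed formulation)."""
    # Sentinel points so the curve spans recall 0 to 1.
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]

    # Sweep right to left so precision never increases as recall increases.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])

    # Sum the rectangles between consecutive recall values.
    return sum((r[i] - r[i - 1]) * p[i] for i in range(1, len(r)))


# Hypothetical class with 2 ground-truth objects and 3 detections,
# ordered from most to least confident: hit, miss, hit.
recalls, precisions = precision_recall_points([True, False, True], num_ground_truths=2)
print(f"AP = {average_precision(recalls, precisions):.3f}")
```

Each detection is marked as a hit or a miss using the IoU threshold, so changing the threshold changes the curve and therefore the AP.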

mAP is the mean of the AP values calculated for each class. It helps to assess how well the neural network performs across all classes, giving a single metric for how the model performs over many threshold points. The equation to calculate it is as follows:

$$\text{mAP} = \frac{1}{n} \sum_{k=1}^{n} AP_k$$

AP_k = the AP of class k

n = number of classes
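With n classes and one AP value per class, mAP is simply their arithmetic mean. A minimal sketch follows, using made-up AP values for the two classes from the doge example; the class names and numbers are purely illustrative.

```python
def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values: (1 / n) * sum of AP_k over all n classes."""
    if not ap_per_class:
        raise ValueError("at least one class is required")
    return sum(ap_per_class.values()) / len(ap_per_class)


# Hypothetical per-class AP values.
ap_per_class = {"dogecoin": 0.9, "doge_face": 0.7}
print(f"mAP = {mean_average_precision(ap_per_class):.2f}")
```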

This generalisation makes mAP a useful single number for judging how well a model performs in object detection. In the next article on this topic (part 2), we will look at the implementation of the code and a use case.
