Seeing Clearly: Demystifying Object Detection Performance Metrics

Vikash Kumar Thakur
14 min read · Feb 21, 2024


This is the sixth and second-to-last installment of our series. Check out the whole series here: Series Introductory Blog.

Introduction

Performance metrics are the key tools for evaluating the accuracy and efficiency of object detection models. They shed light on how effectively a model can identify and localize objects within images. In this guide, we will explore various performance metrics, their significance, and how to interpret them, giving us insights for model evaluation, comparison, and improvement.

Why Evaluation Metrics?

One of the mistakes a beginner can make is not evaluating their model after building it, that is, not knowing how effective and efficient the model is before deploying it. That can be disastrous.

An evaluation metric measures the performance of a model after training. You build a model, get feedback from the metric, and make improvements until you get the accuracy you want.

Choosing the right metric is also key when evaluating a model, because which metric suits us depends on our requirements. A single metric might not be the best option; sometimes the best result comes from a combination of different metrics.

Remember that metrics aren't the same as loss functions. A loss function measures model performance during training, whereas metrics are used to judge and measure model performance after training.

Performance Metrics for Classification

Confusion Matrix

It is a table that is often used to describe the performance of a classification model; it establishes the relationship between actual values and predicted values. The confusion matrix itself is not a metric, but it is significant because many metrics are calculated on the basis of it.

Now let's interpret the confusion matrix.

Actual values are based on the ground truth, i.e. the real labels, whereas predicted values come from the model's predictions. The confusion matrix helps us establish the relationship between these two sets of values.

As you can see, there are four terms in it. Let's understand them:

  • True Positive (TP): These are instances where the object detection model correctly identifies and localizes objects, and the Intersection over Union (IOU) score between the predicted bounding box and the ground truth bounding box is equal to or greater than a specified threshold.
  • False Positive (FP): These are cases where the model incorrectly identifies an object that does not exist in the ground truth or where the predicted bounding box has an IOU score below the defined threshold.
  • False Negative (FN): FN represents instances where the model fails to detect an object that is present in the ground truth. In other words, the model misses these objects.
  • True Negative (TN): Not applicable in object detection. It represents correctly rejecting the absence of objects, but in object detection, the goal is to detect objects rather than the absence of objects.
Thanks to Manal El Aidouni for this excellent figure.

Now, let’s move on to the core metrics:

1. Accuracy

Accuracy measures the proportion of correct predictions (both true positives and true negatives) among all predictions made by the model: Accuracy = (TP + TN) / (TP + TN + FP + FN). In other words, it is the percentage of correctly classified samples out of all predictions made by the model.

Why do we need precision and recall when we already have the accuracy metric?

Accuracy is useful when the target classes are well balanced, but it is not a good choice for imbalanced classes. Imagine a scenario where our training data contains 99 images of dogs and only 1 image of a cat. A model that always predicts the dog would then score 99% accuracy. In reality, the data is heavily imbalanced, the model does not generalize to unseen data, and the accuracy metric is not showing us the real picture. So there is a need for other metrics, such as precision and recall.
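As a quick illustration of that pitfall, here is a toy example (not part of the project) using scikit-learn: the lazy model scores 99% accuracy while its recall for the cat class is zero.

from sklearn.metrics import accuracy_score, recall_score

# Toy imbalanced dataset: 99 dogs and 1 cat
y_true = ['dog'] * 99 + ['cat']
y_pred = ['dog'] * 100                                 # a lazy model that always predicts "dog"

print(accuracy_score(y_true, y_pred))                  # 0.99 -> looks great
print(recall_score(y_true, y_pred, pos_label='cat'))   # 0.0  -> it never finds the single cat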

For Imbalanced datasets:

1. Precision:

Precision measures the proportion of true positive predictions among all positive predictions made by the model: Precision = TP / (TP + FP).

  • Example: In a spam detection system, precision indicates the percentage of correctly identified spam emails out of all emails classified as spam.
  • There are 4 cases in spam classification:
  • Case 1: The mail is spam and the prediction is that it is spam (TP). Here the model's prediction is good.
  • Case 2: The mail is spam and the prediction is that it is not spam (FN). Here, even though the model's prediction is wrong, it won't cause major issues for the user.
  • Case 3: The mail is not spam, but the prediction is that it is spam (FP). Here the model's prediction is wrong, and it will cause issues for the user, because important mails can end up classified as spam. So here we need to decrease FP, which is why the precision metric is used.
  • Case 4: The mail is not spam and the prediction is that it is not spam (TN).
  • Precision tells us, out of all the positive predictions, how many are correctly predicted.
  • In precision, our focus is to increase TP and decrease FP.

2. Recall (Sensitivity):

Recall measures the proportion of true positive predictions among all actual positive instances in the dataset: Recall = TP / (TP + FN).

  • Example: In medical diagnosis, recall indicates the percentage of correctly identified patients with a specific disease out of all patients who actually have the disease. Let's say the disease is cancer here.
  • There are 4 cases for whether a person has cancer or not:
  • Case 1: The person has cancer and the prediction is that they have cancer (TP).
  • Case 2: The person does not have cancer and the prediction is that they do not have cancer (TN).
  • Case 3: The person does not have cancer but the prediction is that they do have cancer (FP), which won't cause serious issues beyond extra testing.
  • Case 4: The person has cancer but the prediction is that they do not have cancer (FN), which will cause issues, because the affected person won't be able to get treatment. So here we need to decrease FN, which is why we use recall.
  • Recall tells us, out of all the actual positives, how many are correctly predicted.
  • In recall, our focus is to increase TP and decrease FN.

NOTE : We need to know when to use precision and when to use recall; it depends on our use case and on which type of error matters more.
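The helper below is the project's utility for turning per-class (TP, FP, FN) counts into precision and recall values, lightly cleaned up here; CLASS_NAME is the list of class labels defined later in the AP section.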

from tqdm import tqdm

def calculate_precision_recall(tp_fp_fn_list):
    # tp_fp_fn_list: one (true_positives, false_positives, false_negatives) tuple per class
    precision_recall_dict = {}
    for idx, element in tqdm(enumerate(tp_fp_fn_list), total=len(tp_fp_fn_list)):
        true_positives, false_positives, false_negatives = element
        precision_ = true_positives / (true_positives + false_positives + 1e-6)
        recall_ = true_positives / (true_positives + false_negatives + 1e-6)
        precision = round(precision_, 4)
        recall = round(recall_, 4)
        precision_recall_dict[f'Class {idx+1}'] = {'Precision': precision, 'Recall': recall}

    # Print the results
    print('\n')

    def caps(input_str):
        return input_str.upper()

    idx = 1
    for class_name, metrics in precision_recall_dict.items():
        print(caps(CLASS_NAME[idx]))   # CLASS_NAME is defined in the AP/mAP section below
        print(f"{class_name}: Precision = {metrics['Precision']:.4f}, Recall = {metrics['Recall']:.4f}")
        print()
        idx += 1

    return precision_recall_dict

3. F-beta Score:

The F-beta score is a weighted harmonic mean of precision and recall, where the parameter beta controls the balance between the two: F-beta = (1 + beta^2) * Precision * Recall / (beta^2 * Precision + Recall).

When the two are weighted equally, beta becomes 1 and we get the F1 score. It is the most commonly used form of this metric and is simply the harmonic mean of precision and recall.
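As a minimal sketch (not project code), the definition translates directly into a small helper:

def f_beta_score(precision, recall, beta=1.0):
    # Weighted harmonic mean of precision and recall.
    # beta > 1 weights recall more heavily; beta < 1 weights precision more heavily.
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

print(f_beta_score(0.8, 0.6))          # F1 score
print(f_beta_score(0.8, 0.6, beta=2))  # F2 score, recall-weighted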

4. Average Precision (AP)

Average Precision (AP) holds paramount significance in the assessment of object detection models, particularly within computer vision. AP captures the precision-recall trade-off by evaluating the model's precision and recall values as the confidence-score threshold is varied. Precision signifies the accuracy of the model's positive predictions, while recall quantifies the model's ability to successfully identify all relevant objects. AP strikes a balance between false positives and false negatives, encapsulating the intricacies of the model's performance: the precision-recall values computed at different confidence thresholds form a precision-recall curve, and the area under this curve (AUC) is the AP. Higher values indicate better model performance.

Average Precision is the area under the Precision-Recall curve

Methods for Calculating AP:

1. Approximate the PR curve with rectangles:

  • For each precision-recall pair (j = 0, …, n−1), the area under the PR curve can be approximated using rectangles.
  • The width of each rectangle is the difference between two consecutive recall values (r(k), r(k−1)), and the height is the maximum of the precision values at those recall points, i.e. w = r(k) − r(k−1), h = max(p(k), p(k−1)).

The average precision can be calculated by approximating the area under the curve as the sum of the areas of these rectangles.

Calculating Average Precision from PR curve using Rectangle Approximation

The width and height of each rectangle are calculated as w_k = r(k) − r(k−1) and h_k = max(p(k), p(k−1)).

Then, the average precision is the sum of the areas of these rectangles: AP = Σ w_k · h_k.

In my project, I have used a modified version of this technique where, instead of rectangles, we use trapezoids. Although it is slightly more computationally intensive, it provides a more precise estimate of the area, as sketched below.
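Here is a minimal sketch of the two approximations side by side, using made-up PR points rather than project data:

import torch

# Toy precision-recall points (recall is increasing)
recall    = torch.tensor([0.00, 0.25, 0.50, 0.75, 1.00])
precision = torch.tensor([1.00, 0.80, 0.70, 0.55, 0.40])

# Rectangle approximation: width = r(k) - r(k-1), height = max(p(k), p(k-1))
widths  = recall[1:] - recall[:-1]
heights = torch.maximum(precision[1:], precision[:-1])
ap_rectangles = torch.sum(widths * heights)

# Trapezoid approximation: average the two precisions instead of taking their max
ap_trapezoids = torch.trapezoid(precision, recall)

print(ap_rectangles.item(), ap_trapezoids.item())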

2. 11-Point Interpolation:

  • 11-point interpolation refers to a method of approximating a function’s value at intermediate points using a set of 11 data points.
  • The recall values between [0, 1.0] are considered with an increment of 0.1.
  • For the recall values at [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] the precision is calculated.

Take the maximum precision value to the right of each recall value. In other words, for each recall value, find the highest precision corresponding to any recall greater than or equal to it.

In simple words: start from the last precision value and keep moving to the left; as soon as a higher precision value is found, update the running precision value.

Calculating Average Precision from PR curve

The average precision is then simply the average of the precisions at this set of 11 recall points.

Wait, yes, I know! I understand your dilemma about how this counts as the area under the curve, but this is how the metric is defined, and it is the assumption we all use.
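A minimal sketch of the 11-point calculation (illustrative code, not from the project) could look like this:

import numpy as np

def eleven_point_ap(recall, precision):
    # recall, precision: raw PR points from the sweep, with recall in increasing order
    recall = np.asarray(recall)
    precision = np.asarray(precision)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):                            # 0.0, 0.1, ..., 1.0
        mask = recall >= r
        p_interp = precision[mask].max() if mask.any() else 0.0    # max precision to the right of r
        ap += p_interp / 11.0
    return ap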

Now let’s look into the implementation part:-

NOTE:- This is just an example and has nothing to do with my project; also, in this example there is only one class label, whereas in general there can be multiple class labels.

In this example, we have only 3 images in the test dataset.

1. Get all bounding box predictions on your data

The first step is to get all bounding box predictions. Here, the red boxes are the predictions and the green ones are the ground-truth boxes. Each predicted bounding box has a confidence score.

2. Set an IoU threshold value

In this step you have to set an IoU threshold. The IoU of two boxes is the area of intersection of those boxes divided by the area of their union.

We will count a prediction as a true positive only if its IoU with a ground-truth box is greater than the IoU threshold. If the model identifies an object incorrectly, or its IoU is below the threshold, we count that prediction as a false positive. (A minimal sketch of an IoU helper is shown after the image examples below.)

For Image-1
For Image-2
For Image-3
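The project code further down calls a calculate_iou helper that is not shown in this article; a minimal sketch of such a helper, assuming boxes in [x1, y1, x2, y2] format, is:

def calculate_iou(box_a, box_b):
    # IoU = area of intersection / area of union, boxes given as [x1, y1, x2, y2]
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0

A prediction is then a true positive when calculate_iou(pred_box, gt_box) clears the chosen threshold, and a false positive otherwise.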

3. Keep all predictions, irrespective of image, sorted in descending order of confidence.

Predictions sorted on the basis of their confidence score

We have sorted the predictions (not the images) on the basis of their confidence scores, irrespective of class or image. Sorting by confidence lets us consider first the predictions in which the model is most confident.

4. Calculate precision and recall as we go through the output.

Now that the predictions are sorted and each one has its verdict (TP or FP), we go through the output and calculate precision and recall at each step, so we end up with multiple precision and recall values, as in the images shown below.

Calculating Precision and Recall for each prediction
Finally, we end up with multiple precision-recall values.
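Conceptually, the sweep over the sorted detections is just a running count. The sketch below uses illustrative TP/FP flags chosen so that the numbers line up with the fractions quoted in the next step (4 ground-truth boxes, 7 detections); it is not the project's actual data:

# 1 = TP, 0 = FP, already sorted by descending confidence
detections = [1, 1, 0, 0, 1, 0, 0]
total_gt = 4                            # total ground-truth boxes for this class

tp, fp = 0, 0
precisions, recalls = [], []
for flag in detections:
    tp += flag
    fp += 1 - flag
    precisions.append(tp / (tp + fp))   # 1/1, 2/2, 2/3, 2/4, 3/5, 3/6, 3/7
    recalls.append(tp / total_gt)       # 1/4, 2/4, 2/4, 2/4, 3/4, 3/4, 3/4

print(list(zip(precisions, recalls)))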

5. Average Precision Calculation

We now have multiple precision-recall points. For a particular recall value, take only the maximum precision value occurring at that recall value.

For example: the precisions 3/5, 3/6, and 3/7 all occur at the same recall value, 3/4, so only the pair (precision 3/5, recall 3/4) is kept in place of those three pairs.

Precision v/s Recall

Average precision is the area under this precision-recall curve.

We can easily calculate this area using the trapezoidal rule.

NOTE:-

For the trapezoidal rule formula, in our case f(x) is the precision value at the different recall values (x), and delta x is the difference between two consecutive recall values, as you can see in the plot.
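Continuing the toy numbers from step 4, once only the maximum precision per recall value is kept, the area follows from a single torch.trapezoid call (the project code below uses dx = 1/len_gt instead of explicit recall values, which gives the same spacing here):

import torch

# Max-precision point kept for each distinct recall value (from step 5)
recall_pts    = torch.tensor([0.25, 0.50, 0.75])
precision_pts = torch.tensor([1.00, 1.00, 0.60])

ap = torch.trapezoid(precision_pts, recall_pts)   # f(x) = precision, x = recall
print(ap.item())                                  # ~0.45 for these toy values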

The code for our project-specific implementation:

from collections import OrderedDict
import os
import torch

test_image_path = "/content/drive/My Drive/helmet_dataset/test/images"

CLASS_NAME = ['__background__', 'helmet', 'head', 'person']
num_classes = len(CLASS_NAME)

# act_box_all_images = dictionary containing actual bounding boxes and labels for each image, len=500
# pred_box_all_images = dictionary containing predicted bounding boxes, labels & scores for each image, len=500

def calculate_avg_precision_per_class(act_box_all_images, pred_box_all_images, class_idx, iou_threshold=0.5):
    dict1 = {}  # reset per class so predictions from a previous class don't leak in
    for image_name, y in pred_box_all_images.items():
        for idx, lbl in enumerate(y['labels']):
            if lbl == class_idx:
                temp = y['scores'][idx]
                dict1[temp] = [y['boxes'][idx], image_name]

    # dict1 contains the class-specific predictions with scores as keys and [bbox, image_name] lists as values
    # example: dict1 = {0.90: [[10, 20, 30, 40], 'image_name1'], 0.86: [[22, 45, 60, 63], 'image_name2']}

    sorted_dict = OrderedDict(sorted(dict1.items(), reverse=True))  # sort dict1 in descending order by its keys (the scores)

    len_gt = 0       # total number of gt boxes for this class in the whole dataset (all 500 test images)
    dict_check = {}  # tracks which gt boxes have already been matched

    for image_name, y in act_box_all_images.items():
        counter = 0
        for idx, lbl in enumerate(y['labels']):
            if lbl == class_idx:
                counter += 1
                len_gt += 1
        dict_check[image_name] = [0] * counter

    listr = []            # all recall values
    listp = []            # precision values for the corresponding recall values in listr
    tp = 0                # accumulated true positives
    fp = 0                # accumulated false positives
    fn = len_gt - tp      # since tp + fn = total number of gt boxes

    for score, y in sorted_dict.items():
        total_gt_boxes = []
        input_image = y[1]

        # get_box_lbl returns the ground-truth boxes and labels for an image (Pascal VOC style)
        act_box, act_lbl = get_box_lbl(os.path.join(test_image_path, input_image))
        for idx, lbl in enumerate(act_lbl):  # collect all gt boxes of this class in the image of the predicted box
            if lbl == class_idx:
                total_gt_boxes.append(act_box[idx])

        temp = 0
        iou_max = 0

        for idx, box in enumerate(total_gt_boxes):  # IoU of the predicted box with every unmatched gt box; keep the max
            if dict_check[input_image][idx] != 1:
                val = calculate_iou(y[0], box)
                iou_max = max(iou_max, val)
                if iou_max == val:
                    temp = idx

        if iou_max > iou_threshold:
            tp += 1
            dict_check[input_image][temp] = 1  # mark this gt box as matched
        else:
            fp += 1

        # precision & recall for this class at this confidence-score threshold
        precision = tp / (tp + fp)
        recall = tp / len_gt
        listp.append(precision)
        listr.append(recall)

    # keep only the max precision value for each distinct recall value
    recall_last = 0
    final_precision_values = []
    for i, value in enumerate(listr):
        if value > recall_last:
            recall_last = value
            final_precision_values.append(listp[i])

    precision_tensor = torch.tensor(final_precision_values, dtype=torch.float32)  # torch.trapezoid only takes tensors as input
    interval_width = 1 / len_gt                                                   # recall step between consecutive (x, y) pairs
    ap_value = torch.trapezoid(precision_tensor, dx=interval_width)               # calculating the area via the trapezoidal rule
    # ap_value is the Average Precision (AP) value for that class

    ap_value_scalar = ap_value.item()

    return class_idx, ap_value_scalar

6. Mean Average Precision:

An AP value can be calculated for each class. The mean average precision is calculated by taking the average of the AP values across all k classes under consideration, i.e. mAP = (AP_1 + AP_2 + … + AP_k) / k.

Mean Average Precision — The mean of Average Precision (AP) across all the k classes
from tqdm import tqdm

# this function simply calculates the mean average precision by taking the mean of the AP values of all 3 classes
def mean_avg_precision(act_box, pred_box, num_classes, iou_threshold=0.5):
    avg_precision_dict = {}
    for class_idx in tqdm(range(1, num_classes)):  # skip index 0, the '__background__' class
        class_id, ap_value_scalar = calculate_avg_precision_per_class(act_box, pred_box, class_idx, iou_threshold=iou_threshold)
        avg_precision_dict[class_id] = ap_value_scalar

    final_sum = 0
    for ap in avg_precision_dict.values():
        final_sum += ap

    map_val = round(final_sum / len(avg_precision_dict), 4)
    map_val = map_val * 100  # express as a percentage

    return map_val

7. Specialized Variants — mAP@0.50 and mAP@0.50–0.95:

  • mAP@0.50: This metric assesses how well a model can locate objects with a moderate Intersection over Union (IoU) overlap of at least 0.50 (50%) with a ground-truth object.
  • mAP@0.50–0.95: The average of the mean average precision calculated at varying IoU thresholds, ranging from 0.50 to 0.95 in steps of 0.05. It gives a comprehensive view of the model's performance across different levels of detection difficulty (see the sketch below).
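As a minimal sketch of how mAP@0.50–0.95 could be assembled from the per-threshold function defined above (mean_avg_precision, which returns mAP as a percentage at a single IoU threshold), one could simply loop over the thresholds and average:

import numpy as np

def mean_avg_precision_50_95(act_box, pred_box, num_classes):
    iou_thresholds = np.arange(0.50, 1.00, 0.05)   # 0.50, 0.55, ..., 0.95
    map_values = [mean_avg_precision(act_box, pred_box, num_classes, iou_threshold=t)
                  for t in iou_thresholds]
    return sum(map_values) / len(map_values)       # average over the 10 thresholds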

Tips:

How do we decide whether to calculate mAP (mean average precision) per class or per image?

The decision to calculate mAP (Mean Average Precision) per class or per image depends on the specific requirements and objectives of your object detection task.

1. mAP per Class:

Advantages:

  • Provides insights into how well the model performs for each individual class, irrespective of class imbalance.
  • Helps identify which classes the model struggles with, which can guide further improvements.

Considerations:

  • It may overemphasize the performance of the majority class if the dataset is heavily imbalanced.
  • AP per class could still be useful for identifying class-specific weaknesses, especially for minority classes.

2. mAP per Image:

Advantages:

  • Evaluates the model’s performance on a per-image basis, providing a more balanced view across the entire dataset.
  • Less affected by class imbalance compared to AP per class because it evaluates the overall performance regardless of class distribution.

Considerations:

  • May not provide detailed insights into how well the model performs for each individual class, especially for minority classes.
  • Might mask specific class-related issues if the imbalance is severe.

Recommendation:
Given significant class imbalance in your dataset, it's advisable to prioritize evaluating the model using mAP per image. This metric offers a more balanced assessment of overall performance across all classes and is less influenced by the dominance of the majority class.

For a balanced dataset, mAP per class would be preferable.

Conclusion

The article provides a comprehensive overview of performance metrics for object detection models. It emphasizes the importance of evaluating a model's performance and efficiency before deployment. Key metrics covered include the confusion matrix, accuracy, precision, recall, the F-beta score, Average Precision (AP), and mean Average Precision (mAP). The text explains how to interpret these metrics and highlights the significance of choosing the right evaluation metric based on specific requirements. Additionally, it discusses different techniques for computing AP, offers insights into their practical implementation, and shows how we move from AP to mAP, which serves as the most comprehensive single metric for a model. Overall, the article serves as a guide for understanding and interpreting performance metrics in object detection, aiding in the evaluation and improvement of model performance.

Do Follow for more such content!!!

References:

  1. Aladdin Persson's video on mean average precision calculation.
  2. Henrique Vedoveli, "Metrics Matter: A Deep Dive into Object Detection Evaluation."
  3. Neptune.ai, for performance metric concepts.
