Understanding Evaluation Metrics in Medical Image Segmentation

Nghi Huynh
6 min read · Mar 1, 2023


Implementation of some evaluation metrics in Python

Here is the link to my Kaggle notebook code.

So you have a trained model for your segmentation task. But how do you know whether your segmentation model performs well? In other words, how do we evaluate model performance?

Evaluation metrics are the answer!

In this post, I will provide an overview of the most common evaluation metrics, demonstrate their interpretation and implementation, and propose guidelines for properly evaluating medical image segmentation performance to increase research reliability and reproducibility in the field.

Goal: to score the similarity between the predicted segmentation (prediction) and the annotated segmentation (ground truth)

5 evaluation metrics:

  • Precision and Recall (Sensitivity)
  • Accuracy/Rand index
  • Dice coefficient
  • Jaccard index (IoU)

All presented metrics are based on the computation of a confusion matrix for a binary segmentation mask, which contains the number of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) predictions. The value ranges of all presented metrics span from zero (worst) to one (best).

What is a confusion matrix? And why is it important?

A picture is worth a thousand words!

Let’s interpret these terms in our context! Our goal in this challenge is to segment a mask of Functional Tissue Units (FTUs) for each histopathological image. Each pixel in the mask is either 0 (background) or 1 (FTU). So (see the short sketch after this list):

  • TP (True Positive): the number of FTU pixels correctly classified as FTU
  • FP (False Positive): the number of background pixels misclassified as FTU (e.g., due to misalignment)
  • FN (False Negative): the number of FTU pixels misclassified as background
  • TN (True Negative): the number of background pixels correctly classified as background
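These four counts can be read off directly from two binary masks. Below is a minimal NumPy sketch (the function name is mine, not from the notebook), assuming both masks are 0/1 arrays of the same shape:

import numpy as np

def confusion_counts(groundtruth_mask, pred_mask):
    # Element-wise comparisons of 0/1 masks give the four confusion-matrix entries
    tp = np.sum((pred_mask == 1) & (groundtruth_mask == 1))
    fp = np.sum((pred_mask == 1) & (groundtruth_mask == 0))
    fn = np.sum((pred_mask == 0) & (groundtruth_mask == 1))
    tn = np.sum((pred_mask == 0) & (groundtruth_mask == 0))
    return tp, fp, fn, tn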

Let’s dive deeper into each evaluation metric, and build our understanding from simple illustrations!

Precision & Recall:

For pixel classification:

The precision score is the number of true positive results divided by the number of all positive results in the prediction: Precision = TP / (TP + FP).

The recall score, also known as Sensitivity or the true positive rate, is the number of true positive results divided by the number of all samples that should have been identified as positive: Recall = TP / (TP + FN).

import numpy as np

def precision_score_(groundtruth_mask, pred_mask):
    # TP: overlap between the prediction and the ground truth (both masks are 0/1)
    intersect = np.sum(pred_mask * groundtruth_mask)
    # TP + FP: all pixels predicted as positive
    total_pixel_pred = np.sum(pred_mask)
    precision = np.mean(intersect / total_pixel_pred)
    return round(precision, 3)

def recall_score_(groundtruth_mask, pred_mask):
    # TP: overlap between the prediction and the ground truth
    intersect = np.sum(pred_mask * groundtruth_mask)
    # TP + FN: all pixels that are positive in the ground truth
    total_pixel_truth = np.sum(groundtruth_mask)
    recall = np.mean(intersect / total_pixel_truth)
    return round(recall, 3)

Accuracy/Rand Index:

The accuracy score, also known as the Rand index, is the number of correct predictions (correct positive plus correct negative predictions) divided by the total number of predictions: Accuracy = (TP + TN) / (TP + TN + FP + FN).

def accuracy(groundtruth_mask, pred_mask):
    # TP
    intersect = np.sum(pred_mask * groundtruth_mask)
    # TP + FP + FN
    union = np.sum(pred_mask) + np.sum(groundtruth_mask) - intersect
    # TP + TN: pixels where the prediction agrees with the ground truth
    matches = np.sum(groundtruth_mask == pred_mask)
    # (TP + TN) / (TP + TN + FP + FN): correct pixels over all pixels
    acc = np.mean(matches / (union + matches - intersect))
    return round(acc, 3)

Dice Coefficient (F1-Score):

The F-measure, also called the F-score, is one of the most widespread scores for measuring performance in computer vision and in Medical Image Segmentation (MIS).

The Dice coefficient can be computed from the precision and recall of a prediction. It scores the overlap between the predicted segmentation and the ground truth, and it penalizes false positives, which are common in highly class-imbalanced datasets such as those in MIS.

Based on the F-measure, two metrics are widely used in MIS:

  • The Intersection-over-Union (IoU), also known as Jaccard index or Jaccard similarity coefficient
  • The Dice similarity coefficient (DSC), also known as the F1-score or Sørensen-Dice index: the most used metric in the large majority of scientific publications for MIS evaluation

The difference between the two metrics is that IoU penalizes under- and over-segmentation more than DSC does (the two are directly related: IoU = Dice / (2 - Dice)).

Dice coefficient = F1 score: the harmonic mean of precision and recall. Equivalently, it is calculated as 2 * intersection divided by the total number of pixels in both masks, i.e. Dice = 2*TP / (2*TP + FP + FN). Note that it’s possible to adjust the F-score to give more importance to precision over recall, or vice versa; refer to the F-beta score (sketched after the code below) for more info.

def dice_coef(groundtruth_mask, pred_mask):
    # TP
    intersect = np.sum(pred_mask * groundtruth_mask)
    # (TP + FP) + (TP + FN)
    total_sum = np.sum(pred_mask) + np.sum(groundtruth_mask)
    # 2*TP / (2*TP + FP + FN)
    dice = np.mean(2 * intersect / total_sum)
    return round(dice, 3)  # round to 3 decimal places
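As a side note on the F-beta score mentioned above: it generalizes F1 by weighting recall beta times as much as precision. Here is a minimal sketch (not part of the original notebook) that computes it from precision and recall:

def f_beta_score(precision, recall, beta=1.0):
    # beta > 1 weights recall more heavily; beta < 1 favors precision;
    # beta = 1 recovers the F1 score, i.e. the Dice coefficient.
    b2 = beta ** 2
    return round((1 + b2) * precision * recall / (b2 * precision + recall), 3)

For example, f_beta_score(1.0, 0.615) returns 0.762, matching the Dice value in Fig 1 below.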

Jaccard Index (IoU):

The Jaccard index, also known as Intersection over Union (IoU), is the area of the intersection divided by the area of the union of the predicted segmentation and the ground truth: IoU = TP / (TP + FP + FN).

def iou(groundtruth_mask, pred_mask):
    # TP
    intersect = np.sum(pred_mask * groundtruth_mask)
    # TP + FP + FN
    union = np.sum(pred_mask) + np.sum(groundtruth_mask) - intersect
    iou = np.mean(intersect / union)
    return round(iou, 3)
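As a quick sanity check of the implementations above, here is a made-up 3x3 example (the masks are illustrative, not taken from the challenge data):

import numpy as np

# Ground truth has four FTU pixels; the prediction recovers two of them.
gt_mask = np.array([[1, 1, 0],
                    [1, 1, 0],
                    [0, 0, 0]])
pred_mask = np.array([[1, 1, 0],
                      [0, 0, 0],
                      [0, 0, 0]])

print(precision_score_(gt_mask, pred_mask))  # 1.0   (no false positives)
print(recall_score_(gt_mask, pred_mask))     # 0.5   (half the FTU pixels are missed)
print(accuracy(gt_mask, pred_mask))          # 0.778 (7 of 9 pixels are correct)
print(dice_coef(gt_mask, pred_mask))         # 0.667
print(iou(gt_mask, pred_mask))               # 0.5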

Visualization + Conclusion:

Let’s put it all together and see how one metric is more suitable for a segmentation task than another!

Simple ground truth + predicted masks

Fig 1: Ground truth + prediction mask for pixel classification
metrics_table([gt_mask_0], [pred_mask_0])
╔═══════════╦════════╦══════════╦═══════╦═══════╗
║ Precision ║ Recall ║ Accuracy ║ Dice  ║ IoU   ║
╠═══════════╬════════╬══════════╬═══════╬═══════╣
║ 1.0       ║ 0.615  ║ 0.8      ║ 0.762 ║ 0.615 ║
╚═══════════╩════════╩══════════╩═══════╩═══════╝
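The metrics_table helper is defined in the Kaggle notebook; a rough sketch of what it might look like, built on the functions defined above (the box-drawing formatting is simplified here):

def metrics_table(groundtruth_masks, pred_masks):
    # Print one row of scores per (ground truth, prediction) pair
    header = ["Precision", "Recall", "Accuracy", "Dice", "IoU"]
    print(("{:>10}" * len(header)).format(*header))
    for gt, pred in zip(groundtruth_masks, pred_masks):
        scores = [precision_score_(gt, pred), recall_score_(gt, pred),
                  accuracy(gt, pred), dice_coef(gt, pred), iou(gt, pred)]
        print(("{:>10}" * len(scores)).format(*scores))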

Observations:

  • Precision score = 1! What an awesome prediction! But wait, doesn’t the predicted mask differ from the ground truth by 5 pixels? Is something wrong with this metric? Recall that the precision score is the total number of true positives divided by the total number of positive results in the prediction. Coincidentally, those two numbers are equal here (both are 8), so the precision score is 1.
  • Recall = IoU
  • Dice > IoU

Ground truth + predicted masks from the challenge

╔═══════════╦════════╦══════════╦═══════╦═══════╗
║ Precision ║ Recall ║ Accuracy ║ Dice  ║ IoU   ║
╠═══════════╬════════╬══════════╬═══════╬═══════╣
║ 0.256     ║ 0.256  ║ 0.828    ║ 0.256 ║ 0.147 ║
║ 0.721     ║ 0.721  ║ 0.951    ║ 0.721 ║ 0.563 ║
╚═══════════╩════════╩══════════╩═══════╩═══════╝

Observations:

  • High accuracy, but low Dice and IoU scores. Why is that? Recall how the accuracy score is calculated: the total number of correctly predicted pixels (1 = FTU, 0 = background) divided by the total number of predictions. Only a small percentage of pixels belong to FTUs compared to the background, so the correctly classified background pixels dominate the count of correct predictions and produce an illegitimately high score (see the toy example after this list).
  • Precision = Recall = Dice
  • Dice coefficient > IoU
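To make the class-imbalance effect concrete, here is a made-up 10x10 example (not from the challenge data) in which only 5 of 100 pixels are FTU and the prediction finds just one of them:

import numpy as np

gt_mask = np.zeros((10, 10))
gt_mask[0, :5] = 1      # 5 FTU pixels out of 100
pred_mask = np.zeros((10, 10))
pred_mask[0, 0] = 1     # a single, correctly placed FTU pixel

print(accuracy(gt_mask, pred_mask))   # 0.96: inflated by the 95 correctly classified background pixels
print(dice_coef(gt_mask, pred_mask))  # 0.333
print(iou(gt_mask, pred_mask))        # 0.2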

Conclusion

In this post, I’ve demonstrated 5 evaluation metrics in Medical Image Segmentation (MIS).

  • Precision and Recall (Sensitivity)
  • Accuracy/Rand index
  • Dice coefficient
  • Jaccard index (IoU)

However, using accuracy in MIS is strongly discouraged, since MIS datasets are highly class-imbalanced between ROIs and background. In other words, a medical segmentation mask usually contains only a small percentage of pixels in the ROIs, while the rest of the image is annotated as background. Because the accuracy score counts true negative results as correct, it will almost always produce an illegitimately high score.

In contrast, the Dice coefficient and IoU are the most commonly used metrics for semantic segmentation because both penalize false positives, which are common in highly class-imbalanced datasets like those in MIS. Choosing the Dice coefficient over IoU, or vice versa, depends on the specific use case of the task.

