A discovery dive into the world of evaluation — Do’s, don’ts and other considerations
written by Annika Reinke (a), Minu D. Tizabi (a) and Carole H. Sudre (b, c)
(a) Div. Computer Assisted Medical Interventions and HIP Helmholtz Imaging Platform, German Cancer Research Center (DKFZ)
(b) Centre for Medical Image Computing and Medical Research Council Unit for Lifelong Health and Ageing at UCL, University College London
(c) School of Biomedical Engineering and Imaging Science, King’s College London
Evaluation metrics? What for?
So you have newly developed a model or an algorithm. You’re proud of it, but how can you prove that it works as expected? Or that it surpasses other available solutions? This is where the validation part of your work begins! And to validate, you need to evaluate… While a qualitative assessment of results may demonstrate that your system works as you want it to, it is certainly not sufficient to convince your peers of the quality of your proposal. Enter evaluation metrics! There exists a multitude of metrics covering different aspects of the respective problem at hand that allow you to quantify your solution’s performance. In this blogpost, we will present some typical examples, explain what exactly they allow you to assess, examine their strengths and limitations, and investigate how to usefully combine these metrics (and when not to).
Disclaimer: The world of performance measures is huge. This post is in no way an exhaustive review of all existing evaluation measures for all possible tasks! We will focus on common problems related to classification and segmentation and aim to provide the reader (you!) with some aspects worth considering when deciding on how to evaluate your new method.
A few definitions to start with
Many evaluation metrics rely on the definition of four important terms: True positive (TP), true negative (TN), false negative (FN) and false positive (FP), as depicted in the confusion matrix in Figure 1. In the following, we will explain what they refer to when we talk about classification and segmentation.
Classification
Let’s consider an example of classifying whether an image contains a penguin (positive) or not (negative). Our dataset contains 20 images, 9 of which contain a penguin, while the remaining 11 show other animals. Your classifier predicts 7 of the 9 penguins correctly, but also marks one of the other animals as a penguin (see Figure 2). The TP, TN, FN and FP refer to the image-level classification and are summed over all images to produce a metric score. More on this later.
The number of true positives refers to the number of samples that were correctly classified as the positive class. In our example, we classified 7 penguins correctly, so TP = 7.
The number of true negatives refers to the number of samples that were correctly classified as the negative class. In our example, 10 other animals were classified correctly as being no penguins, so TN = 10.
The number of false negatives refers to the number of samples that were incorrectly classified as the negative class. In our example, 2 penguins were classified as not being a penguin, so FN = 2.
Finally, the number of false positives refers to the number of samples that were incorrectly classified as positive samples. In our example, one animal was classified as penguin, so FP = 1.
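To make these four counts concrete, here is a minimal sketch of the penguin example, assuming scikit-learn is available; the 0/1 label lists are made up so that they reproduce exactly the counts above.

```python
from sklearn.metrics import confusion_matrix

# 1 = penguin (positive), 0 = other animal (negative); 20 images in total
y_true = [1] * 9 + [0] * 11
y_pred = [1] * 7 + [0] * 2 + [1] * 1 + [0] * 10  # 7 TP, 2 FN, 1 FP, 10 TN

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fn, fp)  # 7 10 2 1
```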
Segmentation
For segmentation tasks, i.e. partitioning an image into multiple segments as sets of pixels, we can also use TP, TN, FN and FP to compute our performance measures. But in this case, the classification occurs at the level of single pixels/voxels instead of whole images. We assess for each pixel/voxel whether it is a TP, FN, TN or FP in comparison to the reference labelling. For instance, in binary segmentation tasks, where we are deciding whether a pixel/voxel belongs to the object (positive) or the background (negative), we will compare it to the established gold standard/reference segmentation.
Classification metrics
The classical one is not always the right choice!
There are probably some classical metrics that you are naturally inclined to choose as your “go-to” metrics. In the following example, we will show how some of these classical metrics may actually trick you into believing that your system is extremely performant.
Accuracy is a measure that is often introduced as one of the go-to evaluation measures in classification problems. It describes the ratio between the number of correctly classified samples and the total number of samples. It is calculated as Accuracy = (TP + TN) / (TP + TN + FN + FP).
Imagine you are given the task to classify patients into sick (positive) or healthy (negative) classes. So, you build your model, start training and now evaluate your results on a test set of 100 patients.
You compute your results and wow — 97% Accuracy! Your algorithm must be extremely good. But before popping champagne and writing a paper about your amazing method, you should probably take a step back and look at your data in depth.
And, surprise: In your test dataset, indeed, you have 97 sick subjects, and 3 healthy. Your dataset was highly imbalanced.
And there goes the problem: Your model only learned to perform a majority vote, meaning that all patients have been assigned to the majority class and marked as sick, as shown in Figure 3!
In our case, your algorithm predicts the 97 sick cases correctly (TP = 97) but misclassifies the 3 healthy patients, so FP = 3; there are no FN and no TN. This results in (97 + 0)/(97 + 0 + 3 + 0) = 0.97. You see that Accuracy is a metric not designed for imbalanced datasets. But don’t worry, solutions such as balanced accuracy, which weighs the different classes according to the inverse of their prevalence, have been introduced exactly for this type of problem (see the sketch below)!
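Here is a minimal sketch of this imbalanced example, assuming scikit-learn is available: the model that always predicts “sick” reaches 97% Accuracy but only 50% balanced accuracy, i.e. chance level.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = [1] * 97 + [0] * 3   # 1 = sick (positive), 0 = healthy (negative)
y_pred = [1] * 100            # majority vote: every patient is predicted "sick"

print(accuracy_score(y_true, y_pred))           # 0.97 - looks impressive
print(balanced_accuracy_score(y_true, y_pred))  # 0.5  - no better than random guessing
```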
Besides the Accuracy, there exists a range of metrics, all based on the four terms TP, TN, FN and FP. The Recall (also known as true positive rate or sensitivity) determines how many of all positive samples are identified as positive. Precision (also known as positive predictive value), on the other hand, refers to how many of the samples marked positive by your method are actually positive; it is calculated by dividing the number of TP by the number of all samples marked positive. Specificity measures how well the negative class is captured, calculated as the ratio between the correctly classified negative samples and all actual negative samples.
Although these three metrics look very similar in their formulas (see Figure 4), they all measure slightly different things. For some applications in the medical domain, for example, it is very important to find all positive samples. In this case, having some false positives may be acceptable, as long as you don’t miss any actual positives (i.e. produce false negatives), for example polyps in endoscopic images. For other problems, it may be important to reliably identify the negative class as well. Depending on the driving clinical question, you have to pick the correct metric. It can often also be beneficial to calculate more than one to make sure your algorithm successfully captures various properties.
Based on the important definitions of TP, TN, FN and FP, we can define further classification metrics. Some of the most commonly known beyond the ones previously mentioned are the following (see Figure 4 for formulas): The F1 score is computed as the harmonic mean of Precision and Recall. It can be further generalized to the Fβ score, which weights Recall β times as much as Precision and can thus be adapted to the clinical cost of the two error types. A metric to detect rare events is the Threat score. A more complex metric is the Matthews correlation coefficient (MCC), which is well suited for imbalanced classes [1]. Apart from the MCC, all of the metrics mentioned so far are bounded between 0 and 1, with 1 being a perfect value and 0 the worst. The MCC is bounded between -1 and 1: again, 1 means a perfect prediction, -1 corresponds to complete disagreement and 0 to a random guess.
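As a small illustration, the following sketch computes these metrics directly from the four counts of the penguin example (TP = 7, TN = 10, FN = 2, FP = 1), using the standard textbook formulas rather than any particular library.

```python
import math

tp, tn, fn, fp = 7, 10, 2, 1

recall      = tp / (tp + fn)                       # sensitivity / true positive rate
precision   = tp / (tp + fp)                       # positive predictive value
specificity = tn / (tn + fp)
f1          = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"Recall {recall:.2f}, Precision {precision:.2f}, "
      f"Specificity {specificity:.2f}, F1 {f1:.2f}, MCC {mcc:.2f}")
```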
Regardless of what metric you choose in the end, it may be a good idea to just report the raw numbers of TP, TN, FN and FP. Then, everyone will be able to calculate various metrics, which is important to ensure reproducibility and interpretability of your results.
So there are many metrics to validate your classification algorithm, each of them capturing slightly different properties. But wouldn’t it be great to have a metric that could directly tell you how well your model distinguishes between the positive and negative class? Luckily, there is the Area under the Receiver Operating Characteristic Curve (AUC or AUROC). Typically, a classification model outputs a probability (for example, you are 87% certain that an image shows a penguin, not a nun). Based on this probability, you decide whether the object belongs to the positive or negative class, for example everything with a value larger than 50% is classified as positive (see Figure 5). For each threshold, you compute TP, TN, FP and FN and the resulting Recall and Specificity. Plotting Recall (the true positive rate) against 1 − Specificity (the false positive rate) for all thresholds yields the ROC curve. As a curve cannot easily be reported as a single interpretable value, we calculate the area under it, the AUC. An AUC of 1 means that you distinguish perfectly between the classes, 0 that you interchange them and 0.5 indicates random guessing. It needs to be noted, however, that the AUC also comes with some drawbacks [3, 4].
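A minimal sketch of this procedure, assuming scikit-learn is available; the labels and predicted probabilities are made up purely for illustration.

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [1, 1, 1, 0, 0, 1, 0, 0]                          # reference labels
y_score = [0.87, 0.75, 0.55, 0.60, 0.30, 0.40, 0.20, 0.10]  # predicted P(positive)

# One (FPR, TPR) pair per threshold: TPR is the Recall, FPR is 1 - Specificity.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))  # area under the resulting ROC curve
```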
Multi-class classification
This is valid for binary classification in tasks related to the identification of a condition (e.g. is there a tumour in the image? Does the subject suffer from multiple sclerosis? Are there cancerous cells in the histopathological slice?). However, it may happen that multiple classes have to be separated. These categories may either be totally disjoint (e.g. different subtypes of neurodegeneration or different objects) or have a natural order (grading of cancer stage or disease status). Specific metrics must again be chosen to properly assess the performance of a solution in a multi-class setting.
To visualize the performance of multi-class classification, confusion matrices (see Figure 1) are, as in the binary case, a very useful tool. Not only do they give an idea of the number of correct predictions, they also reveal the most likely sources of confusion. This is particularly useful in ordinal settings.
“Classical” binary metrics can be combined and used in a multi-class setting to provide a single global measure of performance [5]. Two strategies are commonly used to combine the performance obtained for each class: micro- and macro-averaging. Macro-averaging computes the metric separately for each class and then takes the (possibly weighted) average, whereas micro-averaging pools the TP, TN, FN and FP over all samples and classes and computes the metric once (see the sketch below). There are even strategies to account for the multi-class component and obtain multi-class versions of the AUC.
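A minimal sketch of the difference, assuming scikit-learn is available and using a made-up, imbalanced three-class example; the `average` argument switches between the averaging strategies.

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 2, 2, 2]

print(f1_score(y_true, y_pred, average="macro"))     # unweighted mean of per-class F1
print(f1_score(y_true, y_pred, average="micro"))     # pools TP/FP/FN over all samples
print(f1_score(y_true, y_pred, average="weighted"))  # per-class F1 weighted by support
```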
Specific metrics have also been proposed to deal with multi-class problems in particular. The intra-class correlation [6] is typically used to assess how a set of raters agree in their classification into multiple groups. But nothing prevents the raters from being algorithms instead of human experts!
The main problem with the measures mentioned above is that the cost of an error is always the same: taking a subject with normal cognition for a patient suffering from Alzheimer’s Disease (AD) “costs” the same as confusing a patient with AD with one with advanced mild cognitive impairment. Again, there are ways to adjust the cost of errors to the different classes in an ordinal setting. At this stage, one can actually treat the classification problem as a regression problem and adopt the associated measures of performance, looking in particular at distances [7].
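As a minimal sketch of such distance-aware scoring for a made-up ordinal example (0 = normal cognition, 1 = mild cognitive impairment, 2 = AD), assuming scikit-learn is available; neither option is prescribed by the text, they simply illustrate that confusing distant classes can be made to cost more.

```python
from sklearn.metrics import cohen_kappa_score, mean_absolute_error

y_true = [0, 0, 1, 1, 2, 2, 2, 0]
y_pred = [0, 1, 1, 2, 2, 2, 0, 0]

# Quadratically weighted kappa: confusing class 0 with class 2 costs four times
# as much as confusing two neighbouring classes.
print(cohen_kappa_score(y_true, y_pred, weights="quadratic"))

# Treating the ordinal labels as a regression target and measuring the error distance.
print(mean_absolute_error(y_true, y_pred))
```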
While the question of measures related to regression problems falls out of the scope of this post, it is important to keep in mind how porous the border between tasks can be!
Image segmentation
Let’s consider another type of task: image segmentation, i.e. classifying all object pixels/voxels as the object class and all other pixels/voxels as the background class or, in simpler words, drawing an outline around an object. How do you evaluate performance here? Well, there are many paths you might take. You may first want to look at the degree of overlap between your result and the reference, the quality of the border, or other properties (such as the granularity of the segmentation).
As mentioned earlier, it is possible to consider the labelling of each pixel/voxel in your image as a classification problem, which allows you to apply evaluation measures similar to the ones described for classification tasks. However, you may encounter an even stronger degree of imbalance that would make some of these (e.g. Specificity or Accuracy) completely meaningless (see the sketch below).
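A minimal NumPy-only sketch of this effect: for a small object on a 512x512 image, an entirely empty prediction still obtains a pixel-level Accuracy above 99.9%.

```python
import numpy as np

reference = np.zeros((512, 512), dtype=bool)
reference[250:260, 250:260] = True      # a 10x10 object (100 of 262,144 pixels)
prediction = np.zeros_like(reference)   # the model misses the object entirely

accuracy = (prediction == reference).mean()
print(f"{accuracy:.4f}")                # ~0.9996, despite segmenting nothing
```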
Overlap measures
If you search for metrics measuring overlap, you will probably find heavy usage of the Dice Sørensen Coefficient (DSC) or the Intersection over Union (IoU), also known as Jaccard Index. Figure 6 shows how they indeed reflect the degree of overlap between objects A and B.
They are then defined as DSC = 2|A ∩ B| / (|A| + |B|) and IoU = |A ∩ B| / |A ∪ B|.
One must note the direct relationship between the two metrics. Indeed, 1/DSC = ½ + 1/(2IoU). Therefore, if we consider two measures DSC1 and DSC2 with DSC1 < DSC2, we will also have IoU1 < IoU2. These two measures representing the quality of overlap will thus show the same ranking (see Figure 7). They are therefore redundant and do not need to be presented simultaneously (choose either one or the other!).
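A minimal NumPy-only sketch of both metrics (the helper names `dsc` and `iou` are ours), which also checks the relationship stated above.

```python
import numpy as np

def dsc(a, b):
    intersection = np.logical_and(a, b).sum()
    return 2.0 * intersection / (a.sum() + b.sum())

def iou(a, b):
    intersection = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return intersection / union

reference  = np.zeros((10, 10), dtype=bool); reference[2:7, 2:7]  = True  # 5x5 square
prediction = np.zeros((10, 10), dtype=bool); prediction[3:8, 3:8] = True  # shifted copy

d, j = dsc(reference, prediction), iou(reference, prediction)
print(d, j, np.isclose(1 / d, 0.5 + 1 / (2 * j)))  # True: the two metrics rank identically
```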
Computing the overlap between structures seems reasonable for segmentation tasks. So can we stop here and just use DSC or IoU? Not so fast! Consider the example presented in Figure 8. Prediction 1 and Prediction 2 both delineate an object of the correct size (a 3x3 square), but at the wrong position, so there is no overlap with the reference and the DSC is 0 in both cases. But Prediction 1 is definitely closer to the correct solution than Prediction 2. In this case, it would make sense to additionally compute a distance-based metric, such as the Hausdorff Distance (HD). Comparing the two HD values gives us what we expect: the distance is smaller for Prediction 1, so it is the better prediction. You probably want to know how to use this metric! But wait a minute.
We will discuss distance-based metrics in a bit and, for now, stay with the DSC for some more examples. Why? Because it is the most frequently used metric in segmentation competitions, as demonstrated in [8], and that is for a reason. Indeed, the metric is suitable for many use cases, but there are some caveats. When evaluating large objects, the DSC is not a bad choice. Unfortunately, for smaller structures, a single-pixel difference between the reference and the prediction may already lead to a very large drop in the DSC score (see Figure 9). For a larger structure, this drop is much smaller. So if your data contains many objects of different sizes, consider other metrics or combine the DSC with distance-based metrics.
Furthermore, overlap measures are typically not designed to measure differences in shape. So a DSC value will not tell you whether your object has the correct shape. Figure 10 shows several examples that differ tremendously in shape, resulting in the exact same DSC value! So what to do?
Outline quality measures
As mentioned above, you may find examples where the DSC is identical, but one segmentation follows the reference border roughly everywhere, while another matches it exactly except for one large deviation at a single location. There are three commonly used evaluation measures for assessing the quality of the border:
- Hausdorff distance (HD) (yes, that’s the one we saw before!) [9]: Don’t worry, the formula looks a bit scary, but it simply stands for the following: the metric is defined as the longest of the shortest distances between two outlines. In other words, for each point on the surface of the predicted segmentation, we add to a list the smallest distance to the surface of the reference labelling and take the maximum over that list. We then do the same in the other direction and take the maximum of the two values.
- Average symmetric surface distance (ASSD) [10]: This consists of calculating the average of the mutual distances between each surface voxel of one labelling and the closest surface voxel of the other labelling.
- Surface Dice (NSD) [11]: As the name suggests, the metric is a mixture of the DSC and a surface distance metric. It considers the overlap between surfaces, rather than volumes. It actually has the same formula regarding TP, FP and FN as the DSC, but we define them differently. Instead of considering a pixel as TP when it overlaps the reference, we measure the distance between the two outlines. If it is below a certain threshold, then we consider the pixel as TP, otherwise it is a FP or FN, depending on whether you are currently comparing reference with prediction or vice versa.
The ASSD is generally less sensitive to outliers than the HD. By definition, the HD uses non-robust statistics (min and max), but this drawback can be mitigated by using more robust variants such as the HD95, which considers the 95th percentile of the distances instead of the maximum.
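A minimal sketch of HD, HD95 and ASSD with NumPy and SciPy; for simplicity it uses all object pixels rather than an extracted surface, and toolkits differ slightly in how they define HD95, so treat this as an approximation rather than a reference implementation. The helper names are ours.

```python
import numpy as np
from scipy.spatial import cKDTree

def surface_distances(a, b):
    """Distances from each object pixel of `a` to the nearest object pixel of `b`."""
    pts_a, pts_b = np.argwhere(a), np.argwhere(b)
    return cKDTree(pts_b).query(pts_a)[0]

def hd_hd95_assd(pred, ref):
    d_pr = surface_distances(pred, ref)   # prediction -> reference
    d_rp = surface_distances(ref, pred)   # reference -> prediction
    hd   = max(d_pr.max(), d_rp.max())
    hd95 = max(np.percentile(d_pr, 95), np.percentile(d_rp, 95))
    assd = (d_pr.sum() + d_rp.sum()) / (len(d_pr) + len(d_rp))
    return hd, hd95, assd

reference  = np.zeros((20, 20), dtype=bool); reference[5:10, 5:10]  = True
prediction = np.zeros((20, 20), dtype=bool); prediction[6:11, 6:11] = True
print(hd_hd95_assd(prediction, reference))
```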
Measures associated with other properties
Beyond distance and overlap, one may want to look at other aspects relevant for the task at hand, such as the granularity of the obtained segmentation compared to the reference, measures based on entropy, or more exotic metrics that penalize over- or under-segmentation. The list is huge and, unfortunately, we cannot cover all of them. If you are interested, check out some of the work on metrics [2, 10, 12–14], such as connectivity metrics and the intensity-based Jaccard index [15], shown in Figure 11.
Multi-class segmentation
The previously presented evaluation strategies are classically used for binary cases with only one type of object to be segmented, like the brain, the liver, the tumour… In other contexts, such as the delineation of multiple organs, we do not only have one but multiple classes to consider. What, then, about the evaluation? One could say that we simply need to treat each class as a binary segmentation and then average the results over the different classes of interest. This may seem a reasonable solution but may not be very representative if the volumes of the different classes vary greatly. Let’s imagine a situation in which we have three classes, one being 1,000 times larger than the other two. Given the relationship between volume and overlap measures such as the DSC, with individual DSC values of 90% for the largest class and 50% and 20% for the two others, the unweighted average would be 53.3%, even though almost all of the segmented volume is of high quality. Instead, weights can be attributed to each class, for instance according to their volume (see the sketch below). Specific solutions have been developed to better account for volume imbalance across classes, such as the Generalized Dice (or Jaccard) proposed in [16].
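As a rough NumPy sketch of this volume effect, here is the unweighted mean of the per-class DSC values from the example above next to a volume-weighted mean; this weighting is only in the spirit of, not identical to, the Generalized Dice of [16].

```python
import numpy as np

dsc_per_class    = np.array([0.90, 0.50, 0.20])
volume_per_class = np.array([1000.0, 1.0, 1.0])  # one class 1,000x larger than the others

print(dsc_per_class.mean())             # 0.533 - plain unweighted average
weights = volume_per_class / volume_per_class.sum()
print(np.sum(weights * dsc_per_class))  # ~0.899 - volume-weighted average
```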
Surrogate measures of segmentation quality
While direct comparisons to a reference are classically used for the validation of new techniques, they are rarely feasible at large scale. You can well imagine that the manual segmentation of hundreds or thousands of cases by expert raters is particularly challenging to achieve. In order to assess the large-scale validity of the obtained results, the clinical context of the original task can help in identifying surrogate measures of validity, like a biomarker (e.g. volume of multiple sclerosis lesions) that is strongly correlated with a clinical phenotype (e.g. a disability score). We can then check whether such a correlation is also found when applying the proposed method to a larger population. Alternatively, if we know that two groups differ in terms of the expected segmentation, one can check whether the statistical difference is stronger using one method than another.
How to describe an evaluation measure distribution?
You have eventually obtained the metric values for all the cases you wanted to assess. But what now? How can you properly describe your results without putting every single value for each case into one huge table? First, you have to decide on a strategy for cases in which the chosen metric is not defined. This often happens when both the reference and the predicted segmentation are empty. In the case of an empty reference, the correct answer would naturally be an empty segmentation. However, any size of error (be it 1 voxel or 10,000) yields the same DSC score: 0. Is it best, then, to attribute 1 to these NaN results, or to consider the empty cases separately?
On the other hand, it may happen that you don’t obtain results for all images, for whatever reason. This especially occurs in competitions: perhaps one participant simply forgot to submit the result for one image; we call these missing values. In this case, you should set the corresponding scores to the worst possible value, for instance 0 for the DSC or IoU (see the sketch below).
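A minimal sketch of one possible aggregation policy, shown only as an illustration rather than a recommendation: missing predictions are set to the worst value (0 for the DSC), while cases in which the metric is undefined (both masks empty, encoded here as NaN) are counted and reported separately.

```python
import numpy as np

dsc_scores = [0.85, 0.91, np.nan, 0.78, None, 0.88]  # NaN = both empty, None = missing

missing_filled = [0.0 if s is None else s for s in dsc_scores]   # worst possible value
valid          = [s for s in missing_filled if not np.isnan(s)]  # report NaN cases apart
n_empty_cases  = sum(np.isnan(s) for s in missing_filled)

print(np.median(valid), n_empty_cases)
```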
One important aspect when assessing your method is to look not only at the mean (or median) of the distribution but also at its variability. For example, even with a slightly higher median, a method that has a few complete failure cases might be considered worse than a method that is consistently successful, albeit to a slightly lesser degree (see Figure 12).
Therefore, it is essential to always provide measures of variability as a complement to measures of central tendency. This can be the range, the interquartile range or the standard deviation, for instance.
Another aspect to consider is which measures of the distribution you should provide. If the distribution is not normal, the median and interquartile range are better suited than the mean and standard deviation. You’d better check!
Are the results really better? What to do to check?
In order to compare the distribution of results between your method and already existing ones, you will need to go one step further than simply comparing the mean (or median) values qualitatively. A statistical test is necessary, but once again you need to choose wisely! If the distributions are continuous and the paired differences are normally distributed, then a paired t-test is suitable; if that is not the case, don’t forget the non-parametric Wilcoxon signed-rank test (see the sketch below)…
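A minimal sketch of this decision with SciPy, using made-up per-case DSC values for two methods: check whether the paired differences look normal, then pick the paired t-test or the Wilcoxon signed-rank test accordingly.

```python
import numpy as np
from scipy import stats

dsc_method_a = np.array([0.82, 0.88, 0.79, 0.91, 0.85, 0.77, 0.90, 0.84])
dsc_method_b = np.array([0.80, 0.84, 0.75, 0.90, 0.83, 0.70, 0.88, 0.81])

differences = dsc_method_a - dsc_method_b
stat, p_normal = stats.shapiro(differences)   # normality check on the paired differences
if p_normal > 0.05:                           # differences look normal
    print(stats.ttest_rel(dsc_method_a, dsc_method_b))
else:                                         # fall back to a non-parametric test
    print(stats.wilcoxon(dsc_method_a, dsc_method_b))
```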
If, beyond a single comparison, you want to rank multiple methods based on multiple metrics, solutions based on rank statistics are also available [17, 18].
Is your data hierarchical?
Often, the situation might be even more complicated. Consider the example of brain tumour segmentation. You are very proud of your heterogeneous dataset because it allows you to train generalizable algorithms that can be used in practice! The data has been collected from several hospitals, at each of which you investigate different patients, perhaps examined with different devices. Do you see the hierarchical structure? The different levels of your tree structure (as shown in Figure 13) may affect your results, introducing various statistical effects. The first thing to keep in mind is to present your metric values broken down by this structure: if you are showing boxplots of the metric values, show them for each hospital/device/… separately and report mean or median values individually for every hospital/device/….
To statistically analyze your results, you might use advanced methods like linear mixed models. But in this case, you should consult a statistician to ensure correctness of the results [19]. Keep in mind: It is always better to ask for help than to just do something. This holds true for everything we touched on in this post.
Conclusion
At first glance, validation seems to make up only a footnote within your method development pipeline. However, the topic is much more complex than many people think! The more you reflect on it, as we did in this blogpost, the more overwhelmed you may begin to feel, given the abundance of metrics to choose from and aspects to consider. It may seem hard to see the forest for the trees, but don’t worry. After reading this blogpost, we hope that when arriving at this stage of your work you will simply question your intended evaluation measures before using them! Do they reflect the underlying clinical problem? Do they cover the range of properties you want to assess? And remember, don’t use just any metric because your colleague does… Think first, then act!
Acknowledgements
This blog post was written by three members of a Delphi consortium with the goal of defining best practice recommendations for metrics in biomedical image analysis. The work was initiated by the Helmholtz Association of German Research Centers in the scope of the Helmholtz Imaging Platform (HIP).
We would further like to thank the Delphi consortium, namely Lena Maier-Hein, Paul Jäger, Matthias Eisenmann, Annette Kopp-Schneider, Tim Rädsch, Doreen Heckmann-Nötzel, Michela Antonelli, Tal Arbel, Spyridon Bakas, Peter Bankhead, M. Jorge Cardoso, Veronika Cheplygina, Beth Cimini, Keyvan Farahani, Ben Glocker, Patrick Godau, Fred Hamprecht, Daniel Hashimoto, Michael Hoffmann, Fabian Isensee, Pierre Jannin, Charles E. Kahn, Jens Kleesiek, Michal Kozubek, Tahsin Kurc, Bennett Landman, Geert Litjens, Amin Madani, Klaus Maier-Hein, Erik Meijering, Bjoern Menze, Henning Müller, Felix Nickel, Jens Petersen, Nasir Rajpoot, Mauricio Reyes, Michael Riegler, Nicola Rieke, Bram Stieltjes, Ronald Summers, Sotirios A. Tsaftaris, Bram van Ginneken and Anne Martel.
References
[1] Chicco, D., & Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC genomics, 21(1), 1–13.
[2] Reinke, A., Eisenmann, M., Tizabi, M. D., Sudre, C. H., Rädsch, T., Antonelli, M., … & Maier-Hein, L. (2021). Common limitations of image processing metrics: A picture story. arXiv preprint arXiv:2104.05642.
[3] Ravaut, M., Sadeghi, H., Leung, K. K., Volkovs, M., Kornas, K., Harish, V., … & Rosella, L. (2021). Predicting adverse outcomes due to diabetes complications with machine learning using administrative health data. NPJ digital medicine, 4(1), 1–12.
[4] Lobo, J. M., Jiménez‐Valverde, A., & Real, R. (2008). AUC: a misleading measure of the performance of predictive distribution models. Global ecology and Biogeography, 17(2), 145–151.
[5] Hossin, M., & Sulaiman, M. N. (2015). A review on evaluation metrics for data classification evaluations. International Journal of Data Mining & Knowledge Management Process, 5(2), 1.
[6] Bartko, J. J. (1966). The intraclass correlation coefficient as a measure of reliability. Psychological reports, 19(1), 3–11.
[7] Cardoso, J. S., & Sousa, R. (2011). Measuring the performance of ordinal classification. International Journal of Pattern Recognition and Artificial Intelligence, 25(08), 1173–1195.
[8] Maier-Hein, L., Eisenmann, M., Reinke, A., Onogur, S., Stankovic, M., Scholz, P., . . . others (2018). Why rankings of biomedical image analysis competitions should be interpreted with care. Nature communications, 9(1), 1–13.
[9] Huttenlocher, D. P., Klanderman, G. A., & Rucklidge, W. J. (1993). Comparing images using the Hausdorff distance. IEEE Transactions on pattern analysis and machine intelligence, 15(9), 850–863.
[10] Yeghiazaryan, V., & Voiculescu, I. D. (2018). Family of boundary overlap metrics for the evaluation of medical image segmentation. Journal of Medical Imaging, 5(1), 015006.
[11] Nikolov, S., Blackwell, S., Zverovitch, A., Mendes, R., Livne, M., De Fauw, J., … & Ronneberger, O. (2018). Deep learning to achieve clinically applicable segmentation of head and neck anatomy for radiotherapy. arXiv preprint arXiv:1809.04430.
[12] Taha, A. A., & Hanbury, A. (2015). Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC medical imaging, 15(1), 1–28.
[13] Nai, Y. H., Teo, B. W., Tan, N. L., O’Doherty, S., Stephenson, M. C., Thian, Y. L., … & Reilhac, A. (2021). Comparison of metrics for the evaluation of medical segmentations using prostate MRI dataset. Computers in Biology and Medicine, 134, 104497.
[14] Kofler, F., Ezhov, I., Isensee, F., Balsiger, F., Berger, C., Koerner, M., … & Menze, B. H. (2021). Are we using appropriate segmentation metrics? Identifying correlates of human expert perception for CNN training beyond rolling the DICE coefficient. arXiv preprint arXiv:2103.06205.
[15] Cárdenes, R., de Luis-Garcia, R., & Bach-Cuadra, M. (2009). A multidimensional segmentation evaluation for medical image data. Computer methods and programs in biomedicine, 96(2), 108–124.
[16] Crum, W. R., Camara, O., & Hill, D. L. (2006). Generalized overlap measures for evaluation and validation in medical image analysis. IEEE transactions on medical imaging, 25(11), 1451–1461.
[17] Antonelli, M., Reinke, A., Bakas, S., Farahani, K., Landman, B. A., Litjens, G., … & Cardoso, M. J. (2021). The Medical Segmentation Decathlon. arXiv preprint arXiv:2106.05735.
[18] Wiesenfarth, M., Reinke, A., Landman, B. A., Eisenmann, M., Saiz, L. A., Cardoso, M. J., … & Kopp-Schneider, A. (2021). Methods and open-source toolkit for analyzing and visualizing challenge results. Scientific Reports, 11(1), 1–15.
[19] Holland-Letz, T., & Kopp-Schneider, A. (2020). Drawing statistical conclusions from experiments with multiple quantitative measurements per subject. Radiotherapy and Oncology, 152, 30–33.