Specificity and Sensitivity in Data Science

by Dr Shruti Turner


In data science, evaluation metrics are crucial for assessing the performance and effectiveness of models. A classification model in data science is an algorithm that categorises data into predefined classes or labels based on input features. It is used to predict the class or category to which new data points belong. These models are used across business contexts and industries, e.g. marketing, auditing, fraud detection, medical diagnoses.

At Haleon, we use metrics as a basis to evaluate our models to make sure they are scientifically robust, choosing the metrics to optimise based on the specific business need and the risks associated with the context. We want to create tools that not only meet the business need but do so in a transparent way that we can explain and justify. We want to deliver trusted science, which forms the foundation of our data science. Without using the appropriate success metrics, we may not know the robustness of our models, or even know if they are producing outputs correctly. It is important that we understand the strengths and weaknesses of our models to be able to determine the risks we are taking when using them.

Specificity and sensitivity are used to evaluate classification models. These metrics help evaluate the performance of a model in distinguishing between different classes, providing a more nuanced understanding of its effectiveness than overall accuracy alone. However, the only way to know how a model should be optimised is to understand the context in which it is being applied, as that context determines which metrics should be prioritised for optimisation.

In this article, we will explore the components that underlie sensitivity and specificity, then dive into the details of these two metrics and how they can be utilised, with a worked example.

The Confusion Matrix

To really understand specificity and sensitivity, we first need to understand the confusion matrix.

It provides a breakdown of the model’s predictions compared to the actual labels, helping to identify how many data points were correctly or incorrectly labelled (or classified). The possible relationships between the model’s label and the actual label can be represented in a matrix (figure 1) or visually (figure 2).

Figure 1: Confusion Matrix
Figure 2: Visual representation of Confusion Matrix

In Figure 2, the small circles represent the data points in the sample (total 23). The small circles outside the circular boundary are the data points that the model has classified as negative (total 14), split between the data points correctly classified as negative (true negatives — 9) and the data points incorrectly classified as negative (false negatives — 5).

Inside the boundary of the circle are the data points that the model classified as positive. These are split between the data points correctly classified (true positives — 5) and incorrectly classified (false positives — 4). The visualisation shows that the number of true positives, false positives, true negatives, and false negatives must add up to the total number in the sample.
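As a minimal sketch, this breakdown can be reproduced in Python with scikit-learn, using hypothetical label arrays constructed purely to match the counts in figure 2:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels matching figure 2:
# 9 true negatives, 4 false positives, 5 false negatives, 5 true positives (23 points).
y_true = np.array([0] * 13 + [1] * 10)                     # actual labels: 13 negatives, 10 positives
y_pred = np.array([0] * 9 + [1] * 4 + [0] * 5 + [1] * 5)   # labels assigned by the model

# For binary labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)     # 9 4 5 5
print(tn + fp + fn + tp)  # 23, the four counts always sum to the sample size
```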

Using this understanding, we can delve into two important metrics, sensitivity and specificity.

Sensitivity

Sensitivity, also known as recall or the true positive rate, measures the proportion of data points with positive labels that are correctly classified by the model.
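In terms of the confusion matrix, this can be written as:

Sensitivity = TP / (TP + FN)

where TP is the number of true positives and FN is the number of false negatives.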

Importance of Sensitivity

Sensitivity reflects the model’s effectiveness in classifying data points with a positive label. A high sensitivity value indicates that the model successfully labelled most of the data points with an actual positive label, minimising the number of false negatives (data points with a label of positive that the model has classified as negative).

If a model has high sensitivity, it means the model correctly classified a high proportion of actual positives as positives. When sensitivity is high, the denominator (TP + FN) is only slightly larger than the numerator (TP), indicating few false negative classifications.

Sensitivity is particularly important in scenarios where missing a positive label can have serious consequences. For instance, in medical diagnostics, high sensitivity ensures that most patients with a disease are correctly classified and receive necessary treatment. This minimises the risk of untreated conditions, where the model may classify a patient with a disease as not having one.

Specificity

Specificity, also known as the true negative rate, measures the proportion of data points with a negative label that are correctly classified by the model.
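In terms of the confusion matrix:

Specificity = TN / (TN + FP)

where TN is the number of true negatives and FP is the number of false positives.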

Importance of Specificity

Specificity reflects the model’s effectiveness in correctly classifying data points with a negative label. A high specificity value indicates that the model successfully classifies most of the data points with actual negative labels, minimising the number of false positives (data points with a label of negative that the model has classified as positive).

If a model has high specificity, it means the model correctly classifies a high proportion of actual negatives as negatives. When specificity is high, the denominator (TN + FP) is only slightly larger than the numerator (TN), indicating few false positive classifications.

Specificity is particularly important in scenarios where a false positive is costly. In medical diagnostics, for instance, high specificity helps avoid misdiagnosing healthy individuals as diseased, preventing unnecessary anxiety and treatment, both of which can have serious consequences for health and finances, amongst other factors.

There is, however, a trade-off between these metrics, which must be balanced for the specific need. The prioritisation of metrics is determined by the real-world application in which the model will be used: what exactly is “riskier” — a false positive or a false negative?

Worked Example

Context

Let’s say you are working for a hospital, and you are tasked with checking whether patients are “at risk” for a condition based on their medical history. Instead of doing this manually, you use a classification model to classify patients into “at risk” (i.e., a positive label) or not (i.e., a negative label).

You want to know how good your model is, so you see how well the model predicts the labels for a sample of 100 patients, 40 of which are “at risk” and 60 of which are not.

When you compare the output of the model to the actual labels, you find that of the 40 “at risk” patients, the model only classified 35 of them correctly. Of the 60 patients not “at risk”, 15 patients were classified incorrectly.

How can we use this information to evaluate the model’s performance?

Thinking in terms of the confusion matrix, we should identify the number of patients in each category that were classified correctly and incorrectly, i.e. calculate TP, TN, FP, FN.

Directly from the context, we can deduce the following (figure 3):

TP — the number of “at risk” patients the model classified correctly (35)

FP — the number of patients who were not at risk classified incorrectly (15)

We know from the Confusion Matrix that actual positives are a sum of the true positives and the false negatives, and that the actual negatives are a sum of the false positives and the true negatives. Ergo, we can calculate TN and FN:

TN — the number of patients who were not at risk classified correctly (60 − 15 = 45)

FN — the number of “at risk” patients the model classified incorrectly (40 − 35 = 5)

Figure 3: Completed Confusion Matrix for Worked Example

We can substitute these values into the sensitivity and specificity calculations to understand the model's performance:

Specificity = 45 / (45+15) = 45 / 60 = 0.750

Sensitivity = 35 / (35+5) = 35 / 40 = 0.875
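The same arithmetic can be checked with a few lines of Python, using the counts from this hypothetical example:

```python
# Counts from the worked example (hypothetical data)
tp, fn = 35, 5    # the 40 "at risk" patients
tn, fp = 45, 15   # the 60 patients not "at risk"

sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate

print(f"Sensitivity: {sensitivity:.3f}")   # Sensitivity: 0.875
print(f"Specificity: {specificity:.3f}")   # Specificity: 0.750
```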

What does this tell us?

The model has a specificity of 0.750, which means that it correctly classifies patients who are not at risk 75.0% of the time. Its sensitivity of 0.875 shows that it classifies “at risk” patients correctly 87.5% of the time.

For this context, we can interpret the specificity of 0.750 as follows: for every 100 patients who are not “at risk”, 25 (or 1/4 of them) will be classified as “at risk” and will receive interventions or treatments even though they don’t need them.

From the sensitivity of 0.875, we can deduce that 12.5% of patients who are “at risk” are not classified correctly, meaning that for every 100 patients who are “at risk”, 13 (rounding 12.5 to whole humans!) will not receive the appropriate treatment or intervention, which could potentially prevent the condition and save their lives.

From here, it is up to you (and your stakeholders) to determine whether this is an acceptable level of risk. Or should you be refining the model to improve one or both of these metrics? Which is the more important one — is it better to treat people who don’t need it or not treat people who do? This question can really only be answered by experts in the medical field for each real disease, as the cost, time, and risk of these is not known in this hypothetical example.

Sensitivity and Specificity Trade Offs

Determining the priority between high sensitivity and high specificity is crucial for developing robust classification models. Achieving the right prioritisation depends on the specific application and the consequences of false positives and false negatives. There are some strategies and guidelines to help you balance these metrics:

Adjusting the Decision Threshold

The decision threshold is the probability cutoff at which a model assigns a positive label. By adjusting this threshold, you can shift the balance between sensitivity and specificity: lowering the threshold increases sensitivity (more positives identified) but decreases specificity (more false positives), and vice versa.
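As a rough sketch of what this looks like in practice (using a synthetic dataset and a logistic regression model purely for illustration; the threshold of 0.3 is an arbitrary example, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data and a simple model, purely for illustration
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

# model.predict() uses an implicit 0.5 cutoff on the positive-class probability
probabilities = model.predict_proba(X_test)[:, 1]

# Lowering the cutoff labels more data points as positive, which tends to raise
# sensitivity at the cost of specificity; raising it does the opposite.
threshold = 0.3
y_pred_low_threshold = (probabilities >= threshold).astype(int)
```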

Cost-Benefit Analysis

Evaluate the costs and benefits of false positives and false negatives for your specific application. This analysis helps determine whether to prioritise sensitivity or specificity. In disease screening, the cost of missing a disease (false negative) is often higher than the cost of a false alarm (false positive), thus prioritising sensitivity.

We can also introduce other metrics derived from the confusion matrix, such as precision or the F1 score, to complement the insights from specificity and sensitivity.
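For reference, both combine the same four counts:

Precision = TP / (TP + FP)

F1 = 2 × (Precision × Sensitivity) / (Precision + Sensitivity)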

It might seem that using many metrics (even ALL the metrics!) would be the way to get the most robust model. However, as we have seen with sensitivity and specificity, there may be trade-offs in performance due to the conflicting focuses of the metrics. There is also a resource cost involved in optimising an increasing number of metrics, if doing so is even possible. For each case, there will be an optimum choice and number of metrics to use, based on the resources available for the project (including time, money, and computational power).

Round Up

Understanding and correctly using specificity and sensitivity is fundamental in developing robust classification models, with known weaknesses and strengths. These metrics provide insights into a model’s performance, particularly in distinguishing between positive and negative labels. Sensitivity measures the model’s ability to correctly identify positive labels, and specificity evaluates the model’s ability to correctly identify negative labels.

Incorporating specificity and sensitivity into your evaluation process ensures an optimised approach to model performance, helping you to make informed decisions based on the trade-offs between these metrics.

Using metrics to evaluate models and understand their strengths and weaknesses is a vital part of Trusted Science. It makes models more understandable to their users, so they can make an informed choice about using them. Machine Learning should not just be a black box we know nothing about!
