Machine learning and deep learning technologies are being developed to solve problems such as face recognition and traffic sign classification, and recently there has been growing interest in applying machine learning classifiers to medical problems: tissue classification, predicting normal versus abnormal patients, predicting treatment response, and so on. Many diagnostic tests can be built from such classifiers. However, evaluating a diagnostic test is a matter of concern in modern science: confirming the presence of a disease is important, but ruling out disease in healthy patients is just as necessary. Sensitivity and specificity are the two terms most commonly used to quantify the accuracy of a diagnostic test. In this post, I will discuss these two terms and how they are computed. The post also explains meaningful interpretations of related parameters that can be used to determine the efficacy and adoption of a diagnostic test.
When a diagnostic test detects whether or not there is a tumor in a mammogram, it returns either a positive or a negative result about the presence of disease. The performance of a binary diagnostic test on groups with and without the disease is most easily understood with a 2×2 decision chart.
There are 4 parameters in the 2×2 decision chart:
1) True positive (TP): The test result is positive and the person actually has the disease (e.g., a tumor).
2) True negative (TN): The test result is negative and the person is actually healthy.
3) False positive (FP): The test result is positive but the person is healthy. The test failed to recognize a normal person and gave a false positive diagnosis, which could lead to expensive and painful treatment of a healthy patient, with potentially deleterious side effects on the patient’s health.
4) False negative (FN): The test result is negative but the person is diseased. The test failed to identify a diseased patient and gave a false negative diagnosis, leaving the disease undetected; this increases the probability that the disease becomes severe at a later stage and can be very harmful to the patient.
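As a minimal sketch of how these four counts arise in practice, we can tally them from ground-truth labels and test results. The vectors below are made-up illustrative data, not from the post (1 = diseased, 0 = healthy):

```python
# Made-up ground truth and test results (1 = diseased/positive, 0 = healthy/negative).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# Tally the four cells of the 2x2 decision chart.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

print(tp, tn, fp, fn)
```

The four counts always sum to the number of subjects, which is a useful sanity check on any confusion matrix.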
All 4 parameters must be taken into consideration to identify the efficacy of a diagnostic test. Let’s take the example of the 2×2 decision chart given below.
There are several questions that can be answered using the above table.
1. How many persons have the disease?
We can sum the numbers in the row titled disease present. In other words, TP + FN gives the total number of diseased patients in the binary decision chart.
TP + FN = 200
2. How many persons have no disease?
This can be calculated by summing the numbers in the row titled disease absent. In other words, FP + TN gives the total number of healthy persons in the binary decision chart.
FP + TN = 1000
3. How many persons participated in the study?
TP + FN + FP + TN = 1200
4. How can we evaluate the performance of a diagnostic test?
Sensitivity and specificity can be used; both are described in detail below.
Sensitivity = true positive fraction (TPF): Sensitivity is the ratio of TP to the actual number of positives (diseased persons). It is the probability of a positive test result for patients with the disease, i.e. the conditional probability of correctly identifying diseased subjects. Sensitivity tells us whether the test is sensitive enough to pick up the disease.
Specificity = true negative fraction (TNF): Specificity is the ratio of TN to the actual number of negatives (healthy persons). It is the conditional probability of correctly identifying non-diseased subjects, i.e. the proportion of people without the disease who get a negative test result.
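As a sketch, both quantities follow directly from the four counts. The split below is a hypothetical assumption consistent with the chart’s row totals (200 diseased, 1000 healthy); the post does not give the individual cell values:

```python
# Hypothetical split of the chart's row totals; the individual counts
# (180/20 and 900/100) are assumptions for illustration only.
tp, fn = 180, 20    # diseased:  tp + fn = 200
tn, fp = 900, 100   # healthy:   tn + fp = 1000

sensitivity = tp / (tp + fn)  # P(positive test | diseased)
specificity = tn / (tn + fp)  # P(negative test | healthy)
print(sensitivity, specificity)
```

Note that each denominator is a row total of the chart, not a column total; that distinction is what separates sensitivity/specificity from PPV/NPV later on.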
Sensitivity and specificity are important statistics, but other metrics are used in different situations to evaluate the performance of a diagnostic test. Some examples are given below:
Example 1: A patient undergoing an x-ray mammogram might ask, “If I have cancer, possibly due to family history, what is my chance of getting a negative result and delaying life-saving treatment?”
False negative fraction (FNF), or miss rate, answers this question. FNF is the ratio of FN to the actual number of positives (diseased persons).
Example 2: A patient undergoing an x-ray mammogram might ask, “If I am healthy, what is my chance of getting a positive result and undergoing a painful biopsy when I do not need one?”
False positive fraction (FPF), or false alarm rate, answers this question. FPF is the ratio of FP to the actual number of negatives (healthy persons).
Example 3: An insurance company might ask, “What is the probability that a costly treatment will be correctly applied?
Positive predictive value (PPV) is the answer here. It is defined as the probability of disease given a positive test result, i.e. the ratio of TP to the total number of positive test results (TP + FP).
Negative predictive value (NPV): It is defined as the probability of being healthy given a negative test result, i.e. the ratio of TN to the total number of negative test results (TN + FN).
Accuracy: It is defined as the probability of a correct diagnosis, i.e. the ratio of TP + TN to the total number of cases.
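These three metrics can be sketched with the same hypothetical counts used above (an assumed 180/20 and 900/100 split; not values from the post). Note that PPV and NPV divide by the chart’s column totals (predicted positives and predicted negatives), unlike sensitivity and specificity:

```python
# Same hypothetical counts (assumed split of 200 diseased / 1000 healthy).
tp, fn = 180, 20
tn, fp = 900, 100

ppv = tp / (tp + fp)                        # P(diseased | positive test)
npv = tn / (tn + fn)                        # P(healthy  | negative test)
accuracy = (tp + tn) / (tp + tn + fp + fn)  # fraction of correct diagnoses
print(round(ppv, 3), round(npv, 3), accuracy)
```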
PPV and NPV are the two indices most useful in clinical practice once test results are available to clinicians. However, they are influenced by the prevalence of the disease in the population. If prevalence is high, a person diagnosed as diseased is more likely to actually have the disease; in other words, PPV is elevated by a higher prevalence, while NPV decreases with a higher prevalence.
See Test 1 and Test 2 below for an illustration.
The same test (same sensitivity/specificity) gives a better PPV in a population with higher prevalence (0.20 → 0.38). That is why, when an insurance company wants to spend its money on follow-up or additional treatment, it will choose a setting where PPV is high, in other words, where disease prevalence is high. Similarly, it makes more sense to screen smokers for lung cancer than the general population, since the prevalence of lung cancer is higher among smokers. In conclusion, a company deciding whether to spend money on a diagnostic test will check the PPV and the prevalence of the disease. Sensitivity and specificity, on the other hand, are the two conventional parameters for assessing the effectiveness of a diagnostic test, and they are independent of disease prevalence.
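The prevalence effect can be sketched directly from Bayes’ rule, which expresses PPV in terms of sensitivity, specificity, and prevalence. The sensitivity/specificity/prevalence values below are illustrative assumptions, not the numbers behind the post’s Test 1 and Test 2:

```python
# Sketch: PPV as a function of prevalence for a fixed test (Bayes' rule).
# The 0.90/0.90 test and the two prevalence values are made-up assumptions.
def ppv(sens, spec, prev):
    # P(diseased | positive) = sens*prev / (sens*prev + (1-spec)*(1-prev))
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

low = ppv(0.90, 0.90, 0.02)   # screening a low-prevalence population
high = ppv(0.90, 0.90, 0.10)  # screening a high-prevalence population
print(round(low, 3), round(high, 3))
```

The same test yields a markedly higher PPV in the higher-prevalence population, which is exactly the pattern described above.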
A good diagnostic test has both high sensitivity and high specificity. In practice, however, sensitivity and specificity trade off against each other, so a balance must be struck between picking up the disease (high sensitivity) and correctly predicting its absence (high specificity). An effective measure of the accuracy of a diagnostic test is the receiver operating characteristic (ROC) curve, a plot of sensitivity versus 1 − specificity. The area under the curve (AUC) measures the diagnostic ability of a test and allows different diagnostic tests on the same subjects to be compared to identify which works best.
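As a final sketch, AUC can be computed without plotting the curve at all, via its rank interpretation: the probability that a randomly chosen diseased case receives a higher test score than a randomly chosen healthy one (the Mann–Whitney statistic). The labels and scores below are made-up illustrative values:

```python
# Sketch: AUC via the Mann-Whitney statistic, i.e. the probability that
# a random diseased case scores above a random healthy case (ties count 0.5).
def auc(labels, scores):
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]          # 1 = diseased, 0 = healthy
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]  # classifier outputs, higher = more diseased
print(auc(labels, scores))
```

An AUC of 0.5 corresponds to a test no better than chance, while 1.0 means the test separates the two groups perfectly.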