Sticky Simple Explanation of Recall, Precision, Specificity, Sensitivity, Type I and II errors
Easy to get, easy to forget. Here is an attempt to make it stick in the mind. We remember stories, so here is a small story.
Pi suspects that it has a disease affecting its rationality. It goes for a test for the disease with binary outcomes. Either it has the disease or it does not, i.e. Pi is either irrational or rational. This test is a Binary Classifier, with positive and negative outcomes. It would be nice if the test predicted positive only if Pi has the disease and negative only if Pi does not. But tests make mistakes, and Pi has to choose how to read the results, or how to pick the best test from a given set of tests.
Now, among errors too, some are bigger than others. And when Pi has to choose between two evils, Pi would rather accept the lesser one. It is not nice for the test to be positive when Pi does not have the disease (a false positive), but it is really bad if the test says negative when Pi has it (a false negative). Bad because Pi goes untreated, and even if it was still mostly rational, it becomes completely irrational. So, when in doubt, we would like to flag it positive. That is the general intuition behind what is discussed below.
There is a predicted class and an actual class. These two classes may or may not match. The naming of the match is from the prediction's point of view. If the classifier truly predicts positive, it is a true positive (TP), i.e. both the predicted and the actual class are positive. When the classifier falsely predicts positive, it is a false positive (FP): the actual class is negative, counter to the predicted class of positive. True negatives (TN) and false negatives (FN) are named the same way.
So, false negatives are very bad for a disease-predicting test. Or, in the terms explained further down, Type II errors are very bad.
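To make the naming concrete, here is a minimal sketch that counts the four outcomes from two lists of labels. The function name `confusion_counts` and the sample data are made up for illustration, and `True` is assumed to mean "has the disease".

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN for binary labels (True = positive)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p and a:
            tp += 1      # predicted positive, actually positive
        elif p and not a:
            fp += 1      # predicted positive, actually negative
        elif not p and not a:
            tn += 1      # predicted negative, actually negative
        else:
            fn += 1      # predicted negative, actually positive
    return tp, fp, tn, fn


# Hypothetical results for six patients.
actual    = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]

tp, fp, tn, fn = confusion_counts(actual, predicted)
print(tp, fp, tn, fn)  # 2 1 2 1
```

Note that the first word (true/false) says whether the prediction matched reality, and the second word (positive/negative) is always what the classifier predicted.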
In this context, we have two questions. When the test says positive, how often is it correct? And when Pi does not have the disease, how often does the test correctly say negative? Note this is different from total accuracy, which is the number of trues (positives and negatives) divided by the total number of results (trues and falses), i.e.
Accuracy = (TP+TN)/(TP+TN+FP+FN)
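A small worked example shows why accuracy alone can mislead. The numbers below are hypothetical: 100 patients, 5 of whom have the disease, and a lazy test that always says negative.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all results that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical lazy test: it never predicts positive.
tp, fp = 0, 0    # no positive predictions at all
tn, fn = 95, 5   # 95 healthy patients called negative, 5 sick patients missed

print(accuracy(tp, tn, fp, fn))  # 0.95
```

The test scores 95% accuracy while missing every single patient who has the disease, which is why the more specific measures are needed.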
So, for the first question of how precise the test is when it points out a positive, the measure is Precision,
Precision = TP/(TP+FP)
For the second question, how good the test is at pointing out negatives, i.e. how many of the actual negatives it correctly calls negative, the measure is Specificity,
Specificity = TN / (TN+FP)
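As a rough sketch, the two formulas translate directly into code. The function names and the counts are made up for illustration, and the sketch assumes the denominators are not zero.

```python
def precision(tp, fp):
    """Of everything the test called positive, how much really was positive."""
    return tp / (tp + fp)


def specificity(tn, fp):
    """Of everything that really was negative, how much the test called negative."""
    return tn / (tn + fp)


# Hypothetical counts for 100 patients.
tp, fp, tn, fn = 8, 2, 88, 2

print(precision(tp, fp))    # 0.8
print(specificity(tn, fp))  # 88 / 90, a little under 0.98
```

Both measures are hurt by false positives, but they divide by different totals: Precision by the predicted positives, Specificity by the actual negatives.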
So Precision and Specificity are related. One is for positives, the other for negatives. And the words look related: specific and precise. Strictly, Specificity is computed over the actual negatives (TN+FP) rather than over the predicted negatives, so treat the pairing as a memory aid. Let's move on to recall.
Recall in the normal sense is whether you can fetch something that you remembered. How good is an information retrieval system at fetching all the relevant documents? In this case, how good is a system at finding all the positives? That is, how many positives were identified out of all the positives actually present. Note that False Negatives are actually positives.
Recall = TP/(TP+FN)
Sensitivity is the same as Recall,
Sensitivity = Recall
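Here is a small sketch comparing two tests on recall. Both tests and their counts are hypothetical: each sees the same 10 patients who really have the disease, but test A flags positive when in doubt and test B does not.

```python
def recall(tp, fn):
    """Of everything that really was positive, how much the test found."""
    return tp / (tp + fn)


# Sensitivity is just another name for the same measure.
sensitivity = recall

# Hypothetical: 10 patients actually have the disease.
print(recall(tp=9, fn=1))       # test A: 0.9
print(sensitivity(tp=6, fn=4))  # test B: 0.6
```

Test B lets four sick patients go home untreated, so for a disease test Pi would prefer test A, even if A raises a few more false alarms.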
Ready for some more serious terms? We are almost at the end.
A hypothesis (plural hypotheses) is a proposed explanation for a phenomenon. In inferential statistics, the null hypothesis is a general statement or default position that nothing new is happening, for example that there is no association among groups or no relationship between two measured phenomena. Here the null hypothesis is true when the class is negative. So, in an FP, the class was actually negative, i.e. the null hypothesis was true, but the prediction was contrary. This is called a type I error, where a true null hypothesis was rejected.
- A type I error (or error of the first kind) is the rejection of a true null hypothesis. Usually, a type I error leads to the conclusion that a supposed effect or relationship exists when in fact it does not. Examples of type I errors include a test that shows a patient to have a disease when in fact the patient does not have the disease, a fire alarm going off when in fact there is no fire, or an experiment indicating that a medical treatment should cure a disease when in fact it does not.
- A type II error (or error of the second kind) is the failure to reject a false null hypothesis. Some examples of type II errors are a blood test failing to detect the disease it was designed to detect, in a patient who really has the disease; a fire breaking out and the fire alarm does not ring; or a clinical trial of a medical treatment failing to show that the treatment works when really it does.
In disease classification, Type II errors are bad. A prediction of no disease when a patient has it would cause the patient to not be treated in time.
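To tie the two error types back to the earlier counts, here is a minimal sketch. The function names and the numbers are made up for illustration. A type I error is a false positive and a type II error is a false negative, so their rates work out to 1 - Specificity and 1 - Recall.

```python
def type_i_error_rate(fp, tn):
    """False positive rate: share of actual negatives wrongly flagged positive."""
    return fp / (fp + tn)


def type_ii_error_rate(fn, tp):
    """False negative rate: share of actual positives the test missed."""
    return fn / (fn + tp)


# Hypothetical counts: 100 healthy patients, 10 sick patients.
tp, fp, tn, fn = 8, 10, 90, 2

print(type_i_error_rate(fp, tn))   # 0.1, i.e. 1 - Specificity
print(type_ii_error_rate(fn, tp))  # 0.2, i.e. 1 - Recall
```

For Pi's disease test, the second number is the one to push down first, even at the cost of letting the first one rise a little.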