When we train machine learning algorithms for classification tasks, accuracy seems to be a meaningful metric. For example, if we are training a classifier to predict whether a picture contains a cat or a dog, accuracy might be a good way to measure the performance of the trained model. Model accuracy is defined as the number of correct predictions divided by the total number of predictions made. So the higher the accuracy, the better the model.
But is this always true? Accuracy can be a good metric in many scenarios, but for some problems it can be very misleading, especially when we are modelling a rare phenomenon. For example, if I have to create a model to predict whether a meteor will hit the Earth tomorrow and I always predict false, I will be right almost all of the time! The accuracy of this model will be very high, but it does not give us any useful information at all.
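To make the meteor example concrete, here is a minimal sketch in plain Python. The data is made up for illustration (1000 days with a single strike), and the `accuracy` helper is my own name, not taken from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical data: 1000 days, a meteor strike (label 1) on exactly one of them.
y_true = [0] * 999 + [1]

# A "model" that always predicts no strike (label 0).
y_pred = [0] * 1000

print(accuracy(y_true, y_pred))  # 0.999
```

The score looks excellent, yet the model never flags the one day that mattered.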
Thus we see that accuracy is not always the best way to measure the predictive capability of a model. In fact, accuracy can be misleading whenever our classification problem is imbalanced, for example when determining whether a patient has cancer. The biggest drawback of accuracy is that it treats every error as equal, and in many scenarios that is not the case. So how do we measure a model’s effectiveness in such cases?
Let’s have a common vocabulary so we do not confuse ourselves.
true positive — model predicts cancer when the person has cancer
true negative — model predicts no cancer when the person does not have cancer
false positive — model predicts cancer when the person does not have cancer
false negative — model predicts no cancer when the person has cancer
True positives and true negatives happen when the model predicts correctly. False negatives and false positives are errors in prediction. For a binary classification problem, only these two kinds of errors can happen.
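The four outcomes can be counted directly from the labels and predictions. This is a small sketch assuming labels are encoded as 1 for "has cancer" and 0 for "no cancer"; the function name `confusion_counts` and the six-patient sample are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 means 'has cancer'."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1  # predicted cancer, patient has cancer
        elif truth == 0 and pred == 0:
            tn += 1  # predicted no cancer, patient has no cancer
        elif truth == 0 and pred == 1:
            fp += 1  # predicted cancer, patient has no cancer
        else:
            fn += 1  # predicted no cancer, patient has cancer
    return tp, tn, fp, fn


# Hypothetical sample of six patients.
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]

print(confusion_counts(y_true, y_pred))  # (2, 2, 1, 1)
```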
In the cancer detection example, we are most concerned with finding the patients who actually have cancer, and we want to maximize our model’s ability to find them. Patients who have cancer are the actual positives and those who don’t are the actual negatives. Since we care most about the actual positives, we want the model to correctly flag as many of them as possible, which means having many true positives and few false negatives. In statistics this metric is known as recall, the ability of the classification model to identify all relevant examples. It tells us what proportion of actual positives were identified correctly.
Recall = true positive / (true positive + false negative)
Suppose we are using a naive model which predicts that no patient has cancer. In that case there are no true positives, and we end up with zero recall! Judged by recall this model is absolutely useless, even though the same model can have a very high accuracy.
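Here is a rough sketch of that contrast. I am assuming a made-up sample of 100 patients of whom 5 have cancer; the helper `recall` is written by hand here and returns 0.0 when there are no actual positives, which is a convention I chose for the sketch.

```python
def recall(y_true, y_pred):
    """true positive / (true positive + false negative), with 1 = has cancer."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumption: report 0.0 when the sample has no actual positives.
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical sample: 5 patients with cancer, 95 without.
y_true = [1] * 5 + [0] * 95

# Naive model: nobody has cancer.
y_pred = [0] * 100

naive_accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
naive_recall = recall(y_true, y_pred)

print(naive_accuracy)  # 0.95
print(naive_recall)    # 0.0
```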
Let’s change the model we are using here. Now let us consider a model which always classifies the patient as having cancer. In this case the false negatives become zero and thus the recall is equal to one. At a cursory look a model with a recall of one might seem pretty good, since it detects every patient with cancer, but it also misclassifies every patient who does not have cancer! Our model is not very precise.
Precision is another metric which goes hand in hand with recall. It is the ability of the model to identify only relevant samples. It answers what proportion of positive identifications was actually correct.
Precision = (true positive) / (true positive + false positive)
Now let us calculate the precision of a model that classifies every patient as having cancer. The true positives in this case are few, since only a small fraction of the sample actually has cancer, while the false positives are many, since the model classifies everyone as having cancer. The precision therefore equals the fraction of patients who actually have cancer, which will be close to zero. So we went from one extreme to another: a model with very high recall can have very low precision.
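As a quick check of that reasoning, the sketch below computes precision and recall for the always-positive model. The sample (5 patients with cancer out of 100) is invented for illustration, and `precision_and_recall` is a hand-written helper, not a library function.

```python
def precision_and_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 means 'has cancer'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumption: report 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical sample: 5 patients with cancer, 95 without.
y_true = [1] * 5 + [0] * 95

# Model that classifies every patient as having cancer.
y_pred = [1] * 100

precision, recall = precision_and_recall(y_true, y_pred)
print(precision)  # 0.05
print(recall)     # 1.0
```

With everyone flagged, precision is just the share of the sample that has cancer, 5 out of 100 here.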
Precision and recall are often in tension. That is, improving precision typically reduces recall and vice versa. In the case of cancer detection we usually want our model to have high recall, since we do not want it to classify a person as not having cancer when the person actually has it. We want to avoid false negatives. In exchange we are willing to accept some false positives, where the model predicts cancer for a person who does not actually have it.
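One common way this tension shows up is through a decision threshold. The sketch below assumes a model that outputs a score between 0 and 1 and flags a patient when the score reaches the threshold; the scores, labels, and the `precision_recall_at` helper are all invented for illustration.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when predicting 1 for every score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    # Assumption: report 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical model scores and true labels for ten patients.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

# Strict to lenient: fewer false alarms at the top, fewer misses at the bottom.
for threshold in (0.85, 0.5, 0.25):
    precision, recall = precision_recall_at(threshold, scores, labels)
    print(threshold, round(precision, 2), round(recall, 2))
```

On this toy data, lowering the threshold raises recall from 0.5 to 1.0 while precision falls from 1.0 to about 0.57. For cancer screening we would lean towards the lower threshold.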
There is no right or wrong answer as to which metric you should focus on while training a classifier. Most of the time it depends on the problem at hand. A more balanced metric to consider is the F1 score, which is the harmonic mean of precision and recall. This metric gives equal weight to precision and recall and punishes extreme values.
F1 score = 2 (recall)(precision) / (recall + precision)
Thus, if either recall or precision is zero, the F1 score is zero. The model that classified everyone as having cancer, with a recall of 1 and a precision close to 0, will have a very low F1 score.
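To see how the harmonic mean punishes extreme values, here is a small sketch. The function name `f1` is my own, and the precision of 0.05 is an assumed figure standing in for "close to zero".

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # both are zero, so the score is zero as well
    return 2 * precision * recall / (precision + recall)


# Assumed values for a model that flags everyone: recall 1, precision 0.05.
precision, recall = 0.05, 1.0

harmonic = f1(precision, recall)
arithmetic = (precision + recall) / 2

print(round(harmonic, 3))    # 0.095
print(round(arithmetic, 3))  # 0.525
```

A plain average would still give this model a middling score of 0.525, while the F1 score stays close to the weaker of the two numbers.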
For a more involved discussion around types of errors in binary classification and how we design classifiers you can have a look at one of my previous blog posts.