Scoring a classification model can be fairly simple; however, there are many metrics used to evaluate one. You should understand the nature of your problem when evaluating your model so you know which metrics are most important. The foundation of classification evaluation is the confusion matrix, seen below.
A confusion matrix is a representation of a model’s predictions. In this case, there are two outcomes: yes and no. Yes and no refer to whatever the model is evaluating. Let’s assume in this example we are looking at breast cancer scans: no means there is no cancer, and yes means cancer is present. The columns represent the model’s predictions and the rows represent the actual outcomes.
The confusion matrix shows us the results of 165 observations, denoted by “n”. We can see that in 50 cases, the model predicted there was no cancer and was correct. This is called a true negative, denoted “TN” inside the matrix. There were 5 cases in which our model predicted there was no cancer but cancer was in fact present. This is called a false negative (FN) because we falsely predicted the patient was negative for cancer. There were 10 cases in which we predicted the patient had cancer but in reality the patient did not. This is called a false positive (FP) because we falsely predicted the patient was positive for cancer. The last cell represents true positives (TP): cases in which we predicted a patient would have cancer and the patient did in fact have cancer. The breakdown of these cells helps us understand the strength of our model.
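As a quick sketch, the four cells can be tallied directly from paired actual/predicted labels. The counts below are the ones from this example; the variable names are my own:

```python
# Reconstruct the example's 165 observations as paired labels.
# Order: 50 TN, 5 FN, 10 FP, 100 TP.
actual    = ["no"] * 50 + ["yes"] * 5 + ["no"] * 10 + ["yes"] * 100
predicted = ["no"] * 50 + ["no"] * 5 + ["yes"] * 10 + ["yes"] * 100

# Each cell is just a count of (actual, predicted) pairs.
tn = sum(1 for a, p in zip(actual, predicted) if a == "no" and p == "no")
fn = sum(1 for a, p in zip(actual, predicted) if a == "yes" and p == "no")
fp = sum(1 for a, p in zip(actual, predicted) if a == "no" and p == "yes")
tp = sum(1 for a, p in zip(actual, predicted) if a == "yes" and p == "yes")

print(tn, fn, fp, tp)  # 50 5 10 100
```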
Accuracy score is one of the most common metrics you will come across. It is simply the number of predictions that were correct divided by the total number of predictions made. In terms of a confusion matrix, accuracy score is (true positives + true negatives) / n.
In our example above our accuracy score would be [ (100 + 50) / 165 ] or 0.909.
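In Python, that calculation looks like this (cell counts taken from the example above; variable names are my own):

```python
# The four cells from the example confusion matrix.
tp, tn, fp, fn = 100, 50, 10, 5

n = tp + tn + fp + fn          # 165 total observations
accuracy = (tp + tn) / n       # correct predictions over all predictions

print(round(accuracy, 3))  # 0.909
```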
What are the shortcomings of accuracy scores? Class imbalances. Let’s say 95% of women do not have breast cancer. If we simply predict that every woman doesn’t have breast cancer, we will have a 95% accuracy score. For this reason, accuracy score can be misleading. If our model predicted correctly with 95% accuracy, many would assume that is a tremendous model; however, the model is essentially worthless, since we can guess correctly 95% of the time without it. When presenting accuracy scores, a baseline score should be included to show how much our model beats the baseline. A baseline score in this case would be the percentage of observations in the most common class.
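A minimal sketch of that majority-class baseline, assuming a hypothetical label set with the 95/5 split described above:

```python
from collections import Counter

# Hypothetical labels with a 95/5 class imbalance (not real data).
labels = ["no"] * 95 + ["yes"] * 5

# Baseline accuracy: always predict the most common class.
most_common_count = Counter(labels).most_common(1)[0][1]
baseline = most_common_count / len(labels)

print(baseline)  # 0.95
```

A model's accuracy is only impressive to the extent it exceeds this number.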
Precision:
Precision = True Positives / Total number of predicted positives.
In our example above, our precision would be ( 100 / 110 ) or 0.909.
Recall / Sensitivity:
Recall = True Positives / Total number of real positives
In our example above, our recall would be ( 100 / 105 ) or 0.952
Specificity is the same as sensitivity except it looks at negatives instead of positives.
Specificity = True Negatives / Total number of real negatives.
In our example above, our specificity would be ( 50 / 60 ) or 0.833
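All three of these ratios come from the same four cells. A short sketch using the example’s counts (variable names are my own):

```python
# The four cells from the example confusion matrix.
tp, tn, fp, fn = 100, 50, 10, 5

precision   = tp / (tp + fp)   # TP / predicted positives = 100 / 110
recall      = tp / (tp + fn)   # TP / real positives      = 100 / 105
specificity = tn / (tn + fp)   # TN / real negatives      = 50 / 60

print(round(precision, 3), round(recall, 3), round(specificity, 3))
# 0.909 0.952 0.833
```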
How can I possibly remember this?
Remembering the difference between precision and recall is a challenge so I use the following trick:
Precision = TP / Predicted Positives
Recall = TP / Real Positives
Sensitivity and specificity only deal with real positives or real negatives, not predictions.
If you can remember that recall is the same as sensitivity and deals with positives, you will know that specificity has to deal with negatives. This technique is a bit clunky, but it’s the best I’ve got. If it doesn’t work for you, try to find your own mnemonic and let me know what you come up with.
Why is it important to have these different metrics? Depending on one’s use case, each metric may have more or less relative importance. Accuracy score seems the most obviously useful, but it is not always the most important. In the case of breast cancer, would it be more useful to have a good accuracy score or to limit the number of false negatives? This is a bit subjective, but I would suggest that limiting false negatives is of paramount importance. If a person has cancer but is told he/she does not have cancer, that person will not receive treatment. Thus, accuracy score would be relatively less important than limiting false negatives in this case.
Again, determining which metric is most useful is case specific and is somewhat subjective. That is why understanding the implications of each metric is so important. Without a grasp of each metric, you will not be able to fully understand the efficacy of your model.