Introduction to Confusion Matrix [Classification Modeling]
In classification modeling, we generate a confusion matrix, which describes the model’s ability to identify the existing classes in a given data set. From the information in a confusion matrix, we can derive many metrics that measure how well a model fits a particular context.
Note: For multi-class classification, the matrix is N by N, where N is the number of classes in the data set.
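To make the N by N shape concrete, here is a minimal sketch that builds a confusion matrix from two lists of labels. The function name, the three class names and the sample labels are made up for illustration. Putting actual classes on rows and predicted classes on columns is an assumed convention, and some tools transpose it.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Return an N x N nested list.

    Assumed convention: rows are actual classes, columns are predicted classes.
    """
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(actual, predicted)] for predicted in labels]
            for actual in labels]

# Hypothetical three-class data, invented for this example
labels = ["cat", "dog", "bird"]
y_true = ["cat", "cat", "dog", "dog", "bird", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat", "bird"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>5}: {row}")
```

The diagonal holds the correctly classified samples, and every off-diagonal cell is a specific kind of mistake (for instance, a "bird" predicted as "cat").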
True Positives (TP) and True Negatives (TN) count the cases that the model has identified correctly. A True Positive means the model predicts the positive class and the actual class is positive. A True Negative means the model predicts the negative class and the actual class is negative.
False Positives (FP) and False Negatives (FN) are the errors of the model. An FP is a type 1 error and an FN is a type 2 error. In many applications, type 2 errors cause more trouble, because a missed case goes unnoticed.
For example, take a heart patient recognition system. The system might flag a person as a heart patient who actually is not one. That is not very risky: after some further tests, we can verify that the person is not a heart patient. But what if the system fails to identify a real heart patient? The patient will not go for further testing, and the case may end in tragedy.
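The four cells can be counted directly from actual and predicted labels. The sketch below assumes a binary problem where 1 means "heart patient" and 0 means "not a heart patient". The function name and the eight sample labels are hypothetical.

```python
def binary_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # type 1 error: a false alarm
        else:
            if actual == positive:
                fn += 1   # type 2 error: a missed patient
            else:
                tn += 1   # predicted negative, actually negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Made-up labels: 1 = heart patient, 0 = not a heart patient
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]

counts = binary_counts(y_true, y_pred)
print(counts)
```

In this toy data the single FN is the patient who would be sent home without further testing.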
Measures
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
F1 Score = 2(Precision * Recall) / (Precision + Recall) [1-Best, 0-Worst]
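The four measures translate almost line by line into code. This is a minimal sketch. The function name and the sample counts are hypothetical, and returning 0.0 when a denominator is zero is an assumption made here to avoid a division error, not part of the definitions.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, recall, precision and F1 from the four counts."""
    def safe_div(numerator, denominator):
        # Assumption: report 0.0 when a ratio is undefined (zero denominator)
        return numerator / denominator if denominator else 0.0

    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    recall = safe_div(tp, tp + fn)
    precision = safe_div(tp, tp + fp)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}

# Hypothetical counts, chosen only to exercise the formulas
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```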
Am I good or Am I bad?
By looking at a confusion matrix, there are things we can see directly: how the samples are distributed across the classes and how many fall in each cell. Ideally, we train on balanced classes. But what if the classes are not balanced?
For example, consider a confusion matrix with TP = 5135, FN = 75, FP = 220 and TN = 0.
This is an example of an imbalanced data set.
Let’s do some calculations.
Accuracy = (5135 + 0) / (5135 + 0 + 220 + 75) = 94.57%
Recall = 5135 / (5135 + 75) = 0.986
Precision = 5135 / (5135 + 220) = 0.959
F1 Score = 2(0.986 * 0.959) / (0.986 + 0.959) = 0.972
As you can see, the numbers above seem pretty good. But hey! Did you notice that the model totally failed to identify one class? TN is 0, so not a single negative case was recognized. Because the given data set has imbalanced classes, the model is only able to identify the majority class.
You need some form of adaptive learning to overcome this class imbalance problem. By feeding in the data as it is, we let machine learning algorithms reach wrong conclusions.
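One way to make this failure visible in numbers is to compute recall for each class separately. The sketch below uses the counts from the calculation above. Recall for the negative class, TN / (TN + FP), is not one of the measures listed earlier and is added here only as an illustration.

```python
# Counts taken from the example calculation
tp, tn, fp, fn = 5135, 0, 220, 75

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall_positive = tp / (tp + fn)   # share of actual positives that were found
recall_negative = tn / (tn + fp)   # share of actual negatives that were found

print(f"Accuracy:                {accuracy:.4f}")
print(f"Recall (positive class): {recall_positive:.3f}")
print(f"Recall (negative class): {recall_negative:.3f}")
```

The headline accuracy stays high because 5210 of the 5430 samples are positive, while recall for the negative class is exactly zero.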
Discussion
Do not get stuck on the values. You also need to understand the nature of the data. Sometimes it is hard to make all the classes equal in size, because some data is costlier to collect than other data. For example, collecting failure data from a machine costs more than collecting data during normal operation.
To overcome this class imbalance problem, there are several things you can do:
- Under Sampling
- Over Sampling
- Generation of synthetic data
- Define cost function for learning
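As a rough sketch of three of these options, the code below shows random under-sampling, random over-sampling, and inverse-frequency class weights that could feed a cost function. All function names and the toy "normal"/"failure" data are hypothetical, and the weighting formula is one common choice assumed here, not the only one. Synthetic data generation is left out because it needs a more involved method.

```python
import random
from collections import Counter, defaultdict

def group_by_class(samples, labels):
    groups = defaultdict(list)
    for sample, label in zip(samples, labels):
        groups[label].append(sample)
    return groups

def random_under_sample(samples, labels, seed=0):
    """Shrink every class to the size of the smallest class."""
    rng = random.Random(seed)
    groups = group_by_class(samples, labels)
    target = min(len(items) for items in groups.values())
    pairs = [(s, label) for label, items in groups.items()
             for s in rng.sample(items, target)]
    rng.shuffle(pairs)
    return pairs

def random_over_sample(samples, labels, seed=0):
    """Grow every class to the size of the largest class by repeating samples."""
    rng = random.Random(seed)
    groups = group_by_class(samples, labels)
    target = max(len(items) for items in groups.values())
    pairs = []
    for label, items in groups.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        pairs.extend((s, label) for s in items + extra)
    rng.shuffle(pairs)
    return pairs

def inverse_frequency_weights(labels):
    """Assumed weighting: total / (number_of_classes * class_count)."""
    counts = Counter(labels)
    return {label: len(labels) / (len(counts) * n) for label, n in counts.items()}

# Toy data: 95 samples from normal operation, 5 failure samples
samples = list(range(100))
labels = ["normal"] * 95 + ["failure"] * 5

under = random_under_sample(samples, labels)
over = random_over_sample(samples, labels)
weights = inverse_frequency_weights(labels)

print(Counter(label for _, label in under))
print(Counter(label for _, label in over))
print(weights)
```

Under-sampling throws away majority data, over-sampling repeats minority samples, and class weights leave the data untouched but make errors on the rare class more expensive during training.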
I hope to discuss these in my future posts.