Introduction to Confusion Matrix [Classification Modeling]

Roshan Alwis
Published in Tech Vision
Oct 30, 2016

In classification modeling, a confusion matrix describes a model’s ability to identify the classes present in a given data set. From the information in the confusion matrix, we can derive a number of metrics that measure how well a model fits a particular context.

Figure 1: Sample confusion matrix layout for binary classification model

Note: For multi-class classification, the matrix is N by N, where N is the number of classes in the data set.
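To make the N by N shape concrete, here is a minimal sketch in plain Python. The function name `confusion_matrix`, the three animal labels, and the sample predictions are all made up for illustration, and the layout (rows for actual classes, columns for predicted classes) is an assumed convention that may differ from the one in Figure 1.

```python
def confusion_matrix(actual, predicted, labels):
    """Build an N x N matrix as a list of lists.

    Assumed convention: rows are actual classes, columns are predicted classes.
    """
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


# Hypothetical three-class example
labels = ["cat", "dog", "bird"]
actual = ["cat", "dog", "bird", "cat", "dog", "bird"]
predicted = ["cat", "dog", "cat", "cat", "bird", "bird"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)

# Correct predictions sit on the diagonal.
correct = sum(matrix[i][i] for i in range(len(labels)))
print("correct:", correct, "of", len(actual))
```

With two labels, the same function produces the 2 by 2 layout used in the rest of this post.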

True Positives and True Negatives are the cases that the model has identified correctly. A True Positive means the model predicts a case as positive when it is actually positive; a True Negative means the model predicts a case as negative when it is actually negative.

Figure 2: Type 2 & Type 1 errors

False Positives and False Negatives are the model’s errors. An FP is a type 1 error and an FN is a type 2 error. Type 2 errors are often the more troublesome of the two, because a missed case gives us false reassurance.

For example, take a system that identifies heart patients. The system might flag a person as a heart patient who is actually healthy (a false positive). That is not very risky: follow-up tests can confirm that the person is not a heart patient. But what if the system fails to identify an actual heart patient (a false negative)? The patient will not go for further testing, and the outcome could be tragic.
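The four outcomes described above can be counted directly from paired lists of actual and predicted labels. This is only a sketch: `count_outcomes` is a hypothetical helper, and the six screening results are invented to mirror the heart patient example (True means "heart patient").

```python
def count_outcomes(actual, predicted, positive=True):
    """Count TP, TN, FP and FN for a binary problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # type 1 error: predicted positive, actually negative
        elif a == positive:
            fn += 1  # type 2 error: predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical screening results: True = heart patient
actual = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]

counts = count_outcomes(actual, predicted)
print(counts)
```

Here the single FN is the patient who would be sent home without further testing.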

Measures

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Recall = TP / (TP + FN)

Precision = TP / (TP + FP)

F1 Score = 2 * (Precision * Recall) / (Precision + Recall) [1 = best, 0 = worst]
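The four measures translate directly into code. The sketch below uses made-up counts, and the function names are illustrative. Returning 0.0 when a denominator is zero is an assumption made here to keep the functions total; other conventions exist.

```python
def safe_divide(numerator, denominator):
    # Assumption: report 0.0 when the denominator is zero.
    return numerator / denominator if denominator else 0.0


def accuracy(tp, tn, fp, fn):
    return safe_divide(tp + tn, tp + tn + fp + fn)


def recall(tp, fn):
    return safe_divide(tp, tp + fn)


def precision(tp, fp):
    return safe_divide(tp, tp + fp)


def f1_score(tp, fp, fn):
    p = precision(tp, fp)
    r = recall(tp, fn)
    return safe_divide(2 * p * r, p + r)


# Hypothetical counts
tp, tn, fp, fn = 6, 10, 2, 2
print("Accuracy :", accuracy(tp, tn, fp, fn))
print("Recall   :", recall(tp, fn))
print("Precision:", precision(tp, fp))
print("F1 Score :", f1_score(tp, fp, fn))
```

Note that TN appears only in accuracy. Recall, precision and F1 say nothing about how well the negative class is handled.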

Am I good or Am I bad?

Some things can be seen directly from a confusion matrix, such as how the cases are distributed across the classes and the count in each cell. Ideally, we want balanced classes to train on. But what if the classes are not balanced?

For example, the confusion matrix below comes from an imbalanced data set.

Figure 3: Generated confusion matrix for an imbalanced data set
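The class distribution can be read off the matrix by summing the cells that belong to each actual class. The short sketch below uses the four counts from this example (TP = 5135, TN = 0, FP = 220, FN = 75); the variable names are only illustrative.

```python
tp, tn, fp, fn = 5135, 0, 220, 75

actual_positive = tp + fn  # every case that is really positive
actual_negative = tn + fp  # every case that is really negative
total = actual_positive + actual_negative

print(f"positive: {actual_positive} ({actual_positive / total:.1%})")
print(f"negative: {actual_negative} ({actual_negative / total:.1%})")
print(f"imbalance ratio: {actual_positive / actual_negative:.1f} to 1")
```

Roughly 96% of the cases belong to the positive class, so the two classes are far from balanced.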

Let’s do some calculations.

Accuracy = (5135 + 0) / (5135 + 0 + 220 + 75) = 94.57%

Recall = 5135 / (5135 + 75) = 0.986

Precision = 5135 / (5135 + 220) = 0.959

F1 Score = 2 * (0.986 * 0.959) / (0.986 + 0.959) = 0.972

At first glance, these numbers look pretty good. But hey! Did you notice that the model completely failed to identify one class? None of the 220 actual negatives was classified correctly (TN = 0). Because the data set has imbalanced classes, the model is only able to identify the majority class.
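One way to expose this failure is to compute recall separately for each class and to compare accuracy against a trivial baseline that always predicts the majority class. This is a sketch using the counts from the example; the recall of the negative class (often called specificity) and the majority baseline are not among the measures listed earlier and are added here only as a sanity check.

```python
tp, tn, fp, fn = 5135, 0, 220, 75
total = tp + tn + fp + fn

# Recall computed separately for each actual class
positive_recall = tp / (tp + fn)
negative_recall = tn / (tn + fp)  # often called specificity

# Accuracy of the model vs. always predicting the majority class
model_accuracy = (tp + tn) / total
majority_baseline = max(tp + fn, tn + fp) / total

print(f"recall (positive class): {positive_recall:.3f}")
print(f"recall (negative class): {negative_recall:.3f}")
print(f"model accuracy         : {model_accuracy:.4f}")
print(f"majority-class baseline: {majority_baseline:.4f}")
```

The negative class has a recall of zero, and a baseline that ignores the input entirely would score a higher accuracy than the model.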

You need to adapt the learning process to overcome this class imbalance problem. If we feed in the data as it is, we let machine learning algorithms reach the wrong conclusions.

Discussion

Do not get stuck on the values. You also need to understand the nature of the data. Sometimes it is hard to make all the classes equal in size, because some data is more costly to collect than other data. For example, collecting failure data from a machine costs more than collecting data during normal operation.

There are several things you can do to overcome this class imbalance problem:

  1. Undersampling
  2. Oversampling
  3. Generation of synthetic data
  4. Defining a cost function for learning
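As a rough illustration of options 1, 2 and 4, the sketch below uses only the Python standard library. Everything in it is an assumption made for the example: the helper names, the toy data set of 95 "normal" and 5 "failure" readings, and the inverse-frequency weighting, which is one common heuristic for class weights rather than the only choice. Synthetic data generation is left out because it needs more machinery.

```python
import random
from collections import Counter


def group_by_label(samples):
    """samples is a list of (features, label) pairs."""
    groups = {}
    for sample in samples:
        groups.setdefault(sample[1], []).append(sample)
    return groups


def undersample(samples, seed=0):
    """Randomly drop samples until every class matches the smallest one."""
    rng = random.Random(seed)
    groups = group_by_label(samples)
    target = min(len(group) for group in groups.values())
    result = []
    for group in groups.values():
        result.extend(rng.sample(group, target))
    rng.shuffle(result)
    return result


def oversample(samples, seed=0):
    """Randomly repeat samples until every class matches the largest one."""
    rng = random.Random(seed)
    groups = group_by_label(samples)
    target = max(len(group) for group in groups.values())
    result = []
    for group in groups.values():
        result.extend(group)
        result.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(result)
    return result


def inverse_frequency_weights(samples):
    """Per-class weights for a cost function: rarer classes weigh more."""
    counts = Counter(label for _, label in samples)
    total = sum(counts.values())
    return {label: total / (len(counts) * n) for label, n in counts.items()}


# Hypothetical machine readings: failures are rare
samples = [((i,), "normal") for i in range(95)]
samples += [((i,), "failure") for i in range(5)]

print(Counter(label for _, label in undersample(samples)))
print(Counter(label for _, label in oversample(samples)))
print(inverse_frequency_weights(samples))
```

Undersampling throws away majority data, while oversampling repeats minority samples, so each comes with its own trade-off.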

I hope to discuss these in future posts.

Roshan Alwis
Tech Vision

Software Engineer at Sysco Labs. (Computer Science & Engineering Graduand at University of Moratuwa)