
Understanding large confusion matrices.

Andrew Ozbun
Published in CodeX · 6 min read · Mar 25, 2021


The confusion matrix is a quintessential part of our work as data scientists. It is our bread and butter: a way of visualizing the performance of a model. Interpreting one remains relatively simple for two classes, but as the matrix balloons, the calculations can become muddy.

The technical terminology reviewed in this article includes true positives, true negatives, false positives, and false negatives, which in turn yield the true and false positive rates as well as the true and false negative rates. From these we can also evaluate metrics like accuracy, precision, recall, and F1 score.

We are going to hit three main points to succinctly describe how to understand larger confusion matrices when tackling multi-class issues.

  1. How is a confusion matrix constructed?
  2. What working metrics do we get from a confusion matrix?
  3. How do we apply this on a larger scale for multi-class problems?

How is a confusion matrix constructed?

A confusion matrix in its most simple form is created with the outcome of only two classes. For the purposes of demonstration I am going to use a project I did on bank loans, where the bank was trying to determine one thing: will a person accept or reject a personal loan? This means that we classify people into two groups: Accepted and Rejected. To analyze the accuracy of our predictions, we build the confusion matrix out of the four possible results depicted below:

Outlined above is the confusion matrix for the bank loan project and how we are reading it.
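
To make this concrete, here is a minimal sketch of how such a matrix can be produced with scikit-learn. The y_true and y_pred arrays are hypothetical stand-ins, not the actual labels and predictions from the bank loan project.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration: 1 = accepted the loan, 0 = rejected it.
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]  # what actually happened
y_pred = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]  # what the model predicted

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
# [[5 1]
#  [1 3]]
```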

True & False Positives and True & False Negatives.

The confusion matrix is essentially a visual representation of how a total population splits across outcomes. A probability tree, similar to one you might draw for a coin toss in a beginning stats class, is depicted below. You can think of this as two groups with two categories creating four outcomes:

True Positive (TP): The true positives are the number of people the machine accurately predicted to have a positive outcome. In our case, that was the number of people who the computer thought would accept a bank loan and did.

True Negative (TN): The true negatives are the number of people the machine accurately predicted to have a negative outcome. In our case, it was the number of people who the computer thought would NOT accept the bank loan and actually did not.

False Positive (FP): The false positives are the number of people the machine INACCURATELY predicted to have a positive outcome. In other words, the number of people the machine thought would accept the bank's offer but who actually rejected it. Also known as a Type I error.

False Negative (FN): The false negatives are the number of people the machine INACCURATELY predicted to have a negative outcome. In our example, the number of people the machine thought would reject the personal loan but who in real life wanted to accept it. Also known as a Type II error.

A note on Type I and Type II errors: in the case of whether the bank should send out a pamphlet that costs less than $1 to mail, the stakes are very low. When we are talking about a medical diagnosis, however, the cost of these errors becomes dramatically more important.
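
A quick sketch of how those four counts fall out of the hypothetical matrix built above: scikit-learn orders a 2x2 confusion matrix so that ravel() returns them directly.

```python
# For a binary confusion matrix from scikit-learn (labels=[0, 1]),
# ravel() unpacks the counts as tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=5, FP=1, FN=1
```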

How does this provide us with usable metrics?

The rest is basic algebra from here, plugging the numbers from the confusion matrix into a few formulas. The true positive, true negative, false positive, and false negative counts give us four basic rate calculations (seen below):
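
Here is a minimal sketch of those four rates in code, using the hypothetical tp, tn, fp, fn counts unpacked above.

```python
# Each rate expresses a count as a fraction of the relevant actual class.
tpr = tp / (tp + fn)  # true positive rate (sensitivity)
tnr = tn / (tn + fp)  # true negative rate (specificity)
fpr = fp / (fp + tn)  # false positive rate
fnr = fn / (fn + tp)  # false negative rate
```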

Using these rates we can quantify, as a percentage (or on a scale of 0 to 1), how frequently each possible scenario occurs.

Precision, Recall, Accuracy and F1 Scores.

With an understanding of these rates, we can derive the four main metrics from the confusion matrix: precision, recall, accuracy, and F1 score.

Precision: Precision quantifies the proportion of positive class predictions that actually belong to the positive class.

Recall: Recall, which is the same as the true positive rate, quantifies the proportion of all positive examples in the dataset that were correctly predicted as positive.

Accuracy: Accuracy is the total number of correct predictions divided by the total number of predictions made for a dataset.

F1 Score: This measurement provides a single score that balances both the concerns of precision and recall in one number.

When we define these quantitatively it looks like this:
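
As a minimal sketch in code, using the same hypothetical counts as before:

```python
precision = tp / (tp + fp)
recall    = tp / (tp + fn)                      # identical to the true positive rate
accuracy  = (tp + tn) / (tp + tn + fp + fn)
f1        = 2 * (precision * recall) / (precision + recall)
```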

How is this applied to a larger scale?

Understanding the confusion matrix for the bank loan project was a simple feat. When I was posed with a larger classification problem, however, it became quite a headache.

For a project during bootcamp, I attempted to classify 10 species of monkeys using image recognition with neural networks. The resulting confusion matrix, in numeric-only format, is below, and it is confusing.

Please note that the accuracy of this project was very low; for more information, please visit my GitHub.

I eventually made an example confusion matrix that used color coding so that I could understand which values were being assigned to which metrics. Simply put:

The true positive (green) is the cell that lies at the intersection of the same species on the grid. These are the images the computer thought were a Patas Monkey and actually were.

The false positives (blue) are the sum of the cells in the predicted-value column for the relevant species, minus the true positive cell. These are the images the computer thought were a Patas Monkey but were actually another species.

The false negatives (red) are the sum of the cells in the actual-value row for the relevant species, minus the true positive cell. These are the images the computer did NOT think were a Patas Monkey but actually were.

The true negatives (yellow) are the sum of all cells that lie in neither the row nor the column of the relevant species. These are the images the computer did NOT think were a Patas Monkey and in real life were not.

Shown above is the confusion matrix explanation for the Patas Monkey in a classification for 10 species of Monkey.
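
That cell-by-cell logic translates directly into a few lines of NumPy. The sketch below assumes the usual convention of rows as actual classes and columns as predicted classes, and uses a small hypothetical 3x3 matrix rather than the monkey-project matrix.

```python
import numpy as np

# Hypothetical 3x3 confusion matrix (rows = actual, columns = predicted).
cm = np.array([[50,  3,  2],
               [ 4, 45,  6],
               [ 1,  7, 47]])

for i in range(cm.shape[0]):
    tp = cm[i, i]                 # green: the class's cell on the diagonal
    fp = cm[:, i].sum() - tp      # blue: rest of the predicted column
    fn = cm[i, :].sum() - tp      # red: rest of the actual row
    tn = cm.sum() - tp - fp - fn  # yellow: everything outside that row and column
    print(f"class {i}: TP={tp}, FP={fp}, FN={fn}, TN={tn}")
```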

For this specific example, the basic addition has been done below from the numeric confusion matrix depicted earlier.

With these basic calculations, one can go on to compute the frequency of each occurrence, plus precision, accuracy, recall, and F1 score for each class. Once you understand how those four counts are derived, calculating and interpreting results becomes drastically simpler.
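
In practice, scikit-learn will do this per-class bookkeeping for you. The snippet below is a sketch with hypothetical labels, not the monkey-project data: classification_report prints per-class precision, recall, and F1 alongside overall accuracy.

```python
from sklearn.metrics import classification_report

# Hypothetical actual and predicted class labels, one entry per image.
y_true = [0, 1, 2, 2, 1, 0, 2, 1, 0, 2]
y_pred = [0, 1, 2, 1, 1, 0, 2, 0, 0, 2]

print(classification_report(y_true, y_pred))
```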

Conclusion.

The thought of diving into multi-class classification is always so exciting. It is a step beyond the simplicity of binary classification, letting your mind race with possible questions and hypotheses. The pitfall of complexity, however, is complexity itself: with the excitement of exploring infinite classes come infinite complications. Taking the time to slow down and piece together a multi-class confusion matrix by hand can offer a deeper and more thoughtful understanding of your results as well as your process.

Resources:

https://towardsdatascience.com/performance-metrics-confusion-matrix-precision-recall-and-f1-score-a8fe076a2262

https://machinelearningmastery.com/precision-recall-and-f-measure-for-imbalanced-classification/
