Statistics of Confusion Matrix

Nigar Esra Kın
Machine Learning Turkiye
6 min read · Feb 26, 2022

In this article, I want to share some basic information about Supervised Learning, Logistic Regression, and the Confusion Matrix.

First, let me explain what Supervised Learning is.

The concept of supervised learning centers on labeled training data.


How does it work?

  • During the training phase, the system is fed with labeled datasets that tell the system what the output is for each particular input value.
  • The trained model is then presented with test data: This is labeled data, but the labels are not exposed to the algorithm.
  • The purpose of the test data is to measure how accurately the algorithm will perform on unlabeled data.
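As a quick sketch of this workflow in Python, assuming scikit-learn and a small synthetic dataset (both are illustrative choices, not part of the original article):

```python
# Minimal supervised-learning workflow: train on labeled data, evaluate on held-out labels.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Labeled data: X holds the inputs, y holds the known labels.
X, y = make_classification(n_samples=500, n_features=5, random_state=42)

# Hold out labeled test data; its labels are hidden from the model during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)  # training phase: the model sees inputs and their labels

# Test phase: predict on unseen inputs, then compare against the held-out labels.
print(f"Test accuracy: {model.score(X_test, y_test):.2f}")
```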

There are many machine learning methods under supervised learning, but in this article I'll focus on Logistic Regression.

What is Logistic Regression?

When the target is categorical, Logistic Regression is a powerful supervised ML algorithm used for binary classification problems. The most common form of Logistic Regression models a binary outcome, but it can also handle scenarios with more than two possible discrete outcomes; that variant is called Multinomial Logistic Regression.

It produces probability values that lie between 0 and 1. The model's linear part is the log-odds (logit):

y = ln(P / (1 − P))

Solving this for P expresses Logistic Regression with the Sigmoid Function: P = 1 / (1 + e^(−y)).


We have to set a threshold to decide which class an observation belongs to; that's why we need the Sigmoid Function. If the predicted probability is below the threshold, the observation is assigned class 0; if it's above the threshold, it's assigned class 1.
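Here is a small sketch of the sigmoid-plus-threshold idea; the scores in z are made-up values, and 0.5 is just the usual default threshold:

```python
import numpy as np

def sigmoid(z):
    """Map a linear score z = w.x + b to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative scores from a linear model (assumed values, for demonstration only).
z = np.array([-2.0, -0.3, 0.1, 1.8])
probabilities = sigmoid(z)

# Apply a 0.5 threshold: below -> class 0, at or above -> class 1.
threshold = 0.5
predicted_class = (probabilities >= threshold).astype(int)

print(probabilities)    # approx. [0.12 0.43 0.52 0.86]
print(predicted_class)  # [0 0 1 1]
```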

Assumptions of Logistic Regression

  • The dependent variable must be categorical in nature.
  • The independent variables shouldn't exhibit multicollinearity.

So, What is Multicollinearity?

Multicollinearity exists when two or more independent variables are highly correlated with each other, which makes the affected variables unreliable as predictors in a regression equation.

As a result, the confidence interval of the coefficient in question becomes very wide, and it becomes difficult to reject the H0 hypothesis for that coefficient.


For example, if predictors x, y, and z are strongly correlated with one another, multicollinearity exists among them.

How Do We Test for It?

It's possible to test for it with the VIF (Variance Inflation Factor), where VIF = 1 / (1 − R²) for each independent variable:

  • The dependent variable is set aside for a moment.
  • Each independent variable is regressed on all the other independent variables.
  • If the variables are truly independent of each other, R² will be very small and the VIF will lean towards 1.
  • The closer the VIF is to 1, the more ideal the scenario for predictive modeling.
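A minimal sketch of this VIF check, assuming statsmodels and a small synthetic dataset with deliberately correlated columns (the names x1, x2, x3 are illustrative):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=200)  # highly correlated with x1
x3 = rng.normal(size=200)                         # independent of the others
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# Each independent variable is regressed on all the others; VIF_i = 1 / (1 - R_i^2).
X_const = add_constant(X)
for i, col in enumerate(X_const.columns):
    if col == "const":
        continue
    print(col, round(variance_inflation_factor(X_const.values, i), 2))
# x1 and x2 come out with large VIFs; x3 stays close to 1.
```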

What Should be Done to Prevent Multicollinearity?

  • Highly correlated independent variables can be dropped from the dataset (see the sketch after this list). This requires domain knowledge.
  • PCA (Principal Component Analysis) can be used.
  • PLS (Partial Least Squares) can be used.
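For the PCA option, here is a minimal sketch with scikit-learn; the synthetic correlated pair and keeping both components are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, 0.95 * x1 + rng.normal(scale=0.1, size=200)])  # correlated pair

# Standardize, then rotate onto orthogonal principal components.
X_pca = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

# The components are uncorrelated by construction, so multicollinearity disappears.
print(np.corrcoef(X_pca.T).round(3))
```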

Confusion Matrix

A confusion matrix is a summary of prediction results on a classification problem. The number of correct and incorrect predictions are summarized with count values and broken down by each class.


Function Parameters:

  • y_true (the actual values)
  • y_pred (the predicted values returned by the classifier)
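A minimal sketch of such a call, assuming scikit-learn's confusion_matrix; the y_true and y_pred vectors are made up for illustration:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual values
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # values returned by the classifier

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[3 1]   rows = actual classes (0, 1)
#  [1 3]]  columns = predicted classes (0, 1)
```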

Where Do True, False, Positive, and Negative Come From?

  • The class we want to predict is labeled Positive; the other class is labeled Negative.
  • Correct predictions are labeled True; wrong predictions are labeled False.

That is how TP, TN, FP, and FN arise.

We then create a confusion matrix whose size matches the classification task at hand: 2x2 for binary classification, 5x5 for five classes.

The sum of the values in the confusion matrix should equal the number of samples in the dataset, which gives us a quick correctness check.
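A quick sketch of those size and sum checks, again with scikit-learn; the three class names and label vectors are made up for illustration:

```python
from sklearn.metrics import confusion_matrix

y_true = ["cat", "dog", "bird", "cat", "dog", "bird", "cat"]
y_pred = ["cat", "dog", "cat", "cat", "bird", "bird", "dog"]

cm = confusion_matrix(y_true, y_pred, labels=["bird", "cat", "dog"])
print(cm.shape)                 # (3, 3): one row and one column per class
print(cm.sum() == len(y_true))  # True: the entries add up to the number of samples
```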

Relationship of the Confusion Matrix with Type I and Type II Errors

In statistics,

The H0 (Null Hypothesis) is established by assuming that there is no difference between groups or no relationship between variables in the population.

The H1 (Alternative Hypothesis) is established with the assumption that there is a difference between the groups or a real relationship between the variables.

As a result of a hypothesis test, there are four possible outcomes depending on the sample statistics.

  • H0 is actually true and not rejected (correct decision).
  • H0 is actually true and rejected (Type I error).
  • H0 is actually false and not rejected (Type II error).
  • H0 is actually false and rejected (correct decision).

Type I Error -> False Positive Rate -> α, the significance level

Type II Error -> False Negative Rate -> β (the power of the test is 1 − β)

  • If your results fall in the critical region of the null distribution, they are considered statistically significant and the null hypothesis is rejected. If the null hypothesis is actually true, however, this is a false positive conclusion (a Type I error).
  • Under the alternative distribution, the area beyond the critical value represents statistical power, which is 1 − β; the remaining area is β, the Type II error rate.
  • Increasing the statistical power of your test directly decreases the risk of making a Type II error.

It may be necessary to test on more than one sample to avoid a Type I Error, and to increase the statistical power of the test to avoid a Type II Error.
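As a small sketch of how these error rates fall out of the confusion-matrix counts (the counts below are made up for illustration):

```python
# Hypothetical counts from a confusion matrix.
tp, tn, fp, fn = 40, 45, 5, 10

false_positive_rate = fp / (fp + tn)  # Type I error rate (the alpha analogue)
false_negative_rate = fn / (fn + tp)  # Type II error rate (the beta analogue)

print(false_positive_rate)  # 0.1
print(false_negative_rate)  # 0.2
```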

Now that the groundwork is complete, let's get started reviewing the metrics of the Confusion Matrix!

Accuracy


Classification accuracy is the ratio of correct predictions to total predictions made: Accuracy = (TP + TN) / (TP + TN + FP + FN).

Accuracy gives us the answer to the question: How often is the classifier correct?
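A minimal sketch of that computation with scikit-learn's accuracy_score; the label vectors are made up for illustration:

```python
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))  # 6 correct out of 8 -> 0.75
```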

It's possible to encounter many different cases in classification problems, and for this reason it's not enough to examine Accuracy alone.

So what should we do in this case?

I'm referring to the imbalanced data problem. It occurs when one class overwhelms another, that is, when the target class has an uneven distribution of observations in the dataset.

In such situations, it's useful to examine other metrics.

  • Precision: Of all the observations we predicted as positive, how many are actually positive? Precision = TP / (TP + FP)
  • Recall: Of all the actually positive observations, how many did we predict as positive? Recall = TP / (TP + FN)

Precision and recall are inversely related, which is why it's important to strike a balance between the two (the Precision-Recall Trade-Off).


To achieve this balance, the F1 Score comes into play.


The F1 Score is the harmonic mean of Recall and Precision: F1 = 2 × (Precision × Recall) / (Precision + Recall). When the F1 Score is at its maximum, we can say the two are in balance.
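A small sketch of these metrics on a deliberately imbalanced, made-up example, which also shows why accuracy alone can mislead:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 9 negatives and only 1 positive; the classifier just predicts the majority class.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(accuracy_score(y_true, y_pred))                    # 0.9, looks great...
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(recall_score(y_true, y_pred))                      # 0.0: the positive class is never found
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0
```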

Thanks for taking the time to read!
