Confusion Matrix Part 2

Mukesh Chaudhary
3 min read · May 15, 2020


Part 2: Evaluating performance via the confusion matrix for imbalanced data and multiclass classifiers

Before reading this blog, I strongly recommend reading Part 1 of the confusion matrix series, where I explained the basic terms and measures of the confusion matrix. Here, I will try to explain how to evaluate the performance of a machine learning model on imbalanced data, because if we check only accuracy, it will often say 95% or more. With imbalanced data, where one class has 98% of the rows and the other has 2%, a model that predicts the majority class every time still shows 98% accuracy, and that number is technically true. When we focus only on accuracy, we get a misleading evaluation of the model's performance. So how do we handle this problem? We will see below with an example.

In the second section, we will try to understand how to interpret multiclass classification results via the confusion matrix. Let's start.

Here, I take the imbalanced credit card fraud detection dataset from Kaggle, purely as an example. Before going further, let's look at the libraries we will use.

# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df_original = pd.read_csv("creditcard.csv")
df = df_original.copy()

# check how balanced the Class column is
print("\n0: No Fraud | 1: Fraud ")
print("************************\n")
print(df['Class'].value_counts())
df['Class'].value_counts().plot(kind='bar')
plt.title(" Class Distribution \n 0: No Fraud | 1: Fraud")

Output:
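(For the Kaggle credit card dataset, value_counts should show roughly 284,315 rows of class 0 and 492 of class 1, which the bar plot makes visually stark.)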

In the output above, we can clearly see how imbalanced the fraud detection dataset is: almost 99.82% legitimate transactions and only 0.18% fraudulent ones. On data like this, a model will happily show 99% accuracy. What does that mean? Is it genuinely good accuracy, or something else?

As a rough rule of thumb, I would say that a dataset with a 90:10 class ratio or worse is imbalanced. Another way to spot it is the confusion matrix itself. Let's figure out how to recognize imbalanced data via the confusion matrix. Example of an imbalanced confusion matrix:

Accuracy:
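Recall from Part 1 that accuracy = (TP + TN) / (TP + TN + FP + FN). To make the trap concrete, here is a minimal sketch with made-up numbers (not the ones from the matrix above): a model that always predicts the majority class on a 99.8 : 0.2 split.

# hypothetical sketch: the accuracy trap on imbalanced labels
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 9980 + [1] * 20)   # 99.8% class 0, 0.2% class 1
y_pred = np.zeros_like(y_true)             # model always predicts class 0

print(accuracy_score(y_true, y_pred))      # 0.998, looks excellent
print(recall_score(y_true, y_pred))        # 0.0, catches zero frauds

Accuracy looks excellent while recall on the minority class is zero; that zero is exactly what the confusion matrix exposes and plain accuracy hides.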

Above, we can clearly see how the model predicts very few examples of one class; it is completely biased toward the majority. So we can diagnose imbalanced data via the confusion matrix as well as through data visualization. One approach among many to solve this problem is the SMOTE class from the imblearn.over_sampling module.

from imblearn.over_sampling import SMOTE
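Here is a minimal sketch of how SMOTE could be applied to the df loaded above (the split parameters and random_state are illustrative assumptions; fit_resample is the current imblearn API):

import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# split first, so synthetic samples never leak into the test set
X = df.drop('Class', axis=1)
y = df['Class']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# oversample only the training portion
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)
print(pd.Series(y_res).value_counts())   # both classes now have equal counts

After resampling, both classes contribute the same number of training rows, so the model can no longer score well by always predicting the majority class.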

Let's move on to the other topic, multiclass classification, and look at a first example:

Here, I just want to say that if the target feature has multiple classes, we have to focus on the diagonal values of the confusion matrix, which are the correct predictions. The rest of the numbers are misclassifications and count as false negatives or false positives, depending on which class we are evaluating.
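As a concrete illustration (the numbers are hypothetical), here is how the per-class errors fall out of a 3-class matrix:

import numpy as np

# hypothetical 3-class confusion matrix: rows = actual, columns = predicted
cm = np.array([[50,  3,  2],
               [ 4, 45,  6],
               [ 1,  2, 52]])

correct = np.diag(cm)            # diagonal: correct predictions per class
fp = cm.sum(axis=0) - correct    # column total minus diagonal: false positives
fn = cm.sum(axis=1) - correct    # row total minus diagonal: false negatives
print("Correct:", correct)
print("False Positives:", fp)
print("False Negatives:", fn)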

Let's see another example with Python code, using the iris dataset from sklearn.

# load a multiclass classification dataset
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier
import seaborn as sns

data = datasets.load_iris()
print(data.keys())
X = data.data
y = data.target

# fit on the full data and score on that same data (training accuracy)
xg = XGBClassifier()
xg.fit(X, y)
xg.score(X, y)
pred_iris = xg.predict(X)

print("Confusion Matrix:\n")
print(confusion_matrix(y, pred_iris))
sns.heatmap(confusion_matrix(y, pred_iris), annot=True)

Output:
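(With the perfect fit reported below and iris's 50 samples per class, the printed matrix is purely diagonal: 50, 50 and 50 on the diagonal with zeros elsewhere, and the heatmap shows a single bright diagonal band.)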

Above, the dataset is small and the model is predicting on the same data it was trained on, so it scores 100% accuracy and the matrix has no off-diagonal entries. Still, the point I am trying to make is that whenever we have a multiclass target, we should always look at the diagonal of the confusion matrix first.
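If you want per-class numbers instead of eyeballing the diagonal, sklearn's classification_report computes precision, recall and F1 for each class from the same predictions (reusing y and pred_iris from above):

from sklearn.metrics import classification_report

# per-class precision, recall and F1 derived from the confusion matrix
print(classification_report(y, pred_iris, target_names=data.target_names))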
