Confusion Matrix Part 2
Part 2: Evaluation via Confusion Matrix for Imbalanced Data and Multiclass Classifiers
Before reading this blog, I strongly recommend reading the Part 1 blog on the confusion matrix, where I explained some basic terms and measures. Here, I will try to explain how to evaluate the performance of a machine learning model when the data is imbalanced. If we only check accuracy on such data, it will almost always say 95% or more. For example, when one class has 98% of the data and the other has 2%, a model that predicts the majority class every time still shows 98% accuracy. That number is true, but it is misleading, because the model never finds the minority class. When we focus only on accuracy, we get a bad evaluation of the model's performance. So how do we handle this problem? We will see below with an example.
In the second section, we will try to understand how to interpret multiclass classification results via the confusion matrix. Let’s start.
Here, I take the imbalanced credit card fraud detection data from Kaggle, as an example only. Before going further, let’s look at the libraries we use.
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df_original = pd.read_csv("creditcard.csv")
df = df_original.copy()

# check the balance of Class
print("\n0:No Fraud | 1: Fraud ")
print("************************\n")
print(df['Class'].value_counts())

# bar chart of the class counts
df['Class'].value_counts().plot(kind='bar')
plt.title(" Class Distribution \n 0:No Fraud | 1: Fraud")
plt.show()
Output:
In the example above, we can clearly see how imbalanced the fraud detection data is: almost 99.82% of the transactions are legitimate and only 0.18% are fraud. In this case, a machine learning model will nearly always show 99% accuracy. What does that mean? Is the model close to perfect because of the 99%, or is it something else?
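To make the question concrete, here is a minimal sketch in plain Python. It assumes hypothetical toy data (10,000 transactions with 18 frauds, matching the 0.18% ratio) and a "model" that always predicts no fraud. These numbers are made up for illustration and are not taken from the Kaggle file.

```python
# Hypothetical toy data: 10,000 transactions, 18 of them fraud (0.18%).
y_true = [1] * 18 + [0] * 9982

# A "model" that always predicts the majority class (0 = no fraud).
y_pred = [0] * len(y_true)

# Count the four cells of the binary confusion matrix.
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # frauds caught
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate, predicted legitimate
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate, flagged as fraud
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # frauds missed

accuracy = (tp + tn) / len(y_true)
# Recall for the fraud class: share of real frauds that were caught.
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy     = {accuracy:.4f}")
print(f"fraud recall = {recall:.4f}")
```

The accuracy looks excellent while the recall for the fraud class is zero, so the high accuracy only reflects the class ratio.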
As a rough rule, I would say that when a dataset has a class ratio of about 90:10, we can call it imbalanced data. Another way to spot it is the confusion matrix. Let’s figure out how to recognize imbalanced data via the confusion matrix.
Example of an imbalanced confusion matrix:
Accuracy :
Above, we can clearly see that the model predicts one class only a very small number of times. It is totally biased toward the majority class. So we can recognize imbalanced data via the confusion matrix as well as via data visualization. One of many approaches to this problem is the SMOTE class from the imblearn.over_sampling module, which oversamples the minority class by creating synthetic samples.
from imblearn.over_sampling import SMOTE
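The library class is normally used through its `fit_resample(X, y)` method. To show the idea behind it, here is a simplified plain-Python sketch of SMOTE-style interpolation between minority samples. The function name `smote_like_oversample`, its parameters, and the sample points are hypothetical, and this is not the imblearn implementation.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical 2-D minority-class points (made up for illustration)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.8)]
new_points = smote_like_oversample(minority, n_new=5)

print("minority samples before:", len(minority))
print("minority samples after: ", len(minority) + len(new_points))
```

Each synthetic point lies on the line segment between two real minority points, so the new samples stay inside the region the minority class already occupies.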
Let’s move on to another topic, multiclass classification, and look at a first example:
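Here, I just want to say that if the target feature has multiple classes, we have to focus on the diagonal values of the confusion matrix, which count the correct predictions. The rest of the numbers are misclassifications: each off-diagonal value is a false negative for its true class and a false positive for its predicted class.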
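As a small illustration of reading the diagonal, here is a sketch in plain Python. The 3x3 matrix is hypothetical (made-up counts, not the example shown above) and it assumes the sklearn convention where rows are true classes and columns are predicted classes.

```python
# Hypothetical 3-class confusion matrix.
# Rows = true class, columns = predicted class (sklearn convention).
cm = [
    [50, 2, 0],
    [3, 45, 4],
    [0, 5, 41],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n_classes))  # sum of the diagonal
print(f"overall accuracy = {correct / total:.3f}")

per_class = []
for i in range(n_classes):
    tp = cm[i][i]                                      # diagonal cell
    fn = sum(cm[i]) - tp                               # rest of row i
    fp = sum(cm[r][i] for r in range(n_classes)) - tp  # rest of column i
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    per_class.append((tp, fp, fn, precision, recall))
    print(f"class {i}: TP={tp} FP={fp} FN={fn} "
          f"precision={precision:.3f} recall={recall:.3f}")
```

For each class, the rest of its row gives the false negatives and the rest of its column gives the false positives.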
Let’s see another example, with Python code, using the iris dataset from sklearn.
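# load a multiclass classification dataset
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier
import seaborn as sns

data = datasets.load_iris()
print(data.keys())
X = data.data
y = data.target

# fit the classifier and score it on the same data
xg = XGBClassifier()
xg.fit(X, y)
xg.score(X, y)
pred_iris = xg.predict(X)

# confusion matrix as text and as a heatmap
print("Confusion Matrix:\n")
print(confusion_matrix(y, pred_iris))
sns.heatmap(confusion_matrix(y, pred_iris), annot=True)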
Output:
Above, the dataset is small and the model is scored on the same data it was trained on, so it shows 100% accuracy. However, what I am trying to say is that whenever we have a multiclass target, we should always look at the diagonal values first.
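The code above scores the model on the same rows it was fitted on. Here is a sketch of the same check on a held-out split. It uses scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier` so that only scikit-learn is needed, and the 70/30 split and `random_state` values are my assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Keep 30% of the rows aside; stratify so every class appears in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Stand-in classifier (assumption: swapped in for XGBClassifier).
model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate only on rows the model has not seen.
pred_test = model.predict(X_test)
cm = confusion_matrix(y_test, pred_test)

print("Confusion Matrix (held-out data):\n")
print(cm)
print("held-out accuracy:", cm.trace() / cm.sum())
```

Any off-diagonal values that appear here are errors on unseen data, which gives a more honest picture than the training score.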