The Accuracy Paradox: how not to fall into the trap of an imbalanced dataset.

Vitalii Panchyk
Published in Code the Dice
7 min read · Nov 4, 2020

It’s no secret that when working with data in machine learning (and not only there), a significant part of the time is spent studying the dataset, cleaning the data, and engineering features, while only a relatively small part of the effort goes into building, evaluating, and tuning our models.

This distribution of time is quite reasonable, since the accuracy and reliability of our results directly depend on the quality and convenience of our dataset.

But sometimes, due to lack of time or sheer laziness, data scientists skip these important steps, hoping that smart classification or regression algorithms will figure out the data themselves. At this point, there is a pretty high probability of getting a low-quality or even useless model and, as a result, unreliable results.

In this article I will give an example of one of these unpleasant situations (the Accuracy Paradox) and show you how to deal with an imbalanced dataset in order to maintain the quality of the model.

I used the dataset from the “Health Insurance Cross Sell Prediction” task on Kaggle. In short, our task is to build a model to predict whether health insurance policyholders from the past year will also be interested in the Vehicle Insurance provided by the company.

Feel free to access the source code for this project on my GitHub. In the README.txt you will find a short description of all the files.

Part 1: Accuracy Paradox & Imbalanced dataset

To start, let’s import all the packages we need and load our data.

import pandas as pd
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
import seaborn as sns

# models
from sklearn.linear_model import LinearRegression, LogisticRegression, RidgeClassifier, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# loading data
data = pd.read_csv("6.Train.csv")

Now let’s take a look at our data:
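If you want to reproduce that overview yourself, a minimal sketch using only the pandas calls we already imported:

# first rows, column dtypes and missing-value counts of the raw frame
print(data.head())
data.info()
print(data.isnull().sum())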

It seems that we can drop the “id” feature. We also need to turn the “Gender”, “Vehicle_Age” and “Vehicle_Damage” columns into numeric data. Let’s make these changes and look at the data again:

# feature engineering
data = data.drop("id", axis=1)
# encode Gender as 0/1 and drop the original text column
data["Genders"] = data.Gender.apply(lambda x: 0 if x == "Male" else 1)
data = data.drop("Gender", axis=1)
# map the Vehicle_Age categories and the Vehicle_Damage flags to numbers
data.replace({"< 1 Year": 0, "1-2 Year": 1, "> 2 Years": 2}, inplace=True)
data.Vehicle_Damage.replace({"Yes": 1, "No": 0}, inplace=True)
data.drop_duplicates(inplace=True)

Well it looks better, doesn’t it?

Since we are too lazy to take a deeper look at our dataset, we move on to preparing the training and validation datasets:

features = data.drop("Response",axis=1)
targets = data["Response"]

#splitting data
features_train, features_val, targets_train, targets_val = train_test_split(features, targets, test_size=0.2, random_state=12)

Finally, we can build our first classification model. Let’s start with Logistic Regression and check its accuracy score:

#building and evaluating model
LogReg = LogisticRegression()
LogReg.fit(features_train, targets_train)
acc_log_reg_train = round(LogReg.score(features_train, targets_train) * 100, 2)
acc_log_reg_val = round(LogReg.score(features_val, targets_val) * 100, 2)
print("LogReg accuracy score train " + str(acc_log_reg_train))
print("LogReg accuracy score val " + str(acc_log_reg_val))

Wow! 88%. Not a bad result, considering that we did almost nothing to improve our initial dataset. It seems that our model is close to perfect. But let’s use one more advanced model evaluation metric called the “F1 score” (if your first association is racing, you probably need to read more about this metric before continuing), where 0 is the worst result and 1 is the best one:

guesses = LogReg.predict(features_val)
f1_score = metrics.f1_score(targets_val, guesses)
print("LogReg f1 " + str(f1_score))

But how? Such high accuracy and such a terrible F1 score. Let’s compare our predictions and the true labels using .value_counts():
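The comparison looks roughly like this (the exact numbers depend on your split):

# distribution of predicted labels vs. distribution of true labels
print(pd.Series(guesses).value_counts())
print(targets_val.value_counts())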

It looks really crazy: our model hardly ever recognizes class 1 (i.e. the potential buyers). Why? I think you can already guess what the problem is.

Try to look at the initial dataset one more time, namely at our target column (“Response”).
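One line is enough to see the share of each class (the countplot just makes the same picture visual):

# share of positive and negative responses in the full dataset
print(data.Response.value_counts(normalize=True))
sns.countplot(data.Response)
plt.show()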

So here we see that the dataset provided on Kaggle has only 14% positive observations (people who actually bought the insurance); the rest are negative. In other words, we are dealing with an imbalanced dataset.

Therefore, our model cannot identify the positive class well. It simply doesn’t have enough such observations in the training set to generalize from.

And since the majority of observations in both the training and validation sets are negative, and our model almost always predicts the negative class, the accuracy score is high: a model that simply answered “0” for everyone would already score close to the 88% we saw. However, the predictive power of the model is close to zero.

Congratulations! We have just fallen into the “Accuracy Paradox” trap.

On the other hand, I have good news for you as well: there are several methods to fix the problem, and it’s not going to take a lot of your time.

In this article I will cover two simple methods for dealing with imbalanced datasets: undersampling and oversampling.

Part 2: Undersampling

Simply put, undersampling is the reduction of the number of observations of the dominant class so that it approaches the number of minority class observations.

We already know that our dominating class is “0”, with 334,155 observations, and the minority class is “1”, with 46,685 observations. A huge difference. Now we need to reduce the number of majority class observations, and not necessarily all the way down to 46,685: a class ratio of 60/40 can already be enough.

I played with the numbers a little and found that 63,000 is a good number of negative-class observations to keep in our dataset. Now let’s do the undersampling and check what we have in the modified dataset:

# undersampling: keep all positive rows, but only 63,000 negative rows
positive_data = data[data["Response"] == 1]
negative_data = data[data["Response"] == 0]
short_negative_data = negative_data.iloc[:63000]
prepared_data = pd.concat([positive_data, short_negative_data])
print(prepared_data.Response.value_counts())
sns.countplot(prepared_data.Response)
plt.show()
Modified dataset

So now the ratio of classes is roughly 43/57, which means our dataset is balanced enough. We can move on to splitting it into training and validation sets and finally building our model:

features = prepared_data.drop("Response",axis=1)
target = prepared_data["Response"]
#train_test_split
features_train, features_val, targets_train, targets_val = train_test_split(features, target, test_size=0.2, random_state=12)
#building and evaluating models
LogReg = LogisticRegression()
LogReg.fit(features_train, targets_train)
acc_log_reg_train = round(LogReg.score(features_train, targets_train) * 100, 2)
acc_log_reg_val = round(LogReg.score(features_val, targets_val) * 100, 2)
print("LogReg accuracy score train " + str(acc_log_reg_train))
print("LogReg accuracy score val " + str(acc_log_reg_val))
guesses = LogReg.predict(features_val)
f1_score = metrics.f1_score(targets_val, guesses)
print("LogReg f1 " + str(f1_score))

So, the accuracy scores of the updated model are not as high as before, but the F1 score clearly shows that this model is far better and can actually be used for predictions.

The evaluation metrics are not very high because we have reduced the total number of observations. However, you can try some additional feature engineering to improve the results.
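One more note on the undersampling step above: slicing off the first 63,000 negative rows works for this dataset, but if the rows carry any ordering, a random sample is the safer choice. A sketch of the same step with DataFrame.sample (the n and random_state values are simply the ones used earlier, nothing special):

# random undersampling instead of slicing off the first rows
short_negative_data = negative_data.sample(n=63000, random_state=12)
prepared_data = pd.concat([positive_data, short_negative_data])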

Part 3: Oversampling

Oversampling is an increase in the number of minority class examples in order to balance the dataset. It can be done in different ways, but this time I will use the SMOTE (Synthetic Minority Oversampling Technique) algorithm. Click here to learn more. This strategy is based on the idea of generating artificial examples that are “similar” to the minority class observations but do not duplicate them. You can find this algorithm in the imblearn package for Python.

I picked this approach because it takes only two lines of code to balance the dataset. Follow me:

# SMOTE lives in the imbalanced-learn (imblearn) package
from imblearn.over_sampling import SMOTE

features = data.drop("Response", axis=1)
target = data["Response"]

# balancing the data with synthetic minority-class examples
oversampler = SMOTE(random_state=2)
new_features, new_targets = oversampler.fit_resample(features, target)  # called fit_sample in older imblearn releases


Now, we can check whether our dataset is balanced:

sns.countplot(new_targets)
plt.show()

Yes, our dataset is now balanced, and this time we still keep a large number of observations. The balanced data needs to be split into new training and validation sets, and then we can build several different models and evaluate them:

# re-splitting the balanced data into training and validation sets (same 80/20 split as before)
features_train, features_val, targets_train, targets_val = train_test_split(new_features, new_targets, test_size=0.2, random_state=12)

LogReg = LogisticRegression()
LogReg.fit(features_train, targets_train)
acc_log_reg_train = round(LogReg.score(features_train, targets_train) * 100, 2)
acc_log_reg_val = round(LogReg.score(features_val, targets_val) * 100, 2)
print("LogReg accuracy train " + str(acc_log_reg_train))
print("LogReg accuracy val " + str(acc_log_reg_val))
guesses = LogReg.predict(features_val)
f1_score = metrics.f1_score(targets_val, guesses)
print("LogReg f1 " + str(f1_score))


gaussian = GaussianNB()
gaussian.fit(features_train, targets_train)
acc_gaussian_train = round(gaussian.score(features_train, targets_train) * 100, 2)
acc_gaussian_val = round(gaussian.score(features_val, targets_val) * 100, 2)
print("NB accuracy train " + str(acc_gaussian_train))
print("NB accuracy val " + str(acc_gaussian_val))
guesses1 = gaussian.predict(features_val)
f1_score = metrics.f1_score(targets_val, guesses1)
print("NB f1 " + str(f1_score))

decision_tree = DecisionTreeClassifier()
decision_tree.fit(features_train, targets_train)
acc_decision_tree_train = round(decision_tree.score(features_train, targets_train) * 100, 2)
acc_decision_tree_val = round(decision_tree.score(features_val, targets_val) * 100, 2)
print("decision tree accuracy train" + str(acc_decision_tree_train))
print("decision tree accuracy val" + str(acc_decision_tree_val))
guesses2 = decision_tree.predict(features_val)
f1_scoreDT = metrics.f1_score(targets_val, guesses2)
print("decision tree " + str(f1_scoreDT))

KNC = KNeighborsClassifier(n_neighbors=2)
KNC.fit(features_train, targets_train)
acc_KNC_train = round(KNC.score(features_train, targets_train) * 100, 2)
acc_KNC_val = round(KNC.score(features_val, targets_val) * 100, 2)
print("K Neighbor accuracy train" + str(acc_KNC_train))
print("K Neighbor accuracy val" + str(acc_KNC_val))
guesses3 = KNC.predict(features_val)
f1_scoreKNC = metrics.f1_score(targets_val, guesses3)
print("K Neighbor f1 " + str(f1_scoreKNC))

Well, I think this is a victory! The results of the models after oversampling are very good. Just look at the decision tree and K-Neighbor classifier scores!

I hope this article was useful for you. Feel free to comment on this article.
