Decision Tree on Imbalanced Dataset

Rani Farinda
4 min read · Sep 2, 2019


What is an Imbalanced Dataset?

An imbalanced dataset is a very common problem in data science. It is a condition where the classes are not represented equally; in other words, one class has many more instances than the others. This can cause several problems: the model may fail to classify the minority class, and accuracy alone becomes a misleading performance metric.
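To see why accuracy can be misleading here, consider a small sketch with made-up labels (they are not taken from the dataset in this post). A "model" that always predicts the majority class still gets a high accuracy, while the per-class recall shows that the minority class is never found. The helper names are my own, for illustration only.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall_per_class(y_true, y_pred):
    """For each class, the fraction of its instances that were predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}

# Hypothetical labels: 90 instances of class 0 and 10 instances of class 1.
y_true = [0] * 90 + [1] * 10
# A "model" that always predicts the majority class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))          # 0.9
print(recall_per_class(y_true, y_pred))  # {0: 1.0, 1: 0.0}
```

The accuracy looks fine at 0.9, but the recall of class 1 is 0.0, which is exactly the failure that accuracy hides.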

Solutions for Imbalanced Dataset

In this post, I use the Decision Tree algorithm on an imbalanced dataset. Before going to the code, let me describe the two most common solutions to the imbalanced dataset problem.
1. Oversampling
Oversampling is a technique to increase the number of instances of the minority class so that it equals that of the majority class. The simplest form, random oversampling, duplicates existing minority instances. Another method that you can use is SMOTE (Synthetic Minority Oversampling Technique), which creates new synthetic instances by interpolating between a minority instance and its nearest minority-class neighbours.
2. Undersampling
Undersampling, by contrast, is a technique to reduce the number of instances of the majority class so that it equals that of the minority class. One method that you can use is Random Under Sampling.

The Data

The dataset* used in this experiment is the Chinese Fall Detection dataset, which I downloaded from Kaggle.
So first, let’s take a look at the dataset.
1. Data shape
Our data consists of 16382 rows and 7 columns. The first column, named ‘Activity’, is the class label and has 6 classes (0, 1, 2, 3, 4, 5).

2. Data Tail
Here are the last 10 rows of the dataset.

3. NaN Values
This time, we will see if the dataset contains missing values.

There are no missing values in our dataset, so we can go on!

4. Data Distribution (Imbalanced)
The following is the distribution of instances on each class.

Our data here is very imbalanced. Every class has a different number of instances. As you can see, class ‘0’ has about 8 times as many instances as class ‘1’.
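A class distribution like this can be summarised in a few lines. The sketch below uses a made-up label list, not the real ‘Activity’ column, and the function name is my own. With pandas, `value_counts()` on the label column would give the raw counts in a similar way.

```python
from collections import Counter

def class_distribution(labels):
    """Count, share, and ratio to the smallest class, for each class label."""
    counts = Counter(labels)
    total = len(labels)
    smallest = min(counts.values())
    return {
        label: {
            "count": n,
            "share": n / total,
            "ratio_to_smallest": n / smallest,
        }
        for label, n in sorted(counts.items())
    }

# Hypothetical labels in which class 0 is 8 times as large as class 1.
labels = [0] * 80 + [1] * 10 + [2] * 10

for label, stats in class_distribution(labels).items():
    print(label, stats)
```

The `ratio_to_smallest` value makes a statement such as "8 times as many" easy to check for every class at once.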

Decision Tree Model

To see the effect of imbalanced data on the Decision Tree, I run 3 scenarios here. First, I fit the model on the data without any preprocessing step. Second, I use the Oversampling technique. Finally, I use the Undersampling technique.

No Preprocessing
The only thing I did to the data was split it into train and test sets using the train_test_split function, and then fit the model on the training set.

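# Fit the model, predict, and evaluate
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
acc1 = []
model.fit(X_train, y_train)
target_pred = model.predict(X_test)
# Store the accuracy on the held-out test set
acc1.append(accuracy_score(y_test, target_pred))

[Figure: Accuracy of DT with No Preprocessing]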

Oversampling
For oversampling, I use the SMOTE method, which is provided by the imblearn library.

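from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Apply oversampling to the training set only
print('Before Oversampling')
print(sorted(Counter(y_train).items()))
X_train, y_train = SMOTE().fit_resample(X_train, y_train)
print('After Oversampling')
print(sorted(Counter(y_train).items()))

[Figure: Accuracy of DT with Oversampling]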
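To make the idea behind SMOTE a bit more concrete, here is a simplified sketch of its interpolation step in plain Python. This is not the imblearn implementation: the function names, the toy points, and the choice of `k=2` are all my own assumptions. It picks a minority point, picks one of its nearest minority neighbours, and places a new point somewhere on the line between them.

```python
import random

def smote_like_point(minority, k, rng):
    """Create one synthetic point between a minority point and a near neighbour."""
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    # Interpolate: base + gap * (neighbour - base), with gap drawn from [0, 1).
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]

def oversample_minority(minority, target_size, k=2, seed=0):
    """Add synthetic points until the minority class has target_size points."""
    rng = random.Random(seed)
    n_new = target_size - len(minority)
    synthetic = [smote_like_point(minority, k, rng) for _ in range(n_new)]
    return minority + synthetic

# Hypothetical minority-class points with two features each.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
balanced = oversample_minority(minority, target_size=10)
print(len(balanced))  # 10
```

Because every new point lies between two existing minority points, the synthetic data stays inside the region the minority class already occupies, instead of being exact copies.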

Undersampling
For the undersampling technique, I use the Random Under Sampler method, also from the imblearn library, and here is the result.

from imblearn.under_sampling import RandomUnderSampler

# Split first, so that the test set keeps the original class distribution
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Apply Random Under Sampling to the training set only
rus = RandomUnderSampler(random_state=0)
X_train, y_train = rus.fit_resample(X_train, y_train)
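What random undersampling does can also be sketched in plain Python. Again, this is only an illustration of the idea and not the imblearn implementation; the function name and the toy data are my own. Each class is cut down, at random, to the size of the smallest class.

```python
import random
from collections import Counter, defaultdict

def random_undersample(X, y, seed=0):
    """Randomly keep only as many rows per class as the smallest class has."""
    rng = random.Random(seed)
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    n_min = min(len(rows) for rows in rows_by_class.values())
    X_res, y_res = [], []
    for label in sorted(rows_by_class):
        # Sample without replacement, so no row is kept twice.
        for row in rng.sample(rows_by_class[label], n_min):
            X_res.append(row)
            y_res.append(label)
    return X_res, y_res

# Hypothetical data: 7 rows of class 0 and 3 rows of class 1.
X = [[i] for i in range(10)]
y = [0] * 7 + [1] * 3

X_res, y_res = random_undersample(X, y)
print(sorted(Counter(y_res).items()))  # [(0, 3), (1, 3)]
```

Four of the seven class 0 rows are simply dropped, which is where the risk of losing useful information comes from.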

Discussion

Let’s take a look at the following bar plot of each accuracy.

[Figure: Accuracy]

No preprocessing and Oversampling showed almost the same performance (a difference of only 0.01%), which suggests that the class imbalance had little effect on the Decision Tree’s accuracy on this dataset. This is consistent with the view that Decision Trees cope reasonably well with imbalanced data, although accuracy alone does not show how well the minority classes are classified. On the other hand, the Undersampling technique obtained a noticeably lower accuracy. This may be caused by the reduced amount of training data after undersampling was performed. Random Under Sampling randomly removes instances so that each class has the same number of instances as the minority class (in this case, Class ‘1’ with 502 rows). Random Under Sampling is an easy method, but we may lose important data.
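To get a feeling for how much data undersampling throws away, the numbers from this post can be plugged into a quick calculation. This is only a rough sketch under two assumptions of mine: that the 502 rows of Class ‘1’ were counted in the training split, and that the test split is rounded up to a whole number of rows.

```python
import math

n_rows = 16382             # rows in the full dataset
test_size = 0.3            # fraction held out for testing
n_classes = 6              # classes 0 to 5
minority_train_rows = 502  # rows of Class '1' (assumed to be in the training split)

# Assumption: the test split is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

# After undersampling, every class keeps only as many rows as the minority class.
n_kept = n_classes * minority_train_rows

print(n_train, n_kept, round(n_kept / n_train, 3))  # 11467 3012 0.263
```

Under these assumptions, the tree in the Undersampling scenario is trained on only about a quarter of the rows available in the other two scenarios.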

For the full code, please visit:

https://github.com/160shelf/decision-tree-on-imbalanced-dataset

Dataset by:
Özdemir, Ahmet Turan, and Billur Barshan. “Detecting Falls with Wearable Sensors Using Machine Learning Techniques.” Sensors (Basel, Switzerland) 14.6 (2014): 10691–10708. PMC. Web. 23 Apr. 2017.
