Week 6 — Dealing with Imbalanced Data

Öner İnce
bbm406f19
Jan 5, 2020
Dealing with an imbalanced dataset is tough to handle

In our previous blog post we mentioned that we would use some techniques to solve the imbalance problem in our dataset. We have seen across various algorithms that an imbalanced dataset makes it impossible to get meaningful prediction results. Before going into the details of this process, we will talk about the common metrics used in the machine learning field to evaluate algorithms.

Accuracy is the most common metric in the field. However, it can give a very wrong impression of the results on an imbalanced dataset. For example, the distribution of target classes in our dataset is as follows:

If a naive classification model predicts every instance as Grade 5, it will have approximately 40% accuracy. Since we have 5 classes, 40% may not seem that bad; in reality, it is much worse. There are other concepts that measure a machine learning model better: recall, precision, F1 score, and the confusion matrix.
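The naive baseline is easy to verify with a quick sketch. The label distribution below is made up for illustration (our real dataset is only similar in that Grade 5 makes up roughly 40% of instances):

```python
from collections import Counter

# Hypothetical 5-class label distribution; Grade 5 is the majority (~40%).
labels = [5] * 40 + [4] * 25 + [3] * 15 + [2] * 12 + [1] * 8

# A naive model that always answers with the majority class.
majority_class, majority_count = Counter(labels).most_common(1)[0]
accuracy = majority_count / len(labels)
print(majority_class, accuracy)  # 5 0.4
```

The model never detects any of the other four grades, yet accuracy alone makes it look two times better than random guessing.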

Precision: The precision of a class defines how trustworthy the result is when the model answers that a point belongs to that class.

Recall: The recall of a class expresses how well the model is able to detect that class.

F1 Score: The F1 score of a class is the harmonic mean of precision and recall (2 × precision × recall / (precision + recall)); it combines the precision and recall of a class into one metric.
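To make the definitions concrete, here is a toy binary example computed by hand (the project itself relies on library implementations; these labels are invented):

```python
# Toy ground truth and predictions for a single positive class.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # how trustable a "positive" answer is
recall = tp / (tp + fn)     # how many actual positives were found
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

In a multi-class setting such as ours, these three numbers are computed per class and then averaged.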

These metrics can be summarized as follows:

Confusion matrix and metrics

To measure the performance of all the algorithms we used, we created a confusion matrix and evaluated each algorithm with it.
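A confusion matrix simply counts, for each true class, how often each class was predicted; the diagonal holds the correct predictions. A minimal sketch with invented 5-grade labels (not our actual predictions):

```python
from collections import defaultdict

# Hypothetical true labels and predictions for damage grades 1-5.
y_true = [5, 5, 5, 4, 4, 3, 2, 1, 5, 4]
y_pred = [5, 5, 4, 4, 5, 3, 5, 1, 5, 4]

# confusion[t][p] = number of times grade t was predicted as grade p.
confusion = defaultdict(lambda: defaultdict(int))
for t, p in zip(y_true, y_pred):
    confusion[t][p] += 1

# Diagonal entries are the correctly classified instances.
correct = sum(confusion[g][g] for g in range(1, 6))
print(correct, correct / len(y_true))  # 7 0.7
```

Off-diagonal cells show exactly which grades get confused with which, which a single accuracy number hides.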

To deal with the imbalance problem mentioned at the beginning of the post, we used an external library called imbalanced-learn.

We used two different resampling methods from this library: SMOTEENN and SMOTETomek. The data distributions created by these two methods are as follows:

We have observed better confusion matrices using these resampling methods.

As can be seen from the increase in the scores along the diagonal, the resampling methods had a positive effect. Accuracy was between 40% and 45% before resampling, and the algorithms were prone to predicting damage grade 5. After resampling there was a small decrease in accuracy for some algorithms, but the number of correctly predicted class labels increased. We tried different algorithms, and these resampling methods were able to produce better results with each of them.

Next week we will finish up our project, share the results from the different algorithms, and comment on which ones performed better than others.

See you next week!
