What data scientists keep missing about imbalanced datasets

Patrick Stewart
5 min read · Dec 4, 2021


Figure 1: (https://unsplash.com/photos/JKUTrJ4vK00)

Many data scientists fail to fully understand the problems imbalanced datasets cause and the methods available to alleviate them.

As data scientists, we come across many datasets in which some types of instances clearly dominate (the majority classes) while others are significantly underrepresented (the minority classes). This has significant implications for the practice of data science: naively training a model on such a dataset will likely produce a bias towards the majority classes. For example, if we were predicting heart disease from a dataset of 20 people with the disease and 80 without, a model that predicts "no disease" every time would achieve a seemingly solid accuracy of 80% and an F1-score of roughly 89% on the majority class, while never identifying a single patient with the disease.
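As a quick sanity check, the sketch below reproduces that arithmetic with scikit-learn, assuming a toy label array of 80 negatives and 20 positives and a classifier that always predicts the majority class (the labels here are illustrative, not the dataset from the linked repository).

```python
# A minimal sketch of the example above: 80 healthy (0) and 20 diseased (1)
# patients, and a "model" that always predicts "no disease".
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 80 + [1] * 20)   # hypothetical labels
y_pred = np.zeros_like(y_true)           # always predict the majority class

print(accuracy_score(y_true, y_pred))             # 0.80
print(f1_score(y_true, y_pred, pos_label=0))      # ~0.89 for the majority class
print(f1_score(y_true, y_pred, pos_label=1))      # 0.0 for the minority class
```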

Despite this being a well-known problem, too many data scientists ignore it and simply train a model without a real understanding of the imbalances within their dataset. The purpose of this article is to give you simple methods that can be used to address the problem. Please note that no method is perfect, so you need both a strong understanding of the problem and a broad repertoire of solutions.

The problem characteristics of imbalanced datasets

Firstly, it is important to understand why imbalanced datasets are a crucial problem that needs to be addressed. We can do this by looking at four key data issues:

1. Small disjuncts

- Explanation: the small disjuncts problem occurs when parts of a class are represented only by small, scattered clusters of examples, and these clusters have a much higher misclassification rate than the dataset overall.

- How is this a problem?: it particularly affects divide-and-conquer algorithms such as decision trees, where certain instance types, namely those belonging to a minority class, are likely to have very poor classification performance.

2. Lack of density

- Explanation: there is simply too little information (too few examples) about some classes.

- How is this a problem?: induction algorithms do not have enough data to make generalizations about the distribution of samples, with minority classes most at risk of being misrepresented in the model.

3. Noisy data

- Explanation: presence of noise in the dataset has a greater impact on the minority classes than on the other classes.

- How is this a problem?: because the minority classes have so few examples, even a handful of noisy instances can distort the decision boundary the model learns for them, significantly impacting the model.

4. Dataset shift

- Explanation: this occurs when the training and test datasets follow different distributions. This is a common issue and affects all sorts of categorization problems, often as a result of sample selection bias.

- How is this a problem?: with highly imbalanced datasets, the minority class is particularly sensitive to classification errors and therefore is impacted by the shift to a greater extent.

Data-led techniques to resolve this

Data-led techniques aim to reduce the skew between the underrepresented and overrepresented classes, either by increasing the representation of the minority classes (oversampling) or by reducing the representation of the majority classes (undersampling) in a transformed dataset on which models can then be trained. The figure below gives a more exhaustive list of the various techniques that can be used; in this article we focus on three of the most commonly used: random oversampling, random undersampling and SMOTE.

Figure 2: examples of data level approaches to the problem of imbalanced datasets

Random oversampling

Random oversampling is the process of randomly duplicating examples from the minority class and adding these to the training dataset, with the aim of achieving a much more balanced dataset. Whilst it benefits from being a relatively simple process, there is no such thing as a free lunch and oversampling is no exception: because it duplicates existing minority instances exactly, it can cause the model to overfit to them.

The simple nature of the method means it is very easy to implement in Python. The function below shows a from-scratch implementation of oversampling when we have a binary target class (1 or 0) in our dataset. The code for all the functions and the dataset used can be found at this github link.
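A minimal sketch of such a function, assuming a pandas DataFrame with a binary 0/1 column named target (the column name and random seed are illustrative placeholders, not the repository code):

```python
# Illustrative sketch of random oversampling for a binary target column.
import pandas as pd

def random_oversample(df: pd.DataFrame, target: str = "target",
                      random_state: int = 42) -> pd.DataFrame:
    counts = df[target].value_counts()
    minority_label = counts.idxmin()
    majority_label = counts.idxmax()

    minority = df[df[target] == minority_label]
    n_needed = counts[majority_label] - counts[minority_label]

    # Duplicate randomly chosen minority rows (with replacement) until balanced.
    duplicates = minority.sample(n=n_needed, replace=True,
                                 random_state=random_state)

    # Append the duplicates and shuffle the transformed dataset.
    return pd.concat([df, duplicates]).sample(frac=1, random_state=random_state)
```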

Random undersampling

Random undersampling is the process of randomly removing instances from the majority class or classes, resulting in far fewer majority-class examples in the transformed dataset. Undersampling is less commonly used than oversampling and is most relevant where there are enough instances of the minority class that a useful model can still be trained on the reduced dataset. As previously noted, there is no such thing as a free lunch: this technique loses information, as the deleted majority-class instances may be critical to forming a well-defined decision boundary.

The function below shows a from-scratch implementation of undersampling when we have a binary target class (1 or 0) in our dataset. The code for all the functions and the dataset used can be found at this github link.
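Again, a minimal sketch of such a function under the same assumptions (a pandas DataFrame with a 0/1 column named target; names and seed are placeholders rather than the repository code):

```python
# Illustrative sketch of random undersampling for a binary target column.
import pandas as pd

def random_undersample(df: pd.DataFrame, target: str = "target",
                       random_state: int = 42) -> pd.DataFrame:
    counts = df[target].value_counts()
    minority_label = counts.idxmin()
    majority_label = counts.idxmax()

    minority = df[df[target] == minority_label]
    majority = df[df[target] == majority_label]

    # Keep only as many randomly chosen majority rows as there are minority rows.
    majority_kept = majority.sample(n=counts[minority_label],
                                    random_state=random_state)

    # Combine and shuffle the transformed dataset.
    return pd.concat([minority, majority_kept]).sample(frac=1,
                                                       random_state=random_state)
```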

SMOTE

As noted previously, random oversampling is far from perfect; a better approach is to synthesize new examples from the minority class before fitting the model. The synthetic minority oversampling technique (SMOTE) does exactly this.

SMOTE works by choosing an instance in the minority class at random. The k nearest minority-class neighbours of that instance are then found (e.g. k = 5), and one of these neighbours is selected at random. A new synthetic instance is then created as a convex combination of the two points: x_new = x + λ(x_neighbour − x), with λ drawn uniformly from [0, 1].

This technique has proved effective but is less suitable when there is significant overlap between the minority and majority classes: in that case the synthetic points may be generated in regions dominated by the majority class, making them ambiguous or effectively noisy.

The code below shows how SMOTE can be implemented using the imbalanced-learn package (pip install imbalanced-learn). The code for all the functions and the dataset used can be found at this github link.
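A minimal sketch of that usage, run here on a synthetic scikit-learn dataset rather than the dataset from the linked repository:

```python
# Sketch of SMOTE via imbalanced-learn on a synthetic imbalanced dataset.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Roughly 90% / 10% class split as a stand-in for an imbalanced dataset.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)
print("Before:", Counter(y))

# k_neighbors controls how many minority neighbours SMOTE interpolates between.
smote = SMOTE(k_neighbors=5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print("After:", Counter(y_resampled))   # classes now balanced
```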

Conclusion

So, there you have it: this article has shown you three simple ways to tackle the issue of imbalanced datasets. Please note that none of these methods is a one-size-fits-all solution, and a good grasp of the range of techniques noted in figure 2 is essential. In addition, there are broader ways of tackling imbalanced datasets at the algorithmic, cost-sensitive and/or feature level.

Follow me here for more updates: https://twitter.com/Patrick74925271
