Imbalanced Classification: Strategies for Addressing the Problem

Datasigns SFU
SFU Professional Computer Science
Feb 9, 2023

Authors: Guneet Kher, Inderjeet Singh Bhatti, Nidhi Kantekar, Siddharth Goradia, Swagata Dutta

This blog is written and maintained by students in the Master of Science in Professional Computer Science Program at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit {sfu.ca/computing/mpcs}.

A disproportionate ratio of observations across the classes in your dataset results in imbalanced classification, a common challenge in machine learning. This can hinder accurate predictions and negatively impact model performance. This blog will discuss what imbalanced classification is, why it is a problem, and how to solve it.

What is Imbalanced Classification?

A skewed distribution of observations in each class characterizes an imbalanced dataset. This often results in the overrepresentation of one class; for instance, 90% of observations in one class and just 10% in another. This imbalanced distribution can cause a decline in prediction accuracy and decrease the overall performance of a machine-learning model.

Let’s imagine we are tasked with discovering fraudulent credit card transactions. In this case, the vast majority of transactions are expected to be authentic, with only a minimal fraction being considered fraudulent. Similarly, if we test individuals for lung cancer, the positive rate will be only a tiny fraction of those tested. Imbalanced data can arise in many situations like these:

  • Deciding whether or not a bank customer should receive a loan
  • A manufacturing company identifying defects in their products
  • Spam email filtering system
  • Systems for detecting intrusions
  • Alert systems for network faults
  • Satellite images detecting oil spills
  • Risk modelling for insurance
  • Detection of hardware faults

In the real world, it is rare to find perfectly balanced datasets, since the events we want to identify are often inherently rare. How to deal with imbalanced data efficiently has therefore been a common concern among data scientists.

Causes of imbalanced distribution of classes

An imbalanced distribution of classes in a classification problem may stem from various causes. There are two main groups of causes that we may want to consider:

  1. Data sampling
  2. Properties of the domain

Data sampling

Data sampling is the process of collecting data from the problem domain. The imbalanced distribution of examples across classes may result from how the data was sampled or collected. Causes include biases introduced during data collection, such as non-representative sampling of the population, and errors made during collection, such as improper labelling of examples.

  • Biased data sampling
  • Data collection errors

If the data is collected from a narrow geographical region, or a specific period, the allocation of classes (i.e. the different categories of data) may be quite different than if the data was collected from a wider geographical region or a longer period. Additionally, the distribution of classes may also be different if the data is collected in varying ways (e.g. using different methods or techniques).

For example, if a study is conducted on the prevalence of COVID-19 in Vancouver, the results may be skewed if the data is only collected from a small area or a short period. The results may differ if the data is compiled from a larger area or a longer period. Similarly, the results may also be different if the data is collected using different methods (e.g. surveys or interviews).

Mistakes may also have been made while gathering the data. One potential error is assigning incorrect labels to many of the samples. Alternatively, the systems or processes from which the samples were taken may have been impaired or damaged, leading to the imbalance.

In cases where the imbalance is due to biased data sampling or data collection errors, the imbalance can be rectified by using better sampling techniques and correcting the measurement error. It’s important to note that the choice of data collection and sampling methods depends on the specific requirements of the problem domain and the available resources. In some cases, collecting a perfectly balanced dataset might not be feasible.

Properties of the domain

The second group of causes for the imbalance in class distribution is the properties of the domain itself. This could include the natural distribution of the classes in the problem domain or the difficulty of collecting examples from certain classes. For instance, if the problem domain is a medical diagnosis task, it may be more difficult to collect samples of rare diseases than common ones. This could lead to an imbalance in the class distribution of the data.

One class may dominate the others because generating observations from the remaining classes is more costly in terms of time, money, computing capability or other resources. As a result, acquiring more samples from the domain to even out the class distribution is frequently unrealistic or impossible. In this case, a model must be trained to recognize the class differences from the data that is available.

Challenges with Imbalanced Data set

It is possible to identify various degrees of imbalance, which can then be used to determine whether additional steps are required to address the imbalance.

There are two main types of imbalance in data sets: slight imbalance, where the difference in the number of data points between the two classes is too small to be significant, and severe imbalance, where one class could have thousands of data points while the other has only ten.

It is not usually a problem if there is only a slight discrepancy between the classes. The issue can generally be addressed as if it were a regular classification task. However, if the classes are drastically imbalanced, it can be challenging to model and may necessitate the utilization of specialized methods.

Severe Imbalance

The classes that possess a substantial quantity of samples are referred to as the majority classes, while the class with a small number of samples is known as the minority class.

When dealing with an imbalanced classification issue, the minority class is usually the focus. This implies that a model’s ability to accurately predict the class label for the minority class is more significant than for the majority classes.

Due to the limited number of examples of the underrepresented class, it is difficult for a model to learn this class's characteristics and differentiate it from the majority class (or classes). This is because the majority class (or classes) can easily overwhelm the minority class. Many machine learning algorithms for classification predictive models are constructed and evaluated based on the assumption of a balanced distribution of classes. Therefore, if a model is applied without taking into account the minority class, it may focus only on learning the characteristics of the abundant observations, thus ignoring the instances from the underrepresented class which are of greater interest and whose predictions are more valuable.
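To make the problem concrete, here is a minimal sketch (assuming scikit-learn and a synthetic dataset with made-up parameters) of how a naive baseline that always predicts the majority class can still achieve high accuracy while never identifying a single minority example:

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data: roughly 99% of points in class 0, 1% in class 1
X, y = make_classification(n_samples=10000, n_features=10,
                           weights=[0.99, 0.01], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# A "model" that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
y_pred = baseline.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))       # close to 0.99
print("Minority recall:", recall_score(y_test, y_pred))  # 0.0

Accuracy looks excellent even though the model is useless for the minority class, which is exactly why imbalanced problems need special treatment.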

Different Types of Imbalance

As seen above, it is evident that imbalanced data is a big hindrance compared to working with standard, balanced data. To make matters worse, there are different types of imbalance that each need to be treated in different ways.

Between-Class

This type of imbalanced distribution arises when there is a significant discrepancy between the classes, with one class having an overwhelming amount of data points compared to the minority class.

This imbalance is visible in the case of credit card fraud which uses customer purchasing behaviour, transaction details and frequency to decide if a transaction is fraudulent or not. In this data set, the number of non-fraudulent data points will be extremely high compared to the fraudulent data points.

Given a sufficient number of samples from both classes, the model's precision will increase because the sampling distribution becomes more reflective of the true data distribution. However, by the law of large numbers, the majority class will be represented more accurately than the minority class, regardless of the total number of samples collected.

Within-class

Within-class imbalance occurs when there is an unequal distribution of samples within a single class. This can prove challenging for a machine learning model because it may focus too heavily on the dominant subgroup and neglect the minority subgroups as outliers.

Consider a binary classification problem where one class has a large number of samples, but within that class, there are subgroups with significantly different frequencies of occurrence. For example, in the fraud detection problem, the majority class may be “not fraudulent” transactions, but within that class, there may be subgroups such as “legitimate transactions” and “suspicious transactions.” Suppose a machine learning model is trained on this imbalanced data. In that case, it may focus too heavily on the dominant subgroup (legitimate transactions) and neglect the minority subgroup (suspicious transactions), leading to poor performance for that subgroup.

In this scenario, the model may result in biased or suboptimal performance, especially if the minority subgroup is of high importance. For example, in the fraud detection problem, accurately detecting the minority subgroup of “suspicious transactions” may be crucial for ensuring the system's security.

Fixing Dataset Imbalance

To address the issue of dataset imbalance, there are a variety of techniques that can be employed. Generally, these strategies can be categorized into two main categories: sampling methods and cost-sensitive methods.

Sampling methods

Sampling techniques involve either increasing the representation of the minority class through oversampling or decreasing the representation of the majority class through undersampling to achieve a more balanced dataset.

1. Oversampling

Simply put, oversampling involves generating instances belonging to the underrepresented class to reduce the gap between the number of data points and make the classification as unbiased as possible.

To reduce the disparity between the minority and majority classes, it is necessary to generate additional data points for the underrepresented class. The most frequent approach for this is to generate synthetic instances near existing samples or between two existing samples in the data space. This will help to make the imbalance of classes as negligible as possible, allowing for more accurate and reliable results.

However, adding synthetic data points to a dataset can be detrimental to the model's accuracy. Overfitting is a significant risk, as the artificial data points can end up amplifying noise rather than providing helpful information. Ultimately, it is crucial to be aware of the potential risks of adding synthetic data points, as they can significantly affect the precision of the model. Oversampling can be automated using methods such as SMOTE (Synthetic Minority Oversampling Technique).

SMOTE (Synthetic Minority Oversampling Technique)

SMOTE is a technique that generates new synthetic samples between existing minority-class data points, and its extensions can also take into account local density and the boundaries between classes. It performs oversampling and can be combined with cleaning techniques, such as undersampling, to remove redundant data.

SMOTE tries to achieve this oversampling without increasing noise or distorting the classification model. For each minority sample, the k-nearest neighbours within the underrepresented class are identified, and some of these neighbours are randomly selected; how many are chosen depends on the amount of oversampling desired. Synthetic samples are then randomly generated along the line segments joining the underrepresented instance and its selected neighbours, thus increasing the size of the minority class. This process is repeated until the target amount of oversampling is achieved.

SMOTE is an overrepresentation strategy that works to generalize the decision region for the minority class, making it larger and less specific. This allows the model to focus more on instances belonging to the underrepresented class without causing overfitting. This is beneficial as it enables the model to better identify and classify minority class samples, consequently improving its overall accuracy.
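As an illustration, the sketch below applies SMOTE before training, assuming the third-party imbalanced-learn (imblearn) package and reusing the X and y arrays from the earlier example:

from collections import Counter
from imblearn.over_sampling import SMOTE

print("Before:", Counter(y))   # heavily skewed towards class 0

# k_neighbors controls how many nearby minority samples are considered
# when interpolating each synthetic point (5 is the library default)
smote = SMOTE(k_neighbors=5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

print("After:", Counter(y_resampled))   # classes are now roughly equal

Note that resampling is normally applied only to the training split, so the test set keeps the original class distribution and the evaluation stays realistic.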

2. Undersampling

Undersampling entails eliminating instances from the overrepresented class to reduce the gap between the number of data points and make the classification as unbiased as possible.

Statistical experts commonly recommend this strategy, but it is only effective when enough instances remain in the reduced class. By lowering the majority class to the same number of points as the minority class, our estimate of the majority class's statistical properties becomes less precise. Nevertheless, this method does not involve adding artificial data points, thus avoiding any distortion of the data distribution.
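A minimal sketch of random undersampling, again assuming imbalanced-learn and the same X and y arrays as before:

from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class points until the classes are balanced
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)

print(Counter(y_resampled))   # majority class reduced to the minority's size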

Cost-sensitive methods

Cost-sensitive methods are a technique used to make a model more sensitive to the minority class. This is done by adjusting the cost of misclassifying samples into different classes. The goal is to make the model more sensitive to the minority class by assigning a higher cost to misclassifying samples from that class compared to samples from the majority class. This way, the model is encouraged to prioritize the minority class, making it more balanced and accurate.

These methods are very similar to over- and undersampling, but instead of changing the number of samples, the class weights are adjusted to achieve the same result. The two cost-sensitive methods are Up-weighting and Down-weighting. Up-weighting involves increasing the weight of the minority class, while Down-weighting involves decreasing the weight of the majority class. These methods make the model more sensitive to the minority class and can be used in combination with other techniques, such as over- and undersampling.

Cost-sensitive methods can be implemented using sklearn.utils.class_weight which can help alter the weights to make the model more sensitive to the minority class. The weights can be used with any sklearn classifier as the classification model.
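A rough sketch of this idea, assuming scikit-learn, a binary target y and the train/test split from the earlier example (the logistic regression is just an illustrative choice of classifier):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from sklearn.linear_model import LogisticRegression

# Weights inversely proportional to class frequencies
class_weights = compute_class_weight(class_weight="balanced",
                                     classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), class_weights)))

# Most sklearn classifiers also accept the class_weight argument directly
clf = LogisticRegression(class_weight="balanced")
clf.fit(X_train, y_train)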

In this case, class_weight is set to “balanced”, meaning that weights are assigned to each class in inverse proportion to its number of points. This is the recommended approach unless there is a specific reason to set the values differently. For example, suppose there are two classes and the minority class has ten times fewer points than the majority class. In that case, you can adjust the weighting to reflect this by providing a dictionary with the respective weights:

weights = {0:0.1, 1:1.0}
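This dictionary can then be passed to any sklearn classifier that accepts a class_weight argument; the random forest below is purely an illustrative choice:

from sklearn.ensemble import RandomForestClassifier

# The minority class (label 1) is weighted ten times more heavily than class 0
clf = RandomForestClassifier(class_weight=weights, random_state=42)
clf.fit(X_train, y_train)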

Many authors and data scientists have argued that cost-sensitive methods are a more effective way of dealing with imbalanced data than data sampling. While both approaches have their benefits and drawbacks, it is essential to consider which method is most suitable for the given dataset and task. Ultimately, the choice should be based on the unique demands of the project and the available data.

Comparing Cost-Sensitive vs Sampling methods

Cost-sensitive approaches alter the learning algorithm to account for the costs of incorrectly classifying various groups. This is accomplished by adjusting the weights given to each class or the classifier’s determination threshold. Instead of just maximizing overall accuracy, cost-sensitive approaches also aim to reduce the overall cost of misclassification.

On the other hand, sampling techniques include redistributing the training set’s examples to balance the distribution of the classes. The minority class may be oversampled, the majority class may be undersampled, or a combination of the two may be used to achieve this. Sampling techniques aim to balance the distribution of the classes so that the learning algorithm can learn from instances of all classes without being overloaded by examples of the dominant class.

Cost-sensitive and sampling methods can both be used to enhance a classifier’s performance on imbalanced data, but they take distinct approaches to the issue. Unlike sampling techniques, cost-sensitive strategies alter the learning algorithm rather than the training data. The choice between these techniques involves different trade-offs: cost-sensitive solutions may be harder to implement and require a deeper understanding, but they preserve more information; sampling techniques are simpler to implement but may result in a loss of information.

Conclusion

Imbalanced classification is a common issue in machine learning and data science. By understanding the causes and effects of imbalanced classifications, we can take steps to address the problem. Several strategies, such as resampling, using different evaluation metrics, and using cost-sensitive learning, can help mitigate the effects of imbalanced classification. By implementing these strategies and continuously monitoring our models, we can ensure that our models make fair and accurate predictions for all classes.
