## HANDLING IMBALANCED DATASETS IN ML | TOWARDS AI

# Balancing Act in Datasets of a Machine Learning Algorithm

## Techniques for mitigating the effects of training classifiers with imbalanced datasets in Python

# What happens when you train a classifier with imbalanced data?

When dealing with imbalanced classes, we may need to do some extra work and planning to make sure that our algorithms give us useful results.

In this blog, I examine just two classification techniques to illustrate the issue, but you should know that the problem generalizes. For good reason, supervised classification algorithms — which use labeled data — take class distributions into account. However, when we’re trying to detect classes that are important, but rare compared to the alternatives, it can be difficult to develop a model that catches them.

Here, after diving into the problem with some examples, I outline a few of the tried and true techniques for solving it.

# Low priors, high priorities

## Naive Bayes

Say you’re building a prediction model to detect fraud. You have a dataset with 10,000 rows and some feature columns, with each row labeled 1 for fraudulent transactions or 0 for valid transactions. For simplicity’s sake, I’ll focus on one feature, X. The breakdown looks like this:

Naive Bayes classification models predict the probability of class membership using the ratio formulation of Bayes’ theorem.

`Formula: Pr(C1|E) / Pr(C2|E) = (Pr(E|C1) * Pr(C1)) / (Pr(E|C2) * Pr(C2))`

The model predicts whichever side of the ratio is higher: Class 1 if the numerator is larger, Class 2 if the denominator is larger.

If the prior probability of Class 1 is very low and the prior probability of Class 2 is very high, a Naive Bayes model will often fail to predict Class 1, even if the evidence is very likely under Class 1 and very unlikely under Class 2.

Treating fraud as Class 1 and not fraud as Class 2, the calculation for a row in which feature X is not present is:

Pr(E|C1) = 90/100 = .9

Pr(C1) = 100/10,000 = .01

Pr(E|C2) = .1 (10 in every 100 valid transactions)

Pr(C2) = 9,900/10,000 = .99

(.9 * .01) / (.1 * .99) = .009/.099

Since the denominator is higher, the model will predict Class 2. This is in spite of the fact that the evidence favors fraud. 90 out of 100 cases of fraud do not have feature X, so this transaction fits the profile of most known frauds.

With the prior probability of Class 1 being so low, the output of the algorithm will nearly always favor Class 2, regardless of the evidence. In this case, the prior odds are 99 to 1 against fraud, so the likelihood ratio Pr(E|C1)/Pr(E|C2) would have to exceed 99 to tip the scales. Here it is only 9.
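The arithmetic above can be checked in a few lines of Python. This is only a sketch of the ratio calculation using the numbers from the example, not a full Naive Bayes implementation; the variable names are my own.

```python
# Counts and likelihoods taken from the fraud example in the text.
n_total = 10_000
n_fraud = 100                  # Class 1
n_valid = n_total - n_fraud    # Class 2 (9,900 rows)

# Likelihood of the evidence E ("feature X is not present") under each class.
p_e_given_fraud = 0.9
p_e_given_valid = 0.1

# Priors are simply the class frequencies.
prior_fraud = n_fraud / n_total
prior_valid = n_valid / n_total

# Ratio form of Bayes' theorem: above 1 means "predict fraud".
posterior_odds = (p_e_given_fraud * prior_fraud) / (p_e_given_valid * prior_valid)
prediction = 1 if posterior_odds > 1 else 0

# How strongly the evidence favors fraud, and how strongly it would need to.
likelihood_ratio = p_e_given_fraud / p_e_given_valid
break_even_ratio = prior_valid / prior_fraud

print(f"posterior odds (fraud : valid): {posterior_odds:.3f}")
print(f"prediction: {prediction}")
print(f"likelihood ratio: {likelihood_ratio:.0f}, needed: more than {break_even_ratio:.0f}")
```

The evidence favors fraud nine to one, but the priors pull the other way much harder, so the prediction is still 0.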

If you want a model that can distinguish between Class 1 and Class 2, this is bad news. Missing fraud has a high cost for both the bank and the cardholder. It’s in everyone’s interest to catch it and prevent it whenever it happens, even if it has a low probability in the grand scheme of banking transactions.

## Logistic Regression

Logistic regression is another classification algorithm that outputs probabilities. A line of best fit is plotted to the data on a log-odds scale and then the y-axis is converted to probabilities, turning the line into an s-curve. As x decreases and the curve flattens at the bottom, the probability approaches zero; as x increases and the curve flattens at the top, the probability approaches 1.

A low prior for Class 1 will make predictions of Class 1 difficult to come by because logistic regression chooses the line of best fit by maximum likelihood: it picks the line with the highest total log-likelihood. The log-likelihood of a candidate line is the sum of the log-likelihoods of the individual data points. Each one is found by projecting the data point from the x-axis to the curve, reading off the corresponding probability on the y-axis, and taking the log of the probability assigned to the point’s true class (p for Class 1, 1 − p otherwise). If there are very few cases of Class 1, the line of best fit may be one that rarely if ever assigns a high probability to Class 1. In those cases, the model can get a better likelihood score by correctly categorizing as many of the majority class as possible while ignoring the minority class.
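To see how lopsided that score can be, here is a small sketch that computes the total log-likelihood by hand for two made-up sets of predictions on data with the same class balance as the fraud example. Neither "model" is a fitted logistic regression; they are hypothetical constant predictors chosen only to show which way the score leans.

```python
import math

def log_likelihood(y_true, p_pred):
    """Sum of per-point log-likelihoods, given predicted P(Class 1) for each point."""
    return sum(
        math.log(p) if y == 1 else math.log(1 - p)
        for y, p in zip(y_true, p_pred)
    )

# Same class balance as the fraud example: 100 positives, 9,900 negatives.
y = [1] * 100 + [0] * 9_900

# Hypothetical model A: gives every row the base rate, so it never predicts Class 1.
p_base_rate = [0.01] * len(y)
# Hypothetical model B: gives every row a 50/50 chance.
p_coin_flip = [0.5] * len(y)

ll_a = log_likelihood(y, p_base_rate)
ll_b = log_likelihood(y, p_coin_flip)

print(f"base-rate model log-likelihood: {ll_a:.1f}")
print(f"coin-flip model log-likelihood: {ll_b:.1f}")
```

Model A never assigns Class 1 more than a 1% chance, yet its log-likelihood is far higher (closer to zero) than model B's, because the 9,900 majority rows dominate the sum.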

**To recap:** If you want to classify something using machine learning and it has a low prior probability, you’re going to run into trouble. Luckily, there are some strategies you can use to nudge your classifier to predict classes with low priors. I outline three of them here.

# Strategy 1: Change the priors with bootstrapping

You can change the class priors by resampling your data.

If the imbalance is small, you can randomly undersample your majority class. I wouldn’t recommend this option if your classes are highly imbalanced, because you would throw away most of your data.
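A rough sketch of random undersampling in plain Python shows how much data goes missing at the fraud example's class balance. The helper `random_undersample` is a hypothetical function written for this illustration (it assumes two classes and a known majority label); in practice you would more likely reach for a library sampler.

```python
import random

def random_undersample(rows, labels, majority=0, seed=0):
    """Keep every minority row plus an equally sized random sample of majority rows."""
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    minority_idx = [i for i, y in enumerate(labels) if y != majority]
    kept = sorted(rng.sample(majority_idx, len(minority_idx)) + minority_idx)
    return [rows[i] for i in kept], [labels[i] for i in kept]

# Same class balance as the fraud example: 9,900 valid rows and 100 frauds.
X = [[float(i)] for i in range(10_000)]
y = [0] * 9_900 + [1] * 100

X_small, y_small = random_undersample(X, y)

print(f"rows kept: {len(y_small)} of {len(y)}")
print(f"fraud share after resampling: {sum(y_small) / len(y_small):.2f}")
```

The classes are now balanced, but only 200 of the original 10,000 rows survive.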

Another option is to use either random oversampling or synthetic oversampling on your minority class. Imblearn’s SMOTE creates synthetic samples by finding each minority datapoint’s nearest neighbors within the minority class (5 is the default) and placing new points on the line segments between them. The number of points on each segment varies depending on the number of samples needed to balance the classes.

The image below shows data before and after SMOTE:

You can see that there are now additional data points for the minority class clustered between the existing ones.
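The interpolation idea behind SMOTE can be sketched in plain Python. This is a simplified illustration, not imblearn's implementation: the function `smote_like` and its parameters are hypothetical, it picks base points at random rather than allocating a quota per sample, and it uses a brute-force neighbor search.

```python
import math
import random

def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the base point (excluding itself).
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbors)
        gap = rng.random()  # how far along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

# Toy minority class with two features.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5], [3.0, 1.5], [3.5, 3.0]]
new_points = smote_like(minority, n_new=4, k=2)

for point in new_points:
    print([round(v, 2) for v in point])
```

Every synthetic point sits on a segment between two real minority points, which is why the new samples cluster between the existing ones rather than spreading into majority territory.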

You can use SMOTE in an “imblearn” pipeline with a classifier and then fit the pipeline to your data like so:

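    # Assumes X_train and y_train are already defined.
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline
    from sklearn.linear_model import LogisticRegression

    smote = SMOTE()  # k_neighbors defaults to 5
    cls = LogisticRegression()

    # The imblearn Pipeline applies the sampler during fit only, not at predict time.
    pipe = Pipeline([('smt', smote), ('cls', cls)])
    pipe.fit(X_train, y_train)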

SMOTE has a number of variants that you may wish to consider, such as SMOTENC, which is designed to handle categorical features. There are detailed descriptions in the user guide.

# Strategy 2: Adjust the loss function

Some classifiers, including scikit-learn’s LogisticRegression, accept an optional `class_weight` parameter. You can use class weights to adjust the loss function so that your model won’t optimize by getting correct predictions on the majority class only. You assign class weights using a dictionary with the ratio you want:

`class_weight={0: 1, 1: 10}`

Here’s what this looks like with various weight combinations for log loss (**−(y·log(p) + (1−y)·log(1−p))**) when the correct classification is 1.

Values of .5 or greater on the x-axis are predictions of 1, with low to high probability. Values lower than .5 on the x-axis are predictions of 0 (not Class 1), with high to low probability. The higher the probability assigned to 1 (the correct class), the lower the log loss. The higher the probability the model assigns to 0 (the incorrect class), the higher the log loss. The higher the class weights we assign, the steeper the penalties for incorrect predictions.
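The same relationship can be tabulated with a few lines of Python. This is a minimal sketch of the idea, assuming the class weight simply multiplies the log loss of each example by the weight of its true class; `weighted_log_loss` is a hypothetical helper, not a library function.

```python
import math

def weighted_log_loss(y, p, class_weight):
    """Log loss for a single prediction, scaled by the weight of the true class."""
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return class_weight[y] * loss

weights_plain = {0: 1, 1: 1}
weights_heavy = {0: 1, 1: 10}

# The true class is 1; p is the probability the model assigns to Class 1.
for p in (0.1, 0.35, 0.5, 0.9):
    plain = weighted_log_loss(1, p, weights_plain)
    heavy = weighted_log_loss(1, p, weights_heavy)
    print(f"p={p:<5} weight 1: {plain:.3f}   weight 10: {heavy:.3f}")
```

With a weight of 10 on Class 1, a confident miss on a minority case costs ten times as much, so the optimizer can no longer afford to ignore those rows.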

# Strategy 3: Change the threshold for prediction

If your model consistently assigns true cases of your minority class probabilities of 30 or 40%, so that they ‘fly under the radar’ of the default 50% cutoff, you may want to lower the threshold for a positive prediction.

To find out if this is happening, use your classifier’s predict_proba() method. Then you can make a custom prediction list with a threshold that catches more cases of your target class. The code should look something like this:

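    # Probability of Class 1 for each row of X_test
    probabilities = cls.predict_proba(X_test)[:, 1]

    # Predict 1 whenever the probability clears the lowered threshold
    y_hat = [1 if p > 0.35 else 0 for p in probabilities]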

Beware: If you move the threshold too far, you may get more false positives in the deal than you bargained for. Be sure you plot a confusion matrix and compare the F1 score before and after you move the threshold so you can make adjustments as needed.
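Here is a small before-and-after comparison in plain Python. The labels and probabilities are made up for illustration, and the helpers `confusion_counts` and `f1_from_counts` are hypothetical stand-ins for whatever metrics library you normally use.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def f1_from_counts(tp, fp, fn):
    """F1 score computed from confusion-matrix counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up labels and predicted probabilities of Class 1.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probabilities = [0.80, 0.45, 0.40, 0.30, 0.38, 0.20, 0.15, 0.10, 0.05, 0.02]

results = {}
for threshold in (0.5, 0.35):
    y_hat = [1 if p > threshold else 0 for p in probabilities]
    tp, fp, fn, tn = confusion_counts(y_true, y_hat)
    results[threshold] = (tp, fp, fn, tn, f1_from_counts(tp, fp, fn))
    print(f"threshold={threshold}: tp={tp} fp={fp} fn={fn} tn={tn} "
          f"f1={results[threshold][4]:.2f}")
```

On this toy data, lowering the threshold from 0.5 to 0.35 catches two more true cases at the cost of one false positive, and the F1 score rises from 0.40 to 0.75. On real data the trade-off can easily go the other way, which is why the comparison is worth running.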

# Summing Up

You can deal with imbalanced classes by:

- Changing the priors with resampling
- Adjusting the loss function
- Changing the probability threshold for prediction

In some cases, you may need to use all three to make a model that can predict a rare class consistently.

**Resources**

Here is a project I did with Helen Levy-Myers and Adam Blomfield that uses some of the above techniques:

Here are some helpful links: