The weather has been a bit imbalanced of late, here in Mallorca.

Dealing with Highly Imbalanced Classes in Classification Algorithms

Daniel Bestard Delgado
bluekiri
6 min read · Oct 31, 2017


On the 26th of October, I had the pleasure of giving a talk on behalf of Bluekiri at PyData Mallorca, an international community of users and developers of data analysis tools.

The aim of the talk was to tackle one of the hottest topics in classification algorithms, due to how often it arises in practice: dealing with highly imbalanced classes.

Two of the simplest examples that the reader might have heard of before are classifying whether an email is spam and whether a credit card transaction is fraudulent.

These two cases are good examples of highly imbalanced classes because spam emails and credit card frauds can be considered rare events.

Given that Bluekiri is a company focused on the competitive travel industry, let me introduce a simplified version of a real scenario in which Bluekiri was able to provide a successful solution to the Online Travel Agency (OTA) Logitravel by performing extensive research on the topic of highly imbalanced classes. Logitravel, like any other OTA, receives requests for the availability of a myriad of products, so OTAs have to check the price of such requests with their providers in real time. If a provider does not have enough technological infrastructure to return an answer every time, it may cut the connection with the OTA in order to avoid breakdowns. Therefore, it is of high interest to the OTA to send availability requests to its providers only when the probability of a sale is high enough. That is, the goal of Bluekiri was to accurately estimate the probability that Logitravel's requests end up as conversions. This is a classification problem with highly imbalanced classes, given that a large percentage of requests do not turn into conversions.

Having motivated the topic of highly imbalanced classes in classification, let’s move on to some of its important issues. Let’s start with a kind of “scary” sentence that will make sense in the following lines: doing classification when classes are highly imbalanced leads to underestimation of the conditional probabilities of the minority class. To visualize this, we use an image from the academic paper Improving Class Probability Estimates for Imbalanced Data by Byron C. Wallace and Issa J. Dahabreh.

The two plotted distributions correspond to a single continuous predictor (a simplified version of the classification problem). The left distribution corresponds to the majority class, whose data points are represented as squares, and the right distribution to the minority class, whose data points are represented as crosses. The red curve is a sigmoid fit when the problem is imbalanced and the dotted curve is the same sigmoid fit when the problem is balanced (we will see later how to make it balanced). The vertical axis corresponds to the probability that an observation belongs to the class represented by crosses. Note that the red curve is consistently below the dotted curve, which provides visual evidence of the “scary” sentence from the previous paragraph: when the classes are highly imbalanced, classification algorithms underestimate the conditional (on the predictors) probability of the minority class.
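To make the effect concrete, here is a minimal, hypothetical simulation (not from the talk): one Gaussian predictor per class, a 99:1 class ratio, and a logistic regression fitted on the imbalanced data versus on a balanced subsample.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Majority class (label 0) centred at 0; minority class (label 1) at 2.
x_maj = rng.normal(0.0, 1.0, size=9900)
x_min = rng.normal(2.0, 1.0, size=100)
X = np.concatenate([x_maj, x_min]).reshape(-1, 1)
y = np.concatenate([np.zeros(9900), np.ones(100)])

# Sigmoid fit on the imbalanced data.
clf_imb = LogisticRegression().fit(X, y)

# Sigmoid fit on a balanced subsample (undersampled majority class).
idx = rng.choice(9900, size=100, replace=False)
X_bal = np.concatenate([x_maj[idx], x_min]).reshape(-1, 1)
y_bal = np.concatenate([np.zeros(100), np.ones(100)])
clf_bal = LogisticRegression().fit(X_bal, y_bal)

# The imbalanced fit assigns a much lower minority-class probability
# to the same point, mirroring the gap between the two curves.
x0 = np.array([[1.0]])
print(clf_imb.predict_proba(x0)[0, 1])  # low
print(clf_bal.predict_proba(x0)[0, 1])  # noticeably higher
```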

Once the consequences of dealing with highly imbalanced classes are understood, it is time to provide some solutions to remove the bias in the probability estimates seen above. A very common procedure is to make the problem balanced, or at least less imbalanced. The most commonly used sampling strategies that achieve this goal are listed below (a code sketch illustrating all three follows the list):

- Undersampling (balanced): keep all the observations from the minority class and sample without replacement from the majority class, where the number of observations sampled equals the sample size of the minority class.

- Upsampling (balanced): decide how many multiples of the minority-class sample size to use, then sample with replacement from the minority class and without replacement from the majority class. Be careful with this strategy, because repeating the same minority-class observations risks overfitting.

- Negative downsampling (imbalanced): different sample sizes are tried in this procedure. In each sample, all the observations from the minority class are kept and a different number of observations is taken from the majority class by sampling without replacement.
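Here is a minimal NumPy sketch of the three strategies; `X_maj` and `X_min` are hypothetical arrays holding the majority- and minority-class observations (labels 0 and 1).

```python
import numpy as np

rng = np.random.default_rng(42)

def undersample(X_maj, X_min):
    """Balanced: keep all minority observations; sample the majority
    class without replacement down to the minority sample size."""
    idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
    X = np.concatenate([X_maj[idx], X_min])
    y = np.concatenate([np.zeros(len(X_min)), np.ones(len(X_min))])
    return X, y

def upsample(X_maj, X_min, k=2):
    """Balanced: draw k * len(X_min) observations per class, with
    replacement from the minority class and without replacement from
    the majority class. Repeated minority rows risk overfitting."""
    n = k * len(X_min)
    idx_min = rng.choice(len(X_min), size=n, replace=True)
    idx_maj = rng.choice(len(X_maj), size=n, replace=False)
    X = np.concatenate([X_maj[idx_maj], X_min[idx_min]])
    y = np.concatenate([np.zeros(n), np.ones(n)])
    return X, y

def negative_downsample(X_maj, X_min, ratio=3):
    """Imbalanced: keep all minority observations; sample
    ratio * len(X_min) majority observations without replacement."""
    n_maj = ratio * len(X_min)
    idx = rng.choice(len(X_maj), size=n_maj, replace=False)
    X = np.concatenate([X_maj[idx], X_min])
    y = np.concatenate([np.zeros(n_maj), np.ones(len(X_min))])
    return X, y
```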

In our project, after trying the previous sampling strategies as well as others of higher complexity, which are out of the scope of this article, our team came to the conclusion that the negative downsampling approach was the most appropriate one. Hence, the next question that arises is:

How many observations from the majority class should we use?

In order to answer this question, we had to choose a metric that properly measures the efficacy of a classifier and is insensitive to highly imbalanced classes. Insensitivity to highly imbalanced classes is crucial due to the accuracy paradox: in a case of highly imbalanced classes, always predicting the majority class can lead to very good accuracy scores even though the algorithm is poor. Among the several metrics we could use, we decided to apply the normalized cross-entropy, which is introduced in the academic paper Practical Lessons from Predicting Clicks on Ads at Facebook. Explaining how this metric is constructed is out of the scope of this article, but we must keep in mind that it is a function of the cross-entropy and, therefore, the smaller the value, the better the performance of the algorithm.
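In brief, following our reading of that paper's definition, it is the model's average log loss divided by the entropy of the empirical positive rate, as in this minimal sketch:

```python
import numpy as np

def normalized_cross_entropy(y_true, p_pred, eps=1e-15):
    """Average log loss of the predictions, normalized by the entropy
    of the empirical positive rate, so values below 1 beat a baseline
    that always predicts the base rate."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    # Cross-entropy (log loss) of the model's predictions.
    ce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    # Entropy of the baseline that always predicts the empirical rate.
    r = np.clip(y.mean(), eps, 1 - eps)
    baseline = -(r * np.log(r) + (1 - r) * np.log(1 - r))
    return ce / baseline
```

Let's now look at the performance of negative downsampling in our project using this metric.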

The previous figure displays the normalized cross-entropy for different numbers of observations of the majority class in the negative downsampling procedure when a logistic regression was fitted (we show the case of logistic regression because it is a well-known classification algorithm, but other models of higher complexity were used in the study). Note that when the number of observations in the majority class is too small or too large, the performance is poor. In our case, we found that the optimum was to have three times as many observations in the majority class as in the minority class.
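As a sketch of the procedure, reusing the hypothetical `negative_downsample` and `normalized_cross_entropy` helpers from above and assuming a held-out validation set `X_val`, `y_val`:

```python
from sklearn.linear_model import LogisticRegression

scores = {}
for ratio in (1, 2, 3, 5, 10, 20):
    # Draw a negatively downsampled training set at this ratio.
    X_train, y_train = negative_downsample(X_maj, X_min, ratio=ratio)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    # Score on held-out data, where the true class ratio is preserved.
    p_val = clf.predict_proba(X_val)[:, 1]
    scores[ratio] = normalized_cross_entropy(y_val, p_val)

best_ratio = min(scores, key=scores.get)  # the optimum in our study was 3
```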

Finally, a last issue to keep in mind is to calibrate the predicted probabilities. Remember that when classes are highly imbalanced, the probabilities of the minority class are underestimated. One can check whether the predicted probabilities are calibrated by using the well-known calibration plot. The horizontal axis represents the mean predicted probability in each bin and the vertical axis the observed fraction of observations with label 1 that fall into that bin. The calibration plot from our study has the following shape in the case of logistic regression.

The dotted diagonal line represents the situation where the probabilities are perfectly calibrated and the blue curve represents the calibration of the predicted probabilities from the logistic regression. Note that the probabilities are very well calibrated!
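For readers who want to reproduce such a plot, a minimal scikit-learn sketch (assuming a fitted classifier `clf` and held-out data `X_val`, `y_val`):

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

p_val = clf.predict_proba(X_val)[:, 1]
# Observed fraction of positives per bin vs. mean predicted probability.
frac_pos, mean_pred = calibration_curve(y_val, p_val, n_bins=10)

plt.plot([0, 1], [0, 1], "k--", label="perfectly calibrated")
plt.plot(mean_pred, frac_pos, marker="o", label="logistic regression")
plt.xlabel("mean predicted probability")
plt.ylabel("observed fraction of label 1")
plt.legend()
plt.show()
```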

Two common procedures to calibrate probabilities are Platt’s scaling, where the predicted probabilities play the role of predictor in a logistic regression whose outcome is the true labels, and isotonic regression, which is a nonparametric approach. In the case of logistic regression, we did not have to calibrate the probabilities because they were already calibrated. In fact, when we tried to calibrate them using these two methods, they got worse, so be careful with that!
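As a sketch of both methods, again assuming held-out probabilities `p_val` and labels `y_val` (in practice the calibrator should be fitted on data separate from the data used for evaluation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

# Platt's scaling: a logistic regression with the (logit of the)
# predicted probabilities as predictor and the true labels as outcome.
p = np.clip(p_val, 1e-15, 1 - 1e-15)
logit = np.log(p / (1 - p)).reshape(-1, 1)
platt = LogisticRegression().fit(logit, y_val)
p_platt = platt.predict_proba(logit)[:, 1]

# Isotonic regression: a nonparametric, monotone fit of labels
# against predicted probabilities.
iso = IsotonicRegression(out_of_bounds="clip").fit(p_val, y_val)
p_iso = iso.predict(p_val)
```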
