Deriving true probability from an ensemble classifier

Clément Schaff
GAMMA — Part of BCG X
6 min read · May 18, 2018

This article was co-authored with Antti Niskanen

One of the most common uses of data science in a business context is to predict binary outcomes, as they give organizations the opportunity to take action before it’s too late.

For example, a company might look to predict churn so that it can take steps to retain its customers. Churn is of particular interest in industries like banking or insurance, where the cost of customer acquisition tends to be high compared to the cost of retention.

To efficiently target and size the steps it would need to take, a company would need to compare the expected loss from churn — the value at risk, or VaR — against the cost of mitigating that churn. The VaR in this context would be the probability of an individual customer being a churner in a given period multiplied by the expected revenue (or profit) generated by that customer if he does not churn.

A subscription-based company like a telecom provider, for example, may want to offer free service to retain its existing subscribers. If it predicts a 50% probability of churn for a given customer, it could give that customer three months of free service without having the loss of subscription fees impact its bottom line. But if the probability of churn is only 20%, eliminating that customer’s subscription fees for three months would lower its profit.
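To make the break-even logic concrete, here is a back-of-the-envelope sketch. The monthly fee and the expected remaining tenure are made-up numbers chosen purely for illustration, and the sketch assumes the offer fully prevents churn:

# Hypothetical numbers for illustration only
monthly_fee = 30.0       # subscription fee per month
expected_tenure = 12     # months of revenue at stake if the customer stays
offer_months = 3         # length of the free-service retention offer

def retention_offer_pays_off(p_churn):
    value_at_risk = p_churn * monthly_fee * expected_tenure  # expected loss from churn
    offer_cost = offer_months * monthly_fee                  # cost of the giveaway
    return value_at_risk >= offer_cost

print(retention_offer_pays_off(0.50))  # True: 180 at risk vs. 90 offer cost
print(retention_offer_pays_off(0.20))  # False: 72 at risk vs. 90 offer cost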

The drawbacks of ensemble methods

There are a lot of models available to predict binary outcomes. Ensemble methods such as random forest (RF) are popular because they are easy to implement, do not require a lot of feature engineering, and can model the data using non-linear functions.

But even if they are very good at ranking observations from the most to the least probable positive outcome, one drawback of ensemble methods is that they don't predict the probability of the outcome being one or zero; they output a score that has no direct statistical interpretation. While such a score is good enough to plot an ROC curve and show a fancy AUC, it's insufficient when it comes to providing real businesses with usable insights. In order to act wisely, an organization would want to know the actual churn probability.

Adapting classifiers for business

Luckily, very simple Bayesian logic allows us to transform any score predicted by an (ensemble) classifier into a probability that can be used in a business context. All we need is a prior on the distribution of positive and negative outcomes in the sample population, and Bayes' rule does the rest:

P(1 | score) = P(score | 1) · P(1) / [ P(score | 1) · P(1) + P(score | 0) · P(0) ]

The most straightforward priors would set P(1) and P(0) equal to the observed shares of positive and negative outcomes in the training set; in practice, P(1) is simply the current churn rate. The conditional probability densities P(score | 1) and P(score | 0) can then be estimated from histograms of the scores in each of the two classes.

The following code takes the classifier output scores and the ground-truth labels as input and returns the "true" probabilities. With the popular sklearn package, the scores can be obtained from the "predict_proba" method; pyspark.ml exposes a similar pseudo-probability score.

import numpy as np

def plotcorrected(clf_score, true_labels):
    # clf_score: length-n numpy array of classifier scores
    # true_labels: length-n boolean numpy array of ground-truth classes

    nbins = 15
    # nbins edges define nbins - 1 histogram bins
    bins = np.linspace(min(clf_score), max(clf_score), nbins, endpoint=True)

    # p(score|1)
    Pscoregiven1, _ = np.histogram(clf_score[true_labels], bins, density=True)
    # p(score|0)
    Pscoregiven0, _ = np.histogram(clf_score[np.logical_not(true_labels)], bins, density=True)

    p1 = sum(true_labels) / true_labels.shape[0]  # P(1)
    p0 = 1 - p1                                   # P(0)

    # p(score|1)P(1)
    up = p1 * Pscoregiven1
    # p(score|1)P(1) + p(score|0)P(0) = P(score)
    down = p1 * Pscoregiven1 + p0 * Pscoregiven0

    # desired P(1|score), one value per bin
    true_probs = up / down

    return bins, true_probs
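As a quick illustration of how the function above might be used, here is a minimal sketch with scikit-learn. The data set and model settings are arbitrary placeholders; any binary classifier with a predict_proba method works the same way.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Purely illustrative data; in practice this would be your churn data set
X, y = make_classification(n_samples=10000, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]   # pseudo-probability of the positive class

# The correction is estimated on held-out data, not on the training set (see note 2 below)
bins, true_probs = plotcorrected(scores, y_test.astype(bool))
# true_probs[i] is the corrected P(1 | score) for scores between bins[i] and bins[i + 1]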

There are alternative methods to calibrate the probabilities given by a classifier, as discussed in Niculescu-Mizil and Caruana [2005]¹; some of them are available in the calibration module of scikit-learn².
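For completeness, here is a minimal sketch of the scikit-learn route, reusing the X_train, y_train, and X_test placeholders from the snippet above. CalibratedClassifierCV wraps the base classifier and learns a mapping from score to probability on held-out folds:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

# 'isotonic' fits a monotonic score-to-probability mapping; 'sigmoid' is Platt scaling.
# cv=5 keeps the calibration data separate from the data each forest is trained on.
calibrated = CalibratedClassifierCV(RandomForestClassifier(n_estimators=200, random_state=0),
                                    method="isotonic", cv=5)
calibrated.fit(X_train, y_train)
calibrated_probs = calibrated.predict_proba(X_test)[:, 1]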

Random forest success and failure

To illustrate that this type of Bayesian check is indeed worth performing if you intend to interpret classifier scores as probabilities in your business application, let's construct a simple example where we can see a random forest both succeed and fail miserably at predicting the class probability, even though the data set is as simple as it gets.

Going back to our telecom provider churn example, we might want to include in the predictive model all of its customers' characteristics (subscription type, contract duration, current usage or amount paid on top of it, etc.) as well as external factors such as seasonality, competitive offers, etc. It's easy to think of hundreds of factors that might be relevant, and it might be hard to choose between them a priori. This is where ensemble methods are especially convenient: if your data set is large enough, adding a lot of features, even ones with weak predictive power, rarely degrades the quality of the prediction to a significant degree. But as the example below makes clear, there is a drawback: it can greatly distort the probability scores the classifier produces out of the box.

Suppose we have data from an underlying two-dimensional normal distribution (see Exhibit 1). The two classes, 0 (no churn) and 1 (churn), have slightly different means (the data is balanced in this illustration, but the method is robust to unbalanced data sets). On top of these two informative dimensions there are 48 others containing nothing but Gaussian noise, which we know does not correlate with the class.

Exhibit 1: Leading dimensions of the data
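For readers who want to reproduce the setup, here is one way such toy data could be generated. The means, variances, and sample size are assumptions on our part, picked only to roughly mimic the exhibits:

import numpy as np

rng = np.random.default_rng(0)
n_per_class = 5000          # balanced classes (assumed sample size)

# Two informative dimensions: class 1 has a slightly shifted mean
X0_signal = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, 2))
X1_signal = rng.normal(loc=0.7, scale=1.0, size=(n_per_class, 2))

# 48 dimensions of pure Gaussian noise, identical for both classes
X0_noise = rng.normal(size=(n_per_class, 48))
X1_noise = rng.normal(size=(n_per_class, 48))

X = np.vstack([np.hstack([X0_signal, X0_noise]),
               np.hstack([X1_signal, X1_noise])])
y = np.concatenate([np.zeros(n_per_class, dtype=bool),
                    np.ones(n_per_class, dtype=bool)])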

What happens if we train a model with just the two leading dimensions (where all the signal is), and then with all 50? As expected, both seem to work well: we get nice ROC curves and AUCs of 0.77 and 0.75, respectively, even though the two classes overlap a lot. According to Exhibit 2, there's no indication that something is wrong. Granted, handpicking the leading dimensions seems to result in slightly better performance, which in many applications is worth money.

Exhibit 2: The ROC curves. The dashed line is a reference equal to random guessing
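Continuing that sketch (same X and y as above), the two models can be fitted and compared like this, and either score array can then be fed into plotcorrected to reproduce curves like those in Exhibit 3:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Model on the two informative dimensions only
rf_2d = RandomForestClassifier(n_estimators=200, random_state=0)
rf_2d.fit(X_train[:, :2], y_train)
scores_2d = rf_2d.predict_proba(X_test[:, :2])[:, 1]

# Model on all 50 dimensions (2 informative + 48 noise)
rf_50d = RandomForestClassifier(n_estimators=200, random_state=0)
rf_50d.fit(X_train, y_train)
scores_50d = rf_50d.predict_proba(X_test)[:, 1]

print(roc_auc_score(y_test, scores_2d), roc_auc_score(y_test, scores_50d))

# Corrected probabilities per score bin for the 50-dimensional model
bins, true_probs = plotcorrected(scores_50d, y_test)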

But what about the relationship between the classifier score and the corrected Bayesian estimate? As we see with the two leading dimensions in Exhibit 3, the probability correction is relatively minor and wouldn’t lead to wrong conclusions in churn modelling or other business predictions such as purchasing behavior.

In the 50-dimensional case, however, the raw classifier score is way off as an estimate of the probability. Once the correction is applied, we can use the corrected values instead.

Exhibit 3: Correction curves for probability. The dashed line is a reference that we should hit if all is well

Would we be able to see this from the ROC curve? Not from that curve alone. But P(RF score | class) is in fact equal to the negative derivative of the true (or false) positive rate with respect to the score threshold, so a parametrization of these rates as functions of the score, rather than just the ROC curve that plots one rate against the other, would work just as well. That parametrization is precisely what is needed to produce a plot like the one in Exhibit 3.
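As a rough numerical check of that relationship, one could differentiate the rates returned by scikit-learn's roc_curve (reusing the scores_50d and y_test placeholders from the earlier sketch); up to discretization noise, the result should match the histogram-based densities used above:

import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, scores_50d)
finite = np.isfinite(thresholds)   # drop the artificial first threshold, if any

# TPR(t) = P(score >= t | 1), so p(score | 1) = -dTPR/dt (and likewise for FPR)
p_score_given_1 = -np.gradient(tpr[finite], thresholds[finite])
p_score_given_0 = -np.gradient(fpr[finite], thresholds[finite])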

Conclusion

We can therefore conclude that it is worthwhile to be suspicious of classifier output probabilities. That's especially true if you're using tools out of the box, whether closed or open-source, since the scores output by the RF implementations in Python, Spark, or R might differ from one another. You need to check the math. Luckily, our favorite Bayes rule can come to the rescue.

What about business impact? Over- or underestimating the churn probability, as detailed above, can have a significant financial impact. For instance, if we decided to offer discounts based on a churn probability that is overestimated, that could result in shrinking margins. On the other hand, underestimating churn can lead to insufficient action, as illustrated by the pattern in Exhibit 3.

Indeed, if the raw output of the RF yields a churn rate (probability) of 70% when the reality is actually closer to 90%, that’s a big — and unnecessary — gap. And one that will cost a company a lot of money.

¹ Predicting Good Probabilities with Supervised Learning, A. Niculescu-Mizil & R. Caruana, ICML 2005

² Calibration should always be performed on a different set than the one used to train the classifier to avoid overfitting.
