The lens matters: your imbalanced classification model might not be as bad as you think

Luke Beasley
CVS Health Tech Blog
12 min read · Jan 5, 2024

By: Luke Beasley and Jason Dwyer

Imbalanced Classification in the real world

Imbalanced classification problems are incredibly common in the real world. Identifying fraudulent transactions, predicting who will be the highest-spending customers, and predicting the likelihood of workplace accidents are all examples of imbalanced classification problems with significant real-world consequences. If a manufacturer could effectively identify the likelihood of a workplace accident, they could proactively fix the issue before the accident occurs.

More specific to our work at CVS Health, healthcare data is notoriously skewed, given that the costliest 1% of patients account for 20% of all health care spending in the United States, leading to classification problems with severe imbalance.

Approaches to handling class imbalance

Many data scientists have a general understanding of the techniques used to address imbalanced classification, which mostly fall into data-driven and model-driven approaches. Data-driven approaches typically resample the training data to address the imbalance (minority over-sampling, majority under-sampling, or synthetic data generation algorithms such as SMOTE or ADASYN), while model-driven approaches adjust the loss function to over-weight or under-weight records or, as with focal loss, guide the model to focus on records that are more difficult to classify.
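As a quick illustration (not part of the original analysis), a model-driven adjustment in sklearn might look like the sketch below, assuming a feature matrix X_train and label vector y_train like the ones built later in this post:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Option 1: re-weight classes inside the loss function via class_weight
weighted_lr = LogisticRegression(class_weight='balanced', max_iter=1000)

# Option 2: pass per-record weights at fit time
# (useful for estimators, like GradientBoostingClassifier, that lack a class_weight argument)
sample_weights = compute_sample_weight(class_weight='balanced', y=y_train)
weighted_gbm = GradientBoostingClassifier(random_state=1).fit(X_train, y_train, sample_weight=sample_weights)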

While these approaches can be useful for squeezing better performance out of classifiers, they don’t define the lens through which these models should be evaluated and compared to each other.

For a more in-depth treatment of approaches to handling class imbalance, see:
Survey on high-class imbalance

How to evaluate models with extreme imbalance

Data scientists have a plethora of metrics available for evaluating classification models. Metrics like accuracy, precision, recall, F-scores, AUROC, and AUCPR can all be used under typical circumstances. The crux of the issue is that most of these break down, or become wholly misleading, when working with extreme class imbalance. Most tutorials and resources adequately cover the problems with using accuracy or AUROC as an evaluation metric, and package developments such as balanced accuracy in sklearn have helped. What often gets missed, and what we usually care most about in these cases, is how the model performs within the highest strata of probability.

Often, classification problems are so imbalanced that the resulting model has precision and recall well below what most data scientists would consider an effective model. Typically, these metrics alone don’t tell the full story. Yet by framing the problem correctly up front, many data scientists will find that their model does indeed get decent or perhaps even downright solid performance at the highest strata of probability. In the following guide, we propose using enrichment tables, gain charts, and lift charts to better understand model performance at the highest probability strata. Not only do these methods provide more granularity to model performance, but they also tie directly into real-world implications and interventions.

We developed an open-source package, aequilibrium, that implements many of the functions we will use throughout the article. Please check out this notebook for a step-by-step guide through the example.

Example Problem:

We are going to use a publicly available Kaggle dataset on stroke prediction (link) to illustrate the contrast between traditional model metrics and the three tools proposed here: enrichment tables, gain charts, and lift charts.

Before we begin, let’s ensure that we have the file in the correct place. First, let’s create a new directory for the project. If you want to do this from command line, you can do so with the following code snippets:

mkdir imbalanced_classification
cd imbalanced_classification

If you have never used Kaggle before, you will be required to create a free account before downloading the data. Once the data is downloaded, save the data in the newly created imbalanced_classification folder under the name ‘example_stroke_dataset.csv’.

After downloading the dataset at the link above, open a Jupyter Notebook in the imbalanced_classification folder and load the data into your notebook.
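A minimal load step might look like the following, assuming the file name chosen above:

import pandas as pd

# Read the downloaded Kaggle file from the imbalanced_classification folder
stroke_df = pd.read_csv('example_stroke_dataset.csv')
stroke_df.head()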

Kaggle Stroke Prediction Dataset

Next, we want to check the balance of this data.
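One quick way to do this, using the dataset's 'stroke' target column:

# Share of records in each class of the target column
stroke_df['stroke'].value_counts(normalize=True)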

Our dataset has a positive class rate of 1.804%. Note that we often see class rates on a much smaller magnitude in healthcare data. In our experience, it is possible to build effective models on problems with a positive class rate as low as 0.01%, but ultimately this depends on the specific use case.

Now, we will build a basic machine learning model. Note that building the "best" model architecture is outside the scope of this post; we simply want to display the tools we use to evaluate model performance. We'll be using the sklearn package for pre-processing, modeling, and common model metrics.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

# Preprocessing: one-hot encode categorical features and drop rows with missing values
stroke_df_hot = pd.get_dummies(stroke_df)
stroke_df_hot = stroke_df_hot.dropna()
X, y = stroke_df_hot.drop(['id', 'stroke'], axis=1), stroke_df_hot['stroke']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Build the model
model = GradientBoostingClassifier(random_state=1)
model.fit(X_train, y_train)

# Get hard predictions and predicted probabilities for the positive class
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)[:, 1]

# Combine the test features, true labels, predictions, and probabilities into one frame
ans = pd.concat([X_test, y_test], axis=1)
ans['y_pred'] = predictions
ans['y_proba'] = probabilities

Standard Metrics

Let’s check the performance of this model using traditional metrics of accuracy, precision, recall, and area-under-ROC curve.

Many commonly used metrics, such as precision, recall, and accuracy, depend on a model threshold. By default, most packages set the threshold at the point with the maximum F1 score or simply at 0.5. The user has full freedom to adjust the threshold and, in turn, see different results for precision, recall, and accuracy. However, deciding how to set the threshold is somewhat arbitrary, and we are still left with metrics that do not tell the full story of model performance. For now, let's use the default threshold of 0.5 from sklearn.
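As a quick sketch, these metrics can be computed with sklearn.metrics using the variables defined in the modeling code above (the zero_division argument simply silences the warning when there are no positive predictions):

from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Threshold-dependent metrics use the hard labels from model.predict() (0.5 threshold);
# AUROC is computed from the predicted probabilities
print("Accuracy :", accuracy_score(y_test, predictions))
print("Precision:", precision_score(y_test, predictions, zero_division=0))
print("Recall   :", recall_score(y_test, predictions, zero_division=0))
print("AUROC    :", roc_auc_score(y_test, probabilities))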

Hopefully, these results reinforce how misleading both accuracy and AUROC can be as metrics for imbalanced classification. Our model has an accuracy of 98.5%, which seems like incredible performance. However, something is clearly wrong: both precision and recall are equal to 0, meaning the model is not correctly identifying a single positive case. Most likely, the model is predicting everything to be negative.

Okay, so clearly this model needs some work. Let's run through two popular ways to handle class imbalance: "vanilla" minority over-sampling and SMOTE (algorithm explanations are outside the scope of this post). We'll be using the imblearn package for these sampling algorithms. Note that the model metrics reported throughout this post are on the test set, which has not been artificially balanced.

Imbalanced Modeling approaches:

Vanilla-oversampling with the following function:

from imblearn.over_sampling import RandomOverSampler

def randomOversampling(data, response, ratio=None):
    # Randomly duplicate minority-class records; the default 'auto' balances classes 1:1
    X = data.drop(response, axis=1)
    Y = data[response]
    sampler = RandomOverSampler(random_state=42, sampling_strategy=ratio if ratio is not None else 'auto')
    X_os, Y_os = sampler.fit_resample(X, Y)
    return pd.concat([X_os, Y_os], axis=1)

SMOTE with the following function:

from imblearn.over_sampling import SMOTE

def SMOTEoversampling(data, response):
    # Generate synthetic minority-class records with SMOTE until classes are balanced
    X = data.drop(response, axis=1)
    Y = data[response]
    smote = SMOTE(random_state=42)
    X_smote, Y_smote = smote.fit_resample(X, Y)
    return pd.concat([X_smote, Y_smote], axis=1)
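For reference, here is a sketch of one way these functions might be applied (the original notebook may differ): rebuild a training frame from X_train and y_train, resample it, refit the same model architecture, and evaluate on the untouched test set.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Recombine the training split, resample it, then refit on the balanced data
train_df = pd.concat([X_train, y_train], axis=1)

for name, resample_fn in [('vanilla over-sampling', randomOversampling),
                          ('SMOTE', SMOTEoversampling)]:
    balanced = resample_fn(train_df, 'stroke')
    X_bal, y_bal = balanced.drop('stroke', axis=1), balanced['stroke']

    resampled_model = GradientBoostingClassifier(random_state=1)
    resampled_model.fit(X_bal, y_bal)

    # Evaluate on the untouched (still imbalanced) test set
    test_proba = resampled_model.predict_proba(X_test)[:, 1]
    print(f"{name} AUROC: {roc_auc_score(y_test, test_proba):.3f}")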

The standard results clearly show that both algorithms do a better job than the original model. However, we are left with some questions. How are we going to determine which model is better? How does this link to the real-world use case for which we intend to use this model?

Traditional metrics limit our ability to answer these questions, so let’s get to the punch line.

Gain/Lift Table

Gain/lift tables are an intuitive concept once a reader is familiar with them. At a high level, model lift scores represent how much more likely a group is to be in the positive class compared to a uniformly random selection.

To calculate this number, the first step is to split the predictions into "risk strata". For example, let's take the top 1% of model output scores. Say our test set had 20,000 individuals (we will use a rounded example to keep the arithmetic clean). Then the top 1% would include the highest 20,000 * 0.01 = 200 model output probability scores. The next step is to simply count how many of these individuals are in the positive class. Let's say that of the top 200 model output scores, 38 individuals are in the positive class, so 19% of this group had an event. To get model lift, divide 19% by the event rate we would expect from a random selection, which is simply the balance of the dataset (1.804%), giving a model lift of roughly 10.5. Said another way, the people in the top 1% of risk are about 10.5 times more likely to experience a stroke event than a same-size population chosen at random. That sounds pretty good! Our model, which we might have disregarded as useless upon initial inspection of the precision and recall scores, could have substantial utility.
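To make the calculation concrete, here is a minimal sketch of computing lift for the top 1% of scores directly from the ans frame built earlier (column names follow that frame):

# Lift at the top 1% of predicted risk
top_frac = 0.01
n_top = int(round(len(ans) * top_frac))
top = ans.sort_values('y_proba', ascending=False).head(n_top)

event_rate_top = top['stroke'].mean()   # event rate within the top 1% of scores
event_rate_all = ans['stroke'].mean()   # baseline event rate (the class balance)
lift = event_rate_top / event_rate_all
print(f"Top {top_frac:.0%}: n={n_top}, event rate={event_rate_top:.2%}, lift={lift:.2f}")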

Now let’s compare each model on a gain/lift table (note that the full output is not included here due to space constraints). This is where the aequilibrium package is particularly useful.

To install this package run the following command:

pip install aequilibrium

Now, we can simply run the following commands to get a full suite of outputs:

# Results and Visualize come from the aequilibrium package
# (see the linked example notebook for the exact import statements)
y_true = "stroke"
y_pred = "y_pred"
y_proba = "y_proba"
results_class = Results(ans[y_true], ans[y_pred], ans[y_proba])
results_viz = Visualize(results_class, num_decimals=3)
measure_df = results_viz.complete_evaluation(save_dir="my_plots")

Original

Vanilla-Oversampling

SMOTE

Notice that by displaying performance at distinct probability strata in each row, we can easily use the enrichment table to model real-world interventions. For example, assume that for a hypothetical marketing campaign we are constrained to $832 in marketing costs, with each outreach costing $2. We are therefore limited to reaching 416 individuals (832/2), and we will assume that the best way to spend this budget is to take the top 416 individuals by model risk score (equivalent to the top 3% risk stratum). Which model should we use? Gain/lift tables allow us to easily compare the models at the top 3% stratum. The original model's lift of 5.25 at the top 3% stratum outperforms both the SMOTE-oversampling model and the "vanilla over-sampling" model, which have lift scores of 3.94 and 5.09, respectively.

For an audience familiar with precision and recall scores, gain/lift tables also allow a more granular look at both metrics within each stratum. For example, in the top 3% of the original model enrichment table above, we see that this group includes 416 people, of which 32 are true positives (precision = 32/416 = 7.7%). The test set contains a total of 203 actual positives, so the top 3% accounts for 32/203 = 15.76% of the events (i.e., recall at this stratum).
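Spelled out with the counts from the table above:

# Precision and recall within the top 3% stratum, using counts from the enrichment table
true_positives_in_stratum = 32
stratum_size = 416
total_positives = 203

stratum_precision = true_positives_in_stratum / stratum_size    # 32/416 ≈ 7.7%
stratum_recall = true_positives_in_stratum / total_positives    # 32/203 ≈ 15.8%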

Let’s finish with a quick look at two effective ways to summarize and visualize Lift and Gain charts.

Lift Charts

A lift chart is a graphical representation of the lift scores in the gain/lift tables compared to a random selection. We plot the risk strata on the x-axis and the lift score on the y-axis. Note that if we were to classify members into these groups at random, we would expect to have a lift of 1 across all strata (represented by the red line below).

The lift charts were already output by the aequilibrium functions above.

Lift charts provide a variety of opportunities for interpretation. For example, all three models appear to perform relatively similarly once we move beyond the top 30% of risk scores, so there would not be much difference in performance if we planned to target more than 30% of the members. However, within the top 0–10% risk strata, the "vanilla over-sampling" model performs best, with the "original" (unbalanced) model a close second. Surprisingly, the SMOTE model performs worst of the three in the top 0–10% range.

Gain charts

While lift charts represent the lift scores in the gain/lift tables, gain charts represent the cumulative share of positive events captured. We plot the risk strata (% of the dataset) on the x-axis and the cumulative % of positive events on the y-axis. Note that if we were to place members into these groups at random, we would expect the percent of events to match the percent of the dataset: for example, if we grabbed 40% of our data at random, we would expect to capture about 40% of the positive events (represented by the red line). The gain charts were also produced by the aequilibrium call above, so let's compare.

Overall, gain chart representations make abundantly clear how much better any of our three models is compared to random selection. One way to interpret a gain chart is to translate "% of events" into model recall. Sticking with our marketing campaign example, say we wanted to reach out to the minimum number of individuals possible while still achieving a recall of at least 80%. We would then find the point on the x-axis (% of dataset) where the blue line crosses 80% of events. Doing this for all three models gives us an objective way to decide which model lets us reach out to the fewest individuals while obtaining 80% recall.
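As a sketch, the same 80%-recall cutoff can be read directly off the sorted scores (again using the ans frame from earlier):

# Smallest fraction of the score-sorted test set needed to reach at least 80% recall
sorted_ans = ans.sort_values('y_proba', ascending=False).reset_index(drop=True)
cumulative_recall = sorted_ans['stroke'].cumsum() / sorted_ans['stroke'].sum()
n_needed = int((cumulative_recall >= 0.80).idxmax()) + 1
print(f"Reach out to {n_needed} people ({n_needed / len(sorted_ans):.1%} of the test set) for 80% recall")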

Tying it all together

Through the right lens of gain/lift tables, lift charts, and gain charts, models that would be written off based on traditional metrics can turn out to have real utility at the highest strata of probability. We can take this a step further and demonstrate how to use this information directly to build up an intervention.

Let's imagine the Centers for Disease Control and Prevention (CDC) just published information that a new therapeutic can reduce the risk of stroke by 30%. We wish to proactively reach out via phone calls to educate people about this wonderful news and refer them to a clinic where they can get treated. Additionally, each call costs $2 of labor, and the marketing team has constrained us to spending $832 on this campaign.

We can use our gain/lift table we defined previously to “model” how this might play out. Let’s carve out the top 3% stratum from the original model above and mimic what an intervention might look like.

# Carve out the top 3% stratum row from the gain/lift table
intervene_df = measure_df[measure_df['percentile'].isin([0.03])][
    ['percentile', 'row_count', 'pos_count', 'perc_random_events', 'perc_actual_events', 'Model_Lift']
]
intervene_df

The total number of people in this stratum is 416 and they represent our model positives, the people we would pick up the phone and attempt to call. But not everyone will answer the phone and be interested in speaking with us. So, let’s say our “reach rate” is 30%. Let’s add these columns.

reach_rate = 0.30
cost_of_call = 2

# Everyone in the stratum gets a call attempt; only a fraction will pick up
intervene_df['TotalNum_Called'] = intervene_df['row_count']
intervene_df['TotalNum_Reached'] = round(intervene_df['row_count'] * reach_rate)
intervene_df['TotalCost'] = intervene_df['TotalNum_Called'] * cost_of_call
intervene_df

The “TotalNum_Reached” column displays this and suggests we would reach 125 people. Let’s simply assume for the sake of this article that 100% of the people we speak with agree to participate. Finally, let’s apply our efficacy of the intervention to estimate our impact. Let’s add these columns.

participation_rate = 1.00
stroke_reduction_rate = 0.3

# Apply the assumed participation rate, the stratum's event rate, and the therapeutic's efficacy
intervene_df['TotalNum_Participate'] = round(intervene_df['TotalNum_Reached'] * participation_rate)
intervene_df['TotalNum_Participate_wStroke'] = round(
    intervene_df['TotalNum_Participate'] * (intervene_df['pos_count'] / intervene_df['row_count'])
)
intervene_df['TotalStrokesReduced'] = round(intervene_df['TotalNum_Participate_wStroke'] * stroke_reduction_rate)
intervene_df

We find that this intervention at the top 3% of risk means we have the chance to prevent an estimated 3 strokes. Fantastic!

As a final note, while inspection of our highest risk strata often reveals whether our model is performing with an acceptable amount of lift, precision within those strata will typically still be low. Depending on the use case, that may or may not be okay. In our example of intervening on the top 3% of risk, most of the marketing budget will be spent on messages to patients who will likely never have a stroke. Each model needs to be evaluated through the lens of the intervention; in our view, spending part of the marketing budget on false positives is well worth helping the true positives prevent a stroke.

Conclusion

Hopefully, the usefulness of these approaches for evaluating models with large class imbalance is now evident. In this post, we walked through a toy example using a Kaggle dataset, which illustrates how useful these metrics can be when making business decisions. Gain and lift charts provide the functionality and granularity needed to make business decisions that traditional metrics, such as precision, recall, and AUROC, cannot provide in scenarios of class imbalance. Finally, we introduced a package, aequilibrium, designed to handle all of these considerations.

References

Norbeck T. B. (2013). Drivers of health care costs. A Physicians Foundation white paper — second of a three-part series. Missouri medicine, 110(2), 113–118.

Leevy, J. L., Khoshgoftaar, T. M., Bauder, R. A., & Seliya, N. (2018). A survey on addressing high-class imbalance in big data. Journal of Big Data. https://journalofbigdata.springeropen.com/counter/pdf/10.1186/s40537-018-0151-6.pdf

Aequilibrium. GitHub. (n.d.). https://github.com/cvs-health/aequilibrium/blob/master/docs/Examples/imbalanced_classification_blog.ipynb

Cerebral Stroke Prediction. Kaggle. (n.d.). https://www.kaggle.com/datasets/shashwatwork/cerebral-stroke-predictionimbalaced-dataset

Scikit-learn. PyPI. (n.d.). https://pypi.org/project/scikit-learn/

Imbalanced-learn. PyPI. (n.d.). https://pypi.org/project/imbalanced-learn/

© 2023 CVS Health and/or one of its affiliates. All rights reserved.
