Visualise the business value of predictive models

No Data Scientist is the Same — part 8

Jurriaan Nagelkerke
Published in Cmotions · Mar 4, 2022

This article is part of our series about how different types of data scientists build similar models differently. No human is the same and therefore also no data scientist is the same. And the circumstances under which a data challenge needs to be handled change constantly. For these reasons, different approaches can and will be used to complete the task at hand. In our series we will explore the four different approaches of our data scientists — Meta Oric, Aki Razzi, Andy Stand, and Eqaan Librium. They are presented with the task to build a model to predict whether employees of a company — STARDATAPEPS — will look for a new job or not. Based on their distinct profiles discussed in the first blog you can already imagine that their approaches will be quite different.

In this article we will discuss how to assess the business value of a predictive model, like the ones developed in our series. Especially Andy Stand recognizes the importance of being able to explain his models in business terms, so that the business understands how to turn the model into value.

Andy Stand: ‘Understand is what we do’

Remember Andy Stand from our intro blog? Andy does not believe that black-box models are the future. He often finds himself in a meeting where he needs to explain to his colleagues how his model works and how the predictions are generated. He is happy to sacrifice a bit of accuracy in order to achieve the most clear and understandable solution. He makes sure that his planning allows for extensive feature engineering. As a result, all of his features can be understood and explained to others. Simple regressions and decision trees are the most utilised tools from his toolbox.

Why Andy thinks ROC curves are a bad idea to explain your model to business people

‘…And as we can see clearly on this ROC plot, the sensitivity of the model at the value of 0.2 on one minus the specificity is quite high! Right?…’.

Andy has a strong opinion on using the ROC plot to explain a model to non-technical colleagues: ‘If your fellow business colleagues didn’t already wander away during your presentation about your fantastic predictive model, it will definitely push them over the edge when you start talking like this. Why? Because the ROC curve is not easy to explain quickly and also difficult to translate into answers to the business questions your colleagues have. They are mainly focused on questions like: Does this model enable us to better reach our target audience? How much better are we doing with this model compared to what we do now? What will the expected response to our campaign be? And these business questions were the reason you’ve built a model in the first place!’

During our model building efforts, we should already be focused on verifying how well the model performs. Often, we do so by training the model on a subset of records and testing its performance on a holdout set. We look at performance measures like the ROC curve and the AUC value to decide on the best model version for the task at hand. These plots and statistics are very helpful during model building and optimization to check whether your model is under- or overfitting and which set of parameters performs best on test data. However, they are not that valuable in explaining the business value of the model to your business stakeholders.
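As a reminder of what that holdout evaluation looks like in practice, here is a minimal sketch on toy data with an illustrative model; it is not one of the pipelines used later in this article:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# toy data and an illustrative model, just to show the mechanics
X, y = make_classification(n_samples=2000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# AUC on the holdout set: useful while building, less so in the boardroom
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))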

One reason that the ROC curve is not that useful in explaining the business value of your model is that it’s quite hard to explain the interpretation of ‘area under the curve’, ‘specificity’ or ‘sensitivity’ to business people. Another important reason that these statistics and plots are useless in your business meetings is that they don’t help in determining how to apply your predictive model: What percentage of records should we select based on the model? Should we select only the best 10% of cases? Or should we stop at 30%? Or go on until we have selected 70%?… This is something you want to decide together with your business colleagues, to best match the business plans and campaign targets they have to meet. The four plots we are about to introduce (cumulative gains, cumulative lift, response and cumulative response) are in our view the best ones for that purpose.

Four plots Andy uses to explain predictive models to business people

Andy strongly advocates four different plots to assess a predictive model’s business value:

  • Cumulative gains plot
  • Cumulative lift plot
  • Response plot
  • Cumulative response plot

Although each plot sheds light on the business value of your model from a different angle, they all use the same data:

  • Predicted probability for the target class
  • Equally sized groups based on this predicted probability
  • Actual number of target class observations in these groups

It’s common practice to split the dataset with predictions into 10 equally large groups, called deciles. Observations that belong to the top 10% with the highest model probability are in decile 1, the next 10% are in decile 2, and so on, until the 10% of observations with the lowest model probability for the target class, which form decile 10.
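As a quick sketch of how such deciles can be derived — toy data and column names are illustrative, not how any particular package does it internally:

import numpy as np
import pandas as pd

# toy example: predicted probabilities and an outcome loosely related to them
rng = np.random.default_rng(0)
prob = rng.random(1000)
scores = pd.DataFrame({
    "prob_target": prob,
    "target": (rng.random(1000) < prob).astype(int),
})

# decile 1 = the 10% of observations with the highest predicted probability,
# decile 10 = the 10% with the lowest predicted probability
scores["decile"] = 10 - pd.qcut(scores["prob_target"].rank(method="first"),
                                q=10, labels=False)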

Each of our four plots places the deciles on the x axis and another measure on the y axis. The deciles are plotted from left to right so the observations with the highest model probability are on the left side of the plot. This results in plots like this:

Now that it’s clear what is on the horizontal axis of each of the plots, we can go into more detail on the metric each plot shows on the vertical axis. For each plot, we’ll start with a brief explanation of the insight you gain from a business perspective. After that, we’ll apply it to our predictive models that predict which employees are most likely to leave the company.

modelplotpy: a brief introduction

There are many packages and pieces of code to be found on the web to create the plots we are about to discuss. Andy strongly prefers the modelplotpy package, since it contains all the plots he prefers and enables him to annotate them to explain the model to business colleagues. modelplotpy can be used to compare the performance of different models (e.g. Meta’s or Aki’s XGB model), on different datasets (e.g. train vs test), and for different target classes (e.g. 0 or 1). After setting the scope of your analysis, you can create and annotate the different plots discussed in the next sections. It can be installed as follows:

pip install git+https://github.com/modelplot/modelplotpy.git

Load data and models to compare

Now we can load the data we prepared earlier and Meta’s and Aki’s previously built and saved models.

# note: xgboost must be installed to unpickle the saved pipelines (pip install xgboost)
import joblib
import pandas as pd
import xgboost as xgb
from urllib.request import urlopen
from sklearn.model_selection import train_test_split

# load the data we prepared earlier in this series
df_prep = pd.read_csv('https://bhciaaablob.blob.core.windows.net/featurenegineeringfiles/df_prepared.csv')

# prepare data for training & evaluation
df = df_prep.drop(columns=['Unnamed: 0', 'city', 'experience', 'enrollee_id'])

# define the target vector y and the feature set X (without the target)
y = df['target']
X = df.drop('target', axis=1)

# split X and y into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1121218)

# column labels of the categorical and the numeric features
categorical_features = X.select_dtypes(exclude="number").columns.tolist()
numeric_features = X.select_dtypes(include="number").columns.tolist()

# load Meta's and Aki's previously built and saved model pipelines
xgb_meta = joblib.load(urlopen('https://bhciaaablob.blob.core.windows.net/featurenegineeringfiles/pipe_meta.joblib'))
xgb_aki = joblib.load(urlopen('https://bhciaaablob.blob.core.windows.net/featurenegineeringfiles/best_pipe_aki.joblib'))

1. Cumulative gains plot

The cumulative gains plot — often named ‘gains plot’ — helps you answer the question:

When we apply the model and select the best X deciles, what % of the actual target class observations can we expect to target?

Hence, the cumulative gains plot visualises the percentage of the target class members you have selected if you decide to select up to and including decile X. This is a very important business question, because in most cases you want to use a predictive model to target a subset of observations (customers, prospects, cases, …) instead of targeting all cases. And since we won’t build perfect models all the time, we will miss some potential. That’s perfectly fine, because if we are not willing to accept that, we should not use a model in the first place. Or we should build a perfect model that scores all actual target class members with a 100% probability and all other cases with a 0% probability. However, if you’re such a wizard, you don’t need these plots anyway, or you should have a careful look at your model: maybe you’re cheating?

So, we’ll have to accept we won’t be able to target perfectly. What percentage of the actual target class members you do select with your model at a given decile, that’s what the cumulative gains plot tells you. The plot often comes with two reference lines to tell you how good or bad your model is doing: the random model line and the wizard model line. The random model line tells you what proportion of the actual target class you would expect to select when no model is used at all. This diagonal line runs from the origin (with 0% of the cases, you have 0% of the actual target class members) to the upper right corner (with 100% of the cases, you have 100% of the target class members). It’s the rock bottom of how your model can perform; if your model is close to this line, it is not much better than a coin flip. The wizard model line is the upper bound of what your model can do. It starts in the origin and rises as steeply as possible towards 100%. If less than 10% of all cases belong to the target class, the wizard line shoots up from the origin to 100% cumulative gains at decile 1 and stays there for all other deciles, since the measure is cumulative. Your model will always lie between these two reference lines (closer to the wizard is better) and looks like this:
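For intuition, this is roughly how the cumulative gains line can be computed by hand, continuing the toy scores dataframe with deciles from the sketch above:

# target class members per decile, ordered from decile 1 (best) to decile 10
per_decile = scores.groupby("decile")["target"].sum().sort_index()

# cumulative gains: the cumulative share of all target class members captured
# when selecting every decile up to and including decile X
cum_gains = per_decile.cumsum() / scores["target"].sum()
print(cum_gains.loc[3])  # share of all target class members captured in the top 30%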

Now let’s use the cumulative gains plot to see how well Meta and Aki did in predicting which employees are likely to leave the company, and how many potential ‘leavers’ are missed when, say, a campaign to keep colleagues on board is initiated for the 30% of colleagues with the highest probability to leave. By targeting only the top 30%, the investments in this retention campaign are minimized, while hopefully most of the expected leavers are still targeted. Whether 30% is actually the right threshold can be evaluated with the plots we’ll discuss.

We first need to create a plotting scope object, in which we specify the data and models we want to use and the scope we are interested in: comparing the performance of both models on the test dataset.

import modelplotpy as mp

obj = mp.modelplotpy(feature_data=[X_train, X_test],
                     label_data=[y_train, y_test],
                     dataset_labels=['train data', 'test data'],
                     models=[xgb_meta, xgb_aki],
                     model_labels=["Meta's XGB", "Aki's XGB"])

# transform the scored data into aggregated data (scores and deciles) for the chosen plotting scope
ps = obj.plotting_scope(scope='compare_models',
                        select_dataset_label=['test data'],
                        select_targetclass=[1])

Now we can plot the cumulative gains chart to compare how many of all employees that are about to leave end up in the selection, if we only select the top 30% based on each model:

# plot the cumulative gains plot and annotate the plot at decile = 3
_ = mp.plot_cumgains(ps, highlight_ntile = 3)

This plot (as well as the annotation) shows that with either Meta’s or Aki’s XGB model, selecting only 30% of all employees for a possible employee retention program captures about two thirds of all employees that are likely to leave. We also see that Aki’s parameter tuning resulted in a slightly better model: using her model would target slightly more of the actually leaving employees.

2. Cumulative lift plot

The cumulative lift plot, often referred to as lift plot or index plot, helps you answer the question:

When we apply the model and select the best X deciles, how many times better is that than using no model at all?

The lift plot helps you in explaining how much better selections based on your model are compared to taking random selections instead. Especially when models are not yet used within a certain organisation or domain, this really helps the business to understand what selecting targets based on your model can do for them.

The lift plot only has one reference line: the ‘random model’. With a random model we mean that each observation gets a random number and all cases are divided into deciles based on these random numbers. The % of actual target class observations in each decile would then be equal to the overall % of actual target class observations in the total set. Since the lift is calculated as the ratio of these two numbers, we get a horizontal line at the value of 1. Your model should be able to do better, resulting in a high ratio for decile 1. How high the lift can get depends on the quality of your model, but also on the % of target class observations in the data: if 50% of your data belongs to the target class of interest, a perfect model would only do twice as well (lift: 2) as a random selection. With a smaller target class, say 10%, the model can potentially be 10 times better (lift: 10) than a random selection. Therefore, we cannot give a general guideline for a ‘good’ lift. Since the plot is cumulative, at decile 10 we have selected the whole set again, so the cumulative lift always ends at a value of 1. It looks like this:
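Again for intuition, a minimal sketch of the cumulative lift, continuing the same toy scores dataframe from the decile sketch:

# observations and target class members per decile, from decile 1 to 10
counts = scores.groupby("decile")["target"].agg(["sum", "count"]).sort_index()

# cumulative lift: the cumulative response rate of the selection up to decile X,
# divided by the overall response rate in the full set
cum_lift = (counts["sum"].cumsum() / counts["count"].cumsum()) / scores["target"].mean()
print(cum_lift.loc[3])  # how many times better than a random 30% selection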

This is what the lift plot looks like for Meta’s and Aki’s models:

# plot the cumulative lift plot and annotate the plot at decile = 3
_ = mp.plot_cumlift(ps, highlight_ntile = 3)

Hence, both models more than double the quality of the selection, when compared to randomly selecting 30% of all employees for the employee retention program.

3. Response plot

One of the easiest evaluation plots to explain is the response plot. It simply plots the percentage of target class observations per decile. It can be used to answer the following business question:

When we apply the model and select decile X, what is the expected % of target class observations in that decile?

The plot has one reference line: The % of target class cases in the total set. It looks like this:

A good model starts with a high response value in the first decile(s) and drops quickly towards 0 for later deciles. This indicates good differentiation between target class members (who get high model scores) and all other cases. An interesting point in the plot is the location where your model’s line crosses the random model line. From that decile onwards, the % of target class cases is lower than what a random selection of cases would hold.
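In terms of the same toy scores dataframe, the response per decile is simply the mean of the target within each decile:

# response per decile: the share of actual target class members within each decile
response = scores.groupby("decile")["target"].mean().sort_index()
overall = scores["target"].mean()  # the reference line: % of target class in the total set
print(response.loc[3], overall)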

# plot the response plot and annotate the plot at decile = 3
_ = mp.plot_response(ps, highlight_ntile = 3)

With the response plot we get per-decile insight into the quality of the model. For Aki’s model, more than 50% of the observations in each of the first three deciles are actual leavers. For Meta’s model, the % of actual leavers is above 50% in the first two deciles, but just below that in decile 3. This insight might help in deciding which model to choose and whether or not to include decile 3 in the selection based on this model.

4. Cumulative response plot

Finally, one of the most used plots: the cumulative response plot. It answers the question burning on every business rep’s lips:

When we apply the model and select up until decile X, what is the expected % of target class observations in the selection?

The reference line in this plot is the same as in the response plot: the % of target class cases in the total set.

Whereas the response plot crosses the reference line, the cumulative response plot never does; it only ends up at the same point at decile 10, since selecting all cases up until decile 10 is the same as selecting all cases. This plot is most often used to decide, together with business colleagues, up until what decile to select for a campaign.
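And in the same toy sketch, the cumulative response is the share of target class members in the whole selection up to decile X:

# cumulative response: share of target class members in the selection
# when selecting every decile up to and including decile X
counts = scores.groupby("decile")["target"].agg(["sum", "count"]).sort_index()
cum_response = counts["sum"].cumsum() / counts["count"].cumsum()
print(cum_response.loc[3])  # expected response when selecting the top 3 deciles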

# plot the cumulative response plot and annotate the plot at decile = 3
_ = mp.plot_cumresponse(ps, highlight_ntile = 3)

So, by selecting up until the third decile, the expected % of selected employees that is planning to leave is 55% for Meta’s model and 57% for Aki’s model. Knowing the expected percentage of employees in the selection that is likely to leave can really help business colleagues decide on the investments they are willing to make in the employee retention program. This is why Andy is so much in favor of this and the previous plots. Together they tell the story of what the value of the predictive model is and how to use it to achieve business goals.

With these plots, Andy is able to talk with business colleagues about the actual value of his predictive models, without having to bore them with technicalities and nitty-gritty details. He is able to translate his model into business terms and visualize it to simplify interpretation and communication. Hopefully, these plots also help many of you in discussing how to take optimal advantage of your predictive model building efforts.
