How to understand the most important graph in machine learning
A data scientist tracks a huge number of metrics as she builds a model, homing in on the best outcome. A skilled data scientist does not just build the best model; she also calibrates it to the situation at hand. A classic example is a medical test designed as a cheap screener: it accepts some false positives as the price of missing no true positives. Every model is a unique blend of situation, available data, and final implementation.
Given the huge number of tools available to a data scientist, it is impossible to boil all of those measures down into a single visualization a non-data scientist can follow. The diligent project manager is thus faced with a problem: how does she understand the model she will be responsible for implementing? Without a background in statistics, this can be a daunting task for the PM to grasp or the DS to explain.
However, the implementation of a machine learning model can be understood with a more consistent framework than the building of one. By looking at how a machine learning model is implemented with a limited budget, we can understand the most important graph in machine learning.
Implementation Example
Let’s use the example of a direct mail campaign to understand how a machine learning model will be implemented. An organization wants to start a campaign that sends mailers, at 50¢ each, to its list of subscribers. Each additional mailer adds to the cost, so we want to limit our sends to the subscribers with the highest propensity to convert. After building our very best model, we would have a value between zero and one for every person on our list representing their individual probability of converting during our campaign.
We can then rank order these scores to see which subscribers have the highest propensity to convert. This still leaves us with a decision: who do we send to? How low on the list do we go? In an ideal world we can set a threshold that maximizes return, like below:
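As a rough sketch of that thresholding idea (independent of the chart; the scores, the per-conversion value, and every number below are hypothetical, not taken from any real model or campaign):

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: one propensity score per subscriber and an assumed
# average value per conversion. Real numbers would come from your own model
# and your own campaign economics.
rng = np.random.default_rng(0)
scores = pd.Series(rng.uniform(0, 1, size=10_000), name="propensity")
value_per_conversion = 5.00   # assumed revenue per converted subscriber
cost_per_mailer = 0.50        # the 50¢ mailer from the example

# Rank subscribers from highest to lowest propensity.
ranked = scores.sort_values(ascending=False).reset_index(drop=True)

# Expected cumulative profit if we mail the top k subscribers:
# expected conversion value minus mailing cost, summed down the list.
expected_profit = (ranked * value_per_conversion - cost_per_mailer).cumsum()

# The profit-maximizing send size is wherever cumulative profit peaks.
best_k = int(expected_profit.idxmax()) + 1
print(f"Mail the top {best_k} subscribers; "
      f"expected profit about ${expected_profit.max():,.2f}")
```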
But the real world is fuzzier. An organization might typically send to everyone on its list, so a limited send would face substantial pushback. Or the model might predict efficient results at a greater scale than the budget allows, and the project manager needs substantial proof to ask for more money. Either scenario leads to a situation where we have to explain why we know what we know to a group of non-technical people.
A data scientist cannot possibly communicate all the validation that goes into building a model, yet still needs to prove the model’s validity in the real world. We have A/B tests for this purpose, but how do we even get approval to run one?
Enter the Quantile Validation plot…
The Quantile Validation (QV) Plot
A QV plot is something every DS will build for themselves while developing a model. In our direct mail example we break our predicted propensities into quintiles (that is, a 5-quantile) and compare the average predicted propensity in each quintile to the actual average conversion rate in the data we used to build the model. If those match, then on average, the model is matching the data.
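A minimal sketch of how such a plot might be put together, assuming a DataFrame with a predicted propensity column and an observed 0/1 conversion outcome (the column names are illustrative and the data here is simulated purely to show the mechanics, not to reproduce the figures in this article):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative data: a predicted propensity and an actual 0/1 conversion
# outcome for each subscriber in the modeling data.
rng = np.random.default_rng(42)
predicted = rng.uniform(0, 1, size=50_000)
actual = rng.binomial(1, predicted)   # simulated outcomes that follow the scores

df = pd.DataFrame({"predicted": predicted, "converted": actual})

# Split the predictions into quintiles (a 5-quantile), lowest propensity first.
df["quintile"] = pd.qcut(df["predicted"], q=5, labels=[1, 2, 3, 4, 5])

# For each quintile, compare the average predicted propensity
# to the average actual conversion rate.
qv = df.groupby("quintile", observed=True).agg(
    predicted_rate=("predicted", "mean"),
    actual_rate=("converted", "mean"),
)

# Plot the two series side by side: if the bars track each other,
# the model is matching the data on average.
qv.plot(kind="bar")
plt.ylabel("Conversion rate")
plt.xlabel("Propensity quintile (1 = lowest)")
plt.title("Quantile Validation plot")
plt.tight_layout()
plt.show()
```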
The QV plot above tells the complete story of the model’s implementation. The model (in red) predicts that the bottom 20% of subscribers will convert at 9% on average; looking at the source data, we see a 10% actual conversion rate (in blue). Similarly, the 20% of subscribers with the highest propensity are predicted to convert at 88% and actually convert at about 90%. The model matches the data well at both the high and low ends.
If we were to implement this model, we could reliably tell what conversion rate to expect from every group. The model performs well for both easy and hard-to-reach targets; even in the middle we can accurately rank order quintiles 2, 3, and 4.
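One quick way to sanity-check that rank ordering is to confirm that actual conversion rises from the lowest to the highest propensity quintile. The rates below are made up to loosely mirror the example, not taken from the figure:

```python
import pandas as pd

# Hypothetical actual conversion rates by propensity quintile (1 = lowest).
actual_rate = pd.Series(
    [0.10, 0.28, 0.45, 0.67, 0.90],
    index=pd.Index([1, 2, 3, 4, 5], name="quintile"),
)

# A well rank-ordered model shows actual conversion increasing
# monotonically across quintiles.
print(f"Quintiles rank-order correctly: {actual_rate.is_monotonic_increasing}")
```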
Any PM would be happy to present that to an executive. That validation essentially means that we know a lot about our population. The real world might look more like the plot below:
In this scenario, we still have a very good guess as to what the highest and even the middle propensity subscribers will do. We can easily tell subscribers who are somewhat likely to convert from those who are not very likely. However, the model results at the low end do not match reality: we cannot reliably predict the conversion rate of low propensity subscribers, so we cannot accurately calculate the most efficient cutoff.
Any strategy to implement this model has to account for that inaccuracy at the low end. If we want to mail everyone but the least likely subscribers, we need a new plan. If our budget only allows us to mail a small audience, we will only mail the highest propensity people, and that plan would not need to change given this model weakness.
Finally, a bad QV plot is obvious even to the non-technical audience.
This model cannot tell good from bad. The subscribers this model scored as lowest propensity actually converted at a similar rate to those it scored as highest. If a DS shows you a QV plot that looks like this, raise the alarm.
The Quantile Validation plot is the easiest validation plot in the data science toolkit to understand. It is perfect for communicating with a non-technical audience, shining a light on what an implementation can and ultimately will look like. Make sure to ask for one as you work with new models.