A tricky part of Applied Data Science: SELECTION BIAS

Classification Models in Commercial Use

Fernando Barbosa
CodeX
5 min read · Jun 16, 2022


There are many classification models in commercial use: customer retention, debt collection, cross-selling, up-selling… you name it!

Suppose your next model is ready and set to go. You are eager to put it into production as you hold validation and deployment meetings. So far, so good.


At this point, stakeholders may be genuinely impressed by the algorithm with the tongue-twisting name you just pronounced, and expectations may have risen a bit as well. Sort of a buzz effect, and a buzzkill in disguise.

The Buzz goes to Operations Town

Word spreads to the business units that will use the predictions to make decisions and take action:

  • For instance, the top 20% of customers most likely to churn could receive a 💘 leave-me-not special offer (see the sketch after this list).
  • Another example is a selection of potential customers who need that extra nudge to sway their opinions and close the deal 🤝.
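
To make the first bullet concrete, here is a minimal sketch of that kind of top-20% selection. The names (customers, churn_model) and the scikit-learn-style predict_proba call are assumptions about your stack, not a prescription:

```python
# Minimal sketch: score everyone, keep the top 20% by churn probability.
# `customers` and `churn_model` are hypothetical placeholders.
import pandas as pd

def select_retention_targets(customers: pd.DataFrame, churn_model,
                             top_frac: float = 0.20) -> pd.DataFrame:
    scored = customers.copy()
    # scikit-learn convention assumed: column 1 of predict_proba is P(churn)
    scored["churn_score"] = churn_model.predict_proba(customers)[:, 1]
    cutoff = scored["churn_score"].quantile(1 - top_frac)
    return scored[scored["churn_score"] >= cutoff]

# targets = select_retention_targets(customers, churn_model)
# Only the targets receive the offer, and later only their outcomes get recorded.
```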

Now, this is a critical moment. This very selection is likely to come back to haunt you in the future. And that is the tricky part!

By now, after a round of meetings, more fuel has been added to that expectation fire: deploying these predictions could boost the company’s performance.

The causality assumption

If no smart deployment experiment is prepared, then a plain green light for production may be the way things go 😕. Fast-forward to production: predictions start being used, and it may feel like a job well done.

After a few production rounds and some ex-post data, clients see their business metrics improve more than ever. The twist is that this improvement is assumed to be thanks to the one model that was just delivered.

That assumption bears at least two important components: one is the author’s bias towards their own work, for better or worse. The other is the counterfactual: what would the business metrics be with a different model, or even no model at all?

In other words, the answer to “why did such business metrics improve?” is often taken to be: “because there is a better model in production”.

This calls for some causal discipline. Hopefully without harm, this can be treated as a situation that requires an experimental design and a careful, causality-aware assessment of the inference.

Keep in mind, though: “Careful doesn’t imply it is endless!”.
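
In that careful-but-finite spirit, one common design, sketched below with assumed names and a 10% control size, is to hold back a random control group from the model-selected customers and compare outcomes, so the improvement can be measured rather than assumed:

```python
# Minimal sketch of a deployment experiment: keep a random control group out of
# the action so the lift can be measured, not assumed. Names, the 10% control
# size, and the seed are illustrative assumptions.
import numpy as np
import pandas as pd

def split_treatment_control(targets: pd.DataFrame, control_frac: float = 0.10,
                            seed: int = 42):
    rng = np.random.default_rng(seed)
    is_control = rng.random(len(targets)) < control_frac
    treated = targets[~is_control]   # these customers receive the offer
    control = targets[is_control]    # these do not, and serve as the counterfactual
    return treated, control

def estimate_lift(treated_outcomes: pd.Series, control_outcomes: pd.Series) -> float:
    # Difference in outcome rates: a rough estimate of the causal effect of acting.
    return treated_outcomes.mean() - control_outcomes.mean()
```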

Look out for drifts!

How can an operation rely on high-frequency decision-making machines, or Machine Learning powered applications, to help it?

Many challenges lie ahead in building low-cost, maintainable software that increases confidence that an experiment-driven strategy will perform well on 💡different (new) data.

💡 Different here means data other than what was used to train your model.

https://figshare.com/articles/presentation/Barriers_to_reproducible_research_and_how_to_overcome_them_/5634136

Now, what the heck is drift? Well, the first part of Valerio Maggio’s talk is largely dedicated to explaining it. He also references Kirstie Whitaker’s work, the source of the chart linked above.
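
As a simple, hands-on illustration of checking for drift (not Maggio’s method, just a common first pass), you can compare each numeric feature’s distribution in the training data against fresh production data; the column names and the 0.05 threshold below are assumptions:

```python
# Minimal drift check: two-sample Kolmogorov-Smirnov test per numeric feature,
# comparing the training sample against recent production data.
import pandas as pd
from scipy.stats import ks_2samp

def flag_drifting_features(train: pd.DataFrame, production: pd.DataFrame,
                           alpha: float = 0.05) -> list[str]:
    drifting = []
    for col in train.columns.intersection(production.columns):
        stat, p_value = ks_2samp(train[col].dropna(), production[col].dropna())
        if p_value < alpha:  # distributions look different: possible drift
            drifting.append(col)
    return drifting

# drifted = flag_drifting_features(train_df, production_df)
```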

Once you’ve found a successful method for something, you’ll surely want to replicate and scale it if you’re in the business sector.

This tends to be a huge challenge, since a lot of what you left out of your training and validation samples will most likely be unknown to you in the future. How so?

You don’t know what you don’t know

Well, you do know the result for the fish that took the bait. You do not know, however, the result for the fish that did not.

💡 The realized sale, the recovered money, the converted customer, and so on. These are known to you because you actually recorded them. But what about the sale that never took place? The recovery that never happened, or the unrealized conversion?

For example: you predict a customer is very likely to purchase a pair of shoes at an online shoe shop. You label them a potential shopper. Now, it turns out they left without even clicking on your fancy ‘add to cart’ or ‘view more’ buttons.

👉 The catch: did they buy it elsewhere? Maybe they never actually bought it.

To make matters worse, some processes require a sequence of selections, creating a chain of ever-growing unknown outcomes. For now, let’s concentrate on a single selection. That is, for a given population, you get a sample of ‘people’ according to the rule proposed by your own algorithm.

All this to make one point: when you deliver a new model that makes decisions so as to select part of the input population, you are, by the very definition of that selection, likely to be introducing bias into the data you will later use to confirm predictions.

If all went as planned, you do not have data that is representative of the ‘population’ you’d wish to make predictions about.

💡 Next time you feel like jumping straight to model.fit (the popular method name used to train models), make sure you understand, and find ways to tackle, the bias in the selective process your data has been through.
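
One common mitigation, sketched below with assumed names and fractions, is to always act on a small, purely random slice of the population alongside the model-driven selection, so you keep a representative sample whose outcomes you can actually observe and later train on:

```python
# Minimal sketch: model-driven selection plus a small random exploration slice.
# The `churn_score` column, the 20% / 2% fractions and the seed are assumptions.
import numpy as np
import pandas as pd

def select_with_random_slice(scored: pd.DataFrame, top_frac: float = 0.20,
                             random_frac: float = 0.02, seed: int = 7) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    cutoff = scored["churn_score"].quantile(1 - top_frac)
    by_model = scored["churn_score"] >= cutoff
    by_chance = pd.Series(rng.random(len(scored)) < random_frac, index=scored.index)
    out = scored.loc[by_model | by_chance].copy()
    # Keep track of why each row was selected; the random rows stay (nearly) unbiased.
    out["selected_by"] = np.where(by_model.loc[out.index], "model", "random")
    return out
```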

Otherwise, your next model won’t do as well and it can be very difficult to explain why 😔.

Remember those nice metrics your boss got really impressed by? Some of them could be AUC (ROC), precision, recall and so forth. They reflect only the portion of the population for which you can actually confirm the true outcome, and are thus subject to the same biased selection process.
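
A toy simulation (synthetic numbers, purely illustrative) shows the effect: the AUC you can compute later, on the selected subset whose outcomes you actually observe, is not the AUC on the population you care about:

```python
# Toy simulation: metrics on the selected (confirmed-outcome) subset versus the
# full population. All numbers are synthetic and purely illustrative.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 50_000
score = rng.random(n)                           # the model's predicted probability
outcome = rng.random(n) < (0.1 + 0.6 * score)   # true outcome, correlated with the score

selected = score >= np.quantile(score, 0.80)    # the top-20% selection from earlier

auc_population = roc_auc_score(outcome, score)                     # what you wish you knew
auc_selected = roc_auc_score(outcome[selected], score[selected])   # what your feedback data allows

print(f"AUC on the full population: {auc_population:.3f}")
print(f"AUC on the selected subset: {auc_selected:.3f}")
```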

In the sections below, I’ll share some resources that have been quite useful to me.

Let me know your thoughts; critique is welcome.

A note on Experimental Design:

Profit-oriented companies will hardly agree to fully randomized trials, so there is little point pushing for them in business endeavors. It just doesn’t stick.

One thing a well-prepared experimenter will have is a design that helps them identify and understand the factors that are likely to have caused the observed outcomes.

Resources

Bias and Ethics

This great article by Jaspreet explores the different types of bias, including discriminatory bias. It is a must-read.

Seed and randomization

Towards the end of this talk, Maggio shows one way to tackle bias by fixing seeds and using scikit-learn’s StratifiedKFold.
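
A minimal sketch of that idea, with placeholder data: fix the random_state so splits are reproducible, and stratify so every fold keeps the class balance. The synthetic X, y and the logistic regression below are assumptions for illustration only:

```python
# Minimal sketch: fixed seed + StratifiedKFold cross-validation with scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

SEED = 42
rng = np.random.default_rng(SEED)
X = rng.normal(size=(1_000, 5))                          # placeholder features
y = (X[:, 0] + rng.normal(size=1_000) > 0).astype(int)   # placeholder binary target

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
model = LogisticRegression(random_state=SEED)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"ROC AUC per fold: {scores.round(3)}  (mean {scores.mean():.3f})")
```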

Propensity score

This other method, presented by Lucas Bernardi, proposes training a model on an ‘equivalent’ target variable and removing the variables that stand out for that equivalent model.

In a way, it could reduce the performance metrics of your models, with the benefit of having less disparate metrics across the development, validation and production sets.
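
My reading of the idea, sketched below with assumed names (the details are in Bernardi’s material): fit an auxiliary model whose target is simply “was this row selected/observed?”, and treat the features that dominate it as candidates for removal, since they mostly encode the selection process rather than the outcome:

```python
# Rough sketch of one reading of the 'equivalent target' idea: an auxiliary model
# predicts selection itself, and its most important features are flagged.
# Names, the random forest, and the n_flag=2 cut are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def selection_driven_features(X: pd.DataFrame, was_selected: pd.Series,
                              n_flag: int = 2) -> list[str]:
    aux = RandomForestClassifier(n_estimators=200, random_state=0)
    aux.fit(X, was_selected)              # 'equivalent' target: selected or not
    importances = pd.Series(aux.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False).head(n_flag).index.tolist()

# to_drop = selection_driven_features(X_all, was_selected)
# Retraining the outcome model without these may cost a bit of raw performance,
# but development, validation and production metrics should drift apart less.
```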

One may argue that I’m mixing causality and bias here, and to some extent that is true. But I’d like to focus on the bias aspect first and leave causes for another day.

Thanks for reading!
