Confounding via experimenting

One of the fundamental problems with current ML techniques

Let’s assume you are working with a news company, and you want to figure out whether or not a new article is worth featuring on the front page (let’s call this article “A”). As a data scientist, you have three options: deliver A on the front page for everyone, deliver A on the front page only for some particular people, or never display A on the front page.

For the sake of simplicity, let’s just assume that the metric we use to judge article quality is how many people finished reading the article.

Let’s say we run an experiment and put A on the front page at random, 1,000 times. The numbers we get from this are:

  • 120 of the people that saw A on the front page finished reading it
  • 880 of the people that saw A on the front page didn’t read it

Yielding a read ratio of 12%.

That very same day, we featured a bunch of other articles on the front page, 10,000 times in total. The numbers we get are:

  • 1,000 of the people that saw another article on the front page finished reading it
  • 9,000 of the people that saw another article on the front page didn’t read it

Yielding a read ratio of 10%.

Thus, the new article is obviously superior: it’s 1.2 times better than other articles at getting people to read it, so we should feature it on the front page more often.
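The arithmetic behind this first comparison can be sketched in a few lines of Python. The helper name `read_ratio` is made up for illustration; the counts are the ones quoted above (120 finished reads out of 1,000 displays of A, and 1,000 out of 10,000 for the other articles).

```python
def read_ratio(finished: int, shown: int) -> float:
    """Fraction of front-page displays that ended in a finished read."""
    return finished / shown

ratio_a = read_ratio(120, 1_000)         # article A
ratio_other = read_ratio(1_000, 10_000)  # every other featured article

# How many times better A looks when we only compare the aggregates.
lift = ratio_a / ratio_other
print(f"A: {ratio_a:.1%}, others: {ratio_other:.1%}, lift: {lift:.2f}x")
```

This is the naive, aggregate-only view: it says nothing about who was shown which article.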

Repeat readers

But wait, one of the variables we could have previously collected from the users visiting the front page is which articles they’ve previously read.

Armed with this new variable, we can look for a more precise pattern of whom we should feature A on the front page for. Let’s say that we want to see whether people that have already read A are interested in reading it again.

Visitors that never read A:

  • 60 saw A on the front page and finished reading it
  • 740 saw A on the front page and didn’t read it

Yielding a read ratio of 7.5%.

  • 720 saw another article on the front page and finished reading it
  • 8,280 saw another article on the front page and didn’t read it

Yielding a read ratio of 8%.

Visitors that have already read A:

  • 40 saw A on the front page and finished reading it
  • 160 saw A on the front page and didn’t read it

Yielding a read ratio of 20%.

  • 280 saw another article on the front page and finished reading it
  • 720 saw another article on the front page and didn’t read it

Yielding a read ratio of 28%.

So displaying A is worse for both repeat readers and new readers, but better in aggregate? That makes absolutely no sense when we put it like that.
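The per-group comparison can be reproduced with a short Python sketch. The dictionary layout and names are my own choice, and the last group is assumed to contain 720 visitors who didn’t finish another article, which is the count implied by the stated 28% read ratio and the 10,000 total displays.

```python
# (finished, did_not_finish) counts, keyed by visitor group and by what was shown.
counts = {
    "never read A": {"A": (60, 740), "other": (720, 8_280)},
    "already read A": {"A": (40, 160), "other": (280, 720)},
}

def read_ratio(finished: int, not_finished: int) -> float:
    """Share of displays that ended in a finished read."""
    return finished / (finished + not_finished)

# Read ratio for every (group, shown) pair.
ratios = {
    group: {shown: read_ratio(*pair) for shown, pair in arms.items()}
    for group, arms in counts.items()
}

for group, r in ratios.items():
    print(f"{group}: A {r['A']:.1%} vs other {r['other']:.1%}")
```

Within each group, A has the lower read ratio, which is the opposite of what the aggregate comparison suggested.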

Well, we can make sense of it if we take a closer look at the data.

What we’ll notice is that people that have already read A tend to read articles far more often in general. What we’ve done here is over-sample that group when delivering A on the front page: people that have already read A make up 20% of visitors when delivering A vs 10% when delivering the other articles.

Thus, we’ve unknowingly introduced a potential confounder in our analysis: the biased delivery of A to previous readers of A.
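To get a feel for how much the biased delivery matters, we can ask what A’s read ratio would look like if A had been shown to the same mix of visitors as the other articles. The sketch below uses direct standardization, a standard reweighting technique that I am bringing in for illustration; it is not something the experiment above did, and it only adjusts for the one confounder we already know about. The per-group read ratios and the 90%/10% mix come from the breakdown above; all names are made up.

```python
# Per-group read ratios from the breakdown above.
rate_a = {"never_read_A": 0.075, "already_read_A": 0.20}
rate_other = {"never_read_A": 0.08, "already_read_A": 0.28}

# Audience mix seen by the other articles: 9,000 new and 1,000 repeat visitors.
reference_mix = {"never_read_A": 0.9, "already_read_A": 0.1}

def standardized_rate(rates: dict, mix: dict) -> float:
    """Read ratio we would expect if the groups appeared in the proportions of `mix`."""
    return sum(rates[group] * mix[group] for group in mix)

adj_a = standardized_rate(rate_a, reference_mix)
adj_other = standardized_rate(rate_other, reference_mix)
print(f"A: {adj_a:.2%}, others: {adj_other:.2%}")
```

Under the same audience mix, A comes out at 8.75% against 10% for the other articles, which agrees with the per-group picture rather than with the aggregate one.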

Of course, we could try controlling for this confounder and sample in such a way as to not bias our delivery of A towards people that have previously read A.

The problem is that this requires us to re-run our experiment, potentially displaying A too much (if it’s a bad article) or too little (if it’s a quality article).

Even if we re-run the experiment, or remove some of the observations to remove bias from our data, we are still left with a strange phenomenon: people that have read A seem to read more articles.

Here are two hypotheses we can easily think of about this seemingly strange occurrence:

  • a) Reading A somehow increases the read ratio on all further articles, possibly because the quality of the article has encouraged our visitors to read other articles. This trend is less pronounced for A itself because those visitors have already read it.
  • b) Visitors that have already read A once are “heavy readers” that tend to read a lot more articles than our usual audience.

[Figure: Hypothesis a (left) and Hypothesis b (right)]
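Hypothesis b is worth making concrete, because it produces the same pattern without A having any effect at all. The sketch below computes exact read ratios for a toy audience in which finishing an article depends only on whether the visitor is a heavy reader. Every probability in it is hypothetical and chosen purely for illustration; none of them come from the experiment above.

```python
# All probabilities are hypothetical, chosen only to illustrate hypothesis b.
p_heavy = 0.10                              # share of heavy readers in the audience
p_read_a = {"heavy": 0.60, "casual": 0.10}  # chance of having already read A
p_finish = {"heavy": 0.30, "casual": 0.08}  # chance of finishing any featured article

p_type = {"heavy": p_heavy, "casual": 1 - p_heavy}

def finish_rate(already_read_a: bool) -> float:
    """P(finish | already read A or not), marginalizing over reader type."""
    # Joint probability of each reader type together with the A-reading status.
    weights = {
        t: p_type[t] * (p_read_a[t] if already_read_a else 1 - p_read_a[t])
        for t in p_type
    }
    total = sum(weights.values())
    # Finishing depends on reader type only, never on having read A.
    return sum(weights[t] / total * p_finish[t] for t in p_type)

print(f"already read A: {finish_rate(True):.1%}")
print(f"never read A:   {finish_rate(False):.1%}")
```

In this toy setup, previous readers of A finish articles noticeably more often than everyone else even though reading A changes nothing, simply because heavy readers are over-represented among them. Observational data alone can’t tell this apart from hypothesis a.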

In reality, it’s very likely that neither hypothesis a nor hypothesis b is 100% correct.

Of course, we could change our analysis and only sample first-time readers or heavy readers. But then, we’d simply be trading one confounder for another.

To figure out whether or not we should feature A, we need to investigate a few other issues, such as:

  • Does reading A cause users to read more articles than reading other articles?
  • Are the people in our second cohort (re-readers of A) more likely to be “heavy readers”, and by how much?
  • Do people who read A in the first place tend to come back to our website more often?

Note that any of the results from these investigations might have confounders of their own, which might require further experiments and investigations to elucidate.

Problems with ML models and confounders

This is a summary of two fundamental problems in applied data science: unknowingly biased datasets, and observations that change further observations (in this case, displaying A to a user and seeing whether they read it or not).

Many times, we need to gather data in order to build a model, but the experiments we perform in order to test said model are, in and of themselves, going to alter the data.

What’s more, once we deploy a model based on previously collected data, that model is going to alter any further data, making it hard to figure out if the model is working as intended.


Still, as humans, we can make simple inferences like the ones above. A machine learning model, on the other hand, might have a hard time figuring out that a dataset is biased. It will certainly be impossible for it to detect whether or not the experiment we used to collect that data introduced that bias.

The problem here is, once again, twofold.

First, off-the-shelf models don’t try to spot potential confounders in data. Thus, they can’t give the users information about potential causal relationships that might bias their predictions.

Secondly, even if we run an analysis of the data before feeding it to the model, and spot some causal relationships in it, there’s no way to inform the model about them.

Most of the “bleeding edge” machine learning models of today have no way of interpreting a simple input such as “columns X and Z might be influencing column Y via roughly this equation”.

We need to try and tweak our data to reflect these prior hypotheses, instead of informing the model itself about them.

I don’t claim to have a good solution to this problem, and I don’t believe one exists as of yet. That’s one of the main reasons why human brains are (sadly or happily enough) still required to advance our understanding of the world.

Machine learning models are bad at spotting causal relationships because that would require them to know information that isn’t present in the data itself. It may be that a good start to solving the problem is creating models in which we can input various hypotheses about the possible ways our data interact. Some such models already exist, but there isn’t much research into this topic when it comes to deep learning models, possibly because it’s easier said than done.

But, hopefully, this example is a way for me to better illustrate this problem to people that might be unaware of it, or that previously didn’t understand it from this causal point of view.