It turns out that it’s shockingly easy to do some very reasonable things with data (aggregate, slice, average, etc.), and come out with answers that have 2000% error! In this post, I want to show why that’s the case using some very simple, intuitive pictures. The resolution comes from having a nice model of the world, in a framework put forward by (among others) Judea Pearl.
We’ll see why it’s important to have an accurate model of the world, and what value it provides beyond the (immeasurably valuable) satisfaction of our intellectual curiosity. After all, what we’re really interested in is, in some context, what is the effect of one variable on another. Do you really need a model to help you figure that out? Can’t you just, for example, dump all of your data into the latest machine-learning model and get answers out?
What is bias?
In this second post (first here) in our series on causality, we’re going to learn all about “bias”. You encounter bias any time you’re trying to measure something, and your result ends up different from the true result. Bias is a general term that means “how far your result is from the truth”. If you wanted to measure the return on investment from an ad, you might measure a 1198% increase in searches for your product instead of the true 5.4%. If you wanted to measure sex discrimination in school admissions, you might measure strong bias in favor of men when it was actually (weakly) in favor of women.
What causes bias? How can we correct it, and how does our picture of how the world works factor in to that? To answer these questions, we’ll start with some textbook examples of rain and sidewalks. We’ll return to our disaster example from the last post, and compare it with something called “online activity bias”.
A Tale of Wet Sidewalks
Judea Pearl uses a simple and intuitive example throughout his discussion of paradoxes. We’ll borrow his example, mainly for its clarity, and then move on to some other examples we might care more about.
In this example, we’re examining what causes the sidewalk to get wet. We have a sprinkler that runs on a timer, and gets the sidewalk wet whenever it comes on. We also know that the sidewalk gets wet whenever it rains. We record these three variables every day, and come up with a nice data set. Our diagram summarizes our model for how the world works, and it suggests a nice sanity check: we can check to see if the rain is correlated with the sprinkler being on. When we do this on our hypothetical data set, we find that it’s not. Everything looks good!
Now, consider a thought problem. What if I know (1) that the sidewalk is wet, and (2) that it didn’t rain? What does that tell me about whether or not the sprinkler was on?
If we remove one explanation for the sidewalk being wet (we know it didn’t rain), then the others have to become more likely! If you know that the sidewalk is wet, suddenly knowing that it didn’t rain tells you something about whether the sprinkler is on. In the context of our knowledge about the wetness of the sidewalk, the sprinkler and the rain become statistically dependent! This isn’t a harmless effect. Let’s spend another minute trying to understand what’s going on.
If we restricted our data to only include days when the sidewalk was wet, we’d find a negative relationship between whether it has rained and whether the sprinkler was on. This happens for the reason we’ve been talking about: if the sidewalk is wet, and it hasn’t rained, then the sprinkler was probably on. If the sidewalk is wet and the sprinkler wasn’t on, then it has probably rained. Even though the two are uncorrelated in the original data, in the restricted data they are negatively correlated!
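Here is a minimal simulation sketch of that restriction. The probabilities (rain on 30% of days, sprinkler on 40% of days) are made up for illustration, and the sidewalk is assumed to be wet exactly when it rained or the sprinkler ran. It compares how often the sprinkler was on for rainy versus dry days, first across all days and then only across wet-sidewalk days.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Assumed (made-up) daily probabilities; rain and sprinkler are independent.
P_RAIN, P_SPRINKLER = 0.3, 0.4
N_DAYS = 100_000

days = []
for _ in range(N_DAYS):
    rain = random.random() < P_RAIN
    sprinkler = random.random() < P_SPRINKLER
    wet = rain or sprinkler  # the wet sidewalk is the common effect
    days.append((rain, sprinkler, wet))

def sprinkler_rate(rows):
    """Fraction of the given days on which the sprinkler was on."""
    return sum(s for _, s, _ in rows) / len(rows)

# Split by rain, first over all days, then over wet-sidewalk days only.
all_rain = [d for d in days if d[0]]
all_dry = [d for d in days if not d[0]]
wet_rain = [d for d in days if d[2] and d[0]]
wet_dry = [d for d in days if d[2] and not d[0]]

gap_all = sprinkler_rate(all_rain) - sprinkler_rate(all_dry)
gap_wet = sprinkler_rate(wet_rain) - sprinkler_rate(wet_dry)

print(f"all days: P(sprinkler | rain) - P(sprinkler | no rain) = {gap_all:+.3f}")
print(f"wet days: P(sprinkler | rain) - P(sprinkler | no rain) = {gap_wet:+.3f}")
```

Across all days the gap should sit near zero. Among wet days it should be strongly negative, because every wet day without rain must have had the sprinkler on.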
This happens because we’re not examining the world naively. We know something. If “the sidewalk is wet” and “it didn’t rain”, then “the sprinkler was probably on”. Statements like “If … then …” are called “conditional” statements. When we’re reasoning in the context of knowing something (the part that follows the “if”, before the “then”), then we’re talking about “conditional” knowledge. We’ll see that conditioning without realizing it can be extremely dangerous: it causes bias.
It turns out that this effect happens in general, and you can think of it in terms of these pictures. Conditioning on a common effect results in two causes becoming correlated, even if they were uncorrelated originally! This seems paradoxical, and it has also been called “Berkson’s paradox”. Looking at the diagram, it’s easy to identify a common effect, and trace the variables upstream from it: we know that conditional on this common effect, all of the upstream variables can become dependent.
We can put precise math terms on it for anyone who is interested (understanding the rest of the article doesn’t depend on understanding the next two sentences). The sprinkler and rain are independent, but they are not conditionally independent given that the sidewalk is wet. Conditional independence doesn’t imply (and is not implied by) independence.
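For readers who want the arithmetic, here is a small worked example. Write R for rain, S for the sprinkler being on, and W for a wet sidewalk. The numbers are assumptions chosen for illustration: P(R) = 0.3, P(S) = 0.4, and the sidewalk is wet exactly when R or S holds.

```latex
\begin{align*}
P(S \mid R) &= P(S) = 0.4 && \text{(independence: rain says nothing about the sprinkler)} \\
P(W) &= 1 - (1 - 0.3)(1 - 0.4) = 0.58 \\
P(S \mid W) &= \frac{P(S)}{P(W)} = \frac{0.4}{0.58} \approx 0.69 \\
P(S \mid W, \neg R) &= 1 && \text{(wet with no rain: only the sprinkler is left)} \\
P(S \mid W, R) &= P(S \mid R) = 0.4 && \text{(rain already explains the wet sidewalk)}
\end{align*}
```

Since P(S | W, R) differs from P(S | W, not R), the sprinkler and the rain are dependent once we condition on W, even though they are independent unconditionally.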
Now, we can see how the same structure leads to a type of bias that can easily happen in an experiment.
Do Math Nerds Have Poor Social Skills?
You’re applying for a job, and a company will hire you either if you have very good social skills (and are competent technically), or if you have very good technical skills (and are competent socially). You could have both, but having very good skill at one or the other is a requirement. This picture of the world looks something like fig. 2. Look familiar?
If you’re only looking at people within the company, then you know they were all hired. Possibly without realizing it, you’ve conditioned on the fact that everyone was hired (think: the sidewalk is wet). In this context, knowing someone has great social skills makes it less likely that they have great technical skills (and vice versa), even though the two are uncorrelated in the general population.
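A quick simulation sketch makes the hiring effect concrete. Everything numeric here is an assumption for illustration: skills are drawn as independent standard normal scores, “very good” means a score above 1.5, and “competent” means a score above 0.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Assumed thresholds on a standard-normal skill scale (illustrative only).
VERY_GOOD, COMPETENT = 1.5, 0.0
N_APPLICANTS = 200_000

# Each applicant is (social, technical); the two skills are independent.
applicants = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(N_APPLICANTS)]

def hired(social, technical):
    """Hiring rule from the text: excel at one skill, be competent at the other."""
    return ((social > VERY_GOOD and technical > COMPETENT)
            or (technical > VERY_GOOD and social > COMPETENT))

def technical_star_rate(people, social_star):
    """P(very good technically), among people who are / are not very good socially."""
    group = [t for s, t in people if (s > VERY_GOOD) == social_star]
    return sum(t > VERY_GOOD for t in group) / len(group)

employees = [p for p in applicants if hired(*p)]

for label, people in (("everyone", applicants), ("employees", employees)):
    with_social = technical_star_rate(people, True)
    without_social = technical_star_rate(people, False)
    print(f"{label:>9}: P(tech star | social star) = {with_social:.3f}, "
          f"P(tech star | not social star) = {without_social:.3f}")
```

In the full applicant pool the two rates should be about equal. Among employees, anyone without very good social skills must have very good technical skills to have been hired at all, so the second rate is 1 while the first stays small.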
This effect introduces real bias into experiments. If you’re doing online studies (even randomized AB tests!) on a website, you’re conditioning on the fact that the person has visited your site. If you’re doing a survey study at a college, there can be bias due to the fact that everyone has been admitted. Bias introduced from this kind of conditioning is called “selection bias”. The situation is worse: bias is introduced even if we’re conditioning on effects of being hired, like job title or department (e.g. by surveying everyone within a department). Conditioning on downstream effects can introduce bias too!
From these examples, you might conclude that conditioning is a terrible thing! Unfortunately, there are also cases where conditioning actually corrects bias that you’d have without conditioning!
It turns out that the picture is the key. Before, we were considering bias due to conditioning on common effects (variables where arrows collide). Now, we’ll switch the arrows around, and talk about bias due to not conditioning on common causes (variables from which arrows diverge).
Consider the (simplified) disaster example from last time, in fig. 3. In this picture, a disaster might cause traffic. It also might cause my alarm clock to fail to go off (by causing a power failure). Traffic and my alarm going off are otherwise independent of each other.
If I were to check whether traffic was correlated with my alarm going off, I’d find that it was, even though there’s no causal relationship between the two! If there is a disaster, there will be bad traffic, and my alarm will fail to go off. Unplugging my alarm clock doesn’t cause traffic outside, and neither does traffic (say, from sporting events) cause my alarm clock to fail to go off. The correlation is spurious, and is due entirely to the common cause, the disaster, affecting both the alarm and the traffic.
If you want to remove this spurious relationship, how can you do it? It turns out that conditioning is the answer! If I look at data where there is no disaster, then I’ll find that whether my alarm goes off and whether there is traffic are uncorrelated. Likewise, if I know there was a disaster, knowing my alarm didn’t go off doesn’t give me additional information (since I already know there was a disaster) about whether there will be traffic.
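The same point can be checked with a small simulation sketch. The probabilities below are invented for illustration: a disaster happens on 10% of days, and it raises both the chance of bad traffic and the chance that the alarm fails, while traffic and the alarm never influence each other directly.

```python
import random

random.seed(2)  # fixed seed so the sketch is reproducible

# Assumed (made-up) probabilities for the simplified disaster picture.
P_DISASTER = 0.1
P_TRAFFIC = {True: 0.9, False: 0.2}       # P(bad traffic | disaster?)
P_ALARM_FAILS = {True: 0.8, False: 0.05}  # P(alarm fails | disaster?)
N_DAYS = 200_000

days = []
for _ in range(N_DAYS):
    disaster = random.random() < P_DISASTER
    traffic = random.random() < P_TRAFFIC[disaster]
    alarm_fails = random.random() < P_ALARM_FAILS[disaster]
    days.append((disaster, traffic, alarm_fails))

def traffic_gap(rows):
    """P(traffic | alarm failed) - P(traffic | alarm went off), within rows."""
    failed = [t for _, t, a in rows if a]
    worked = [t for _, t, a in rows if not a]
    return sum(failed) / len(failed) - sum(worked) / len(worked)

gap_all = traffic_gap(days)
gap_disaster = traffic_gap([d for d in days if d[0]])
gap_no_disaster = traffic_gap([d for d in days if not d[0]])

print(f"all days:         {gap_all:+.3f}")
print(f"disaster days:    {gap_disaster:+.3f}")
print(f"no-disaster days: {gap_no_disaster:+.3f}")
```

Across all days, a failed alarm should go along with noticeably more traffic. Within the disaster days and within the no-disaster days, the gap should shrink to roughly zero.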
Real World Bias
Bias due to common causes is called “confounding”, and it happens all the time in real contexts. This is the source of the (greater than) 1000 percentage point bias we mentioned in the introduction. The world actually looks something like figure 4. It’s the reason why it’s wrong to naively group objects by some property (e.g. article category) and compare averages (e.g. shares per article).
In this picture, we’re interested in whether people search for a product online. We want to understand how effective an advertisement is, and so we’d like to know the causal effect of seeing the ad on whether you’ll search for a product. Unfortunately, there’s a common cause of both. If you’re an active internet user, you’re more likely to see the ad. If you’re an active internet user, you’re also more likely to search for the product (independently of whether you see the ad or not). This kind of bias is called “activity bias,” and the effect you’d measure without taking it into account is more than 200 times greater than the true effect of the ad.
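To get a feel for how activity bias inflates a naive estimate, here is a simulation sketch. None of these numbers come from the study mentioned above; they are invented. Active users are assumed to be more likely both to see the ad and to search, and the ad is assumed to add one percentage point to everyone’s search probability.

```python
import random

random.seed(3)  # fixed seed so the sketch is reproducible

# All numbers below are invented for illustration.
P_ACTIVE = 0.2
P_SEES_AD = {True: 0.8, False: 0.1}        # P(sees ad | active?)
P_SEARCH_BASE = {True: 0.30, False: 0.02}  # P(search | active?) without the ad
TRUE_AD_EFFECT = 0.01                      # search probability added by the ad
N_USERS = 500_000

users = []
for _ in range(N_USERS):
    active = random.random() < P_ACTIVE
    sees_ad = random.random() < P_SEES_AD[active]
    p_search = P_SEARCH_BASE[active] + (TRUE_AD_EFFECT if sees_ad else 0.0)
    searches = random.random() < p_search
    users.append((active, sees_ad, searches))

def search_rate(rows):
    """Fraction of the given users who searched for the product."""
    return sum(s for _, _, s in rows) / len(rows)

# Naive comparison: everyone who saw the ad versus everyone who did not.
exposed = [u for u in users if u[1]]
unexposed = [u for u in users if not u[1]]
naive_effect = search_rate(exposed) - search_rate(unexposed)

print(f"true effect of the ad: {TRUE_AD_EFFECT:.3f}")
print(f"naive estimate:        {naive_effect:.3f}")
```

With these made-up settings the naive estimate should come out more than ten times larger than the true effect, purely because the exposed group contains far more active users.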
Fortunately, experiments can get around this problem. If you randomize who gets to see the ad, then you break the relationship between online activity and seeing the ad. In other words, you delete the arrow between “activity” and “sees ad” in the picture. This is a very deep concept worth its own post, which we’ll do in the future!
You could also remove the bias by conditioning, but that depends strongly on how good your measurement of activity is. Experiments are always the first choice. If you can’t do one, conditioning is a close second. We’ll also detail some different approaches to conditioning in a future post! For now, let’s try to draw out the basic conclusions.
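Both fixes can be sketched in a few lines of simulation. The setup is an assumption for illustration (invented probabilities, and an ad that adds one percentage point to everyone’s search probability). The function names `simulate`, `effect`, and `adjusted_effect` are hypothetical helpers, not part of any library. The conditioning here is the simplest possible kind: estimate the effect separately for active and inactive users, then average those estimates by group size.

```python
import random

random.seed(4)  # fixed seed so the sketch is reproducible

# All numbers below are invented for illustration.
P_ACTIVE = 0.2
P_SEES_AD = {True: 0.8, False: 0.1}        # P(sees ad | active?), observational
P_SEARCH_BASE = {True: 0.30, False: 0.02}  # P(search | active?) without the ad
TRUE_AD_EFFECT = 0.01
N_USERS = 500_000

def simulate(n, randomize):
    """Return (active, sees_ad, searches) rows; randomize=True shows the ad by coin flip."""
    rows = []
    for _ in range(n):
        active = random.random() < P_ACTIVE
        p_ad = 0.5 if randomize else P_SEES_AD[active]
        sees_ad = random.random() < p_ad
        p_search = P_SEARCH_BASE[active] + (TRUE_AD_EFFECT if sees_ad else 0.0)
        rows.append((active, sees_ad, random.random() < p_search))
    return rows

def effect(rows):
    """Search rate of users who saw the ad minus that of users who did not."""
    saw = [s for _, ad, s in rows if ad]
    not_saw = [s for _, ad, s in rows if not ad]
    return sum(saw) / len(saw) - sum(not_saw) / len(not_saw)

def adjusted_effect(rows):
    """Average the within-stratum effects, weighted by each stratum's share of users."""
    total = 0.0
    for level in (True, False):
        stratum = [r for r in rows if r[0] == level]
        total += effect(stratum) * len(stratum) / len(rows)
    return total

observational = simulate(N_USERS, randomize=False)
experiment = simulate(N_USERS, randomize=True)

print(f"true effect:               {TRUE_AD_EFFECT:.3f}")
print(f"naive, observational:      {effect(observational):.3f}")
print(f"conditioned on activity:   {adjusted_effect(observational):.3f}")
print(f"randomized experiment:     {effect(experiment):.3f}")
```

The conditioned estimate recovers the true effect here only because activity is recorded perfectly in the simulated data. With a noisy measurement of activity, some of the bias would remain.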
To Condition or Not to Condition?
We’ve seen that bias can come from conditioning when you’re conditioning on a common effect, and doesn’t exist when you don’t condition. We’ve also seen that bias can come from not conditioning on a common cause, and goes away when you do condition. The “back-door criterion” tells you, given any sufficiently complete (strong caveat!) picture of the world, what you should and shouldn’t condition on. There are two criteria: (1) You don’t condition on a common effect (or any effects of “Y”), and (2) you do condition on common causes! This covers all of our bases, and so applying the back-door criterion is the solution, but the requirement that you have to know the right picture of the world is a strong one.
This leaves the question open about how we should “do science”. Do we try to build the picture of the world, and find the right things to condition on, so we can estimate any effect we like? Is the world sufficiently static that the picture changes slowly enough for this approach to be okay? (Pearl makes arguments that the picture is actually very static!) Or will it only ever be feasible to do one-off experiments, and estimate effects as we need them? Physics certainly wouldn’t have gotten as far as it has if it had taken the latter approach.
Finally, you may have realized that conditioning on “the right” variables is a stark contrast to the usual, dump-all-of-your-data-in approach to machine learning. This is also a topic worth its own post! It turns out that if you want the best chance of having a truly “predictive” model, you probably want to do something more like applying the back-door criterion than putting all of your past observational data into a model. You can get around this by being careful not to put downstream effects into the model. The reason why has to do with the “do” operation, and the deep difference between intervention and observation. It’s definitely worth its own article!