Published in Analytics Vidhya

How to Combat Confounding Effect?

Confounding can be increasingly misleading!

As humans, we infer many rules and associations between two or more phenomena based on the observations we perceive through our senses. Confounding can create an illusion of understanding, since we tend to jump to quick conclusions from very limited observations. So, what is the confounding effect? We are going to answer this question and also propose some approaches to lessen its effect on our decision process. The ultimate goal is to reach sound, reasonable conclusions based on the observations.

What is the confounding effect?

Suppose that you’d like to infer whether there is any causal relationship between two variables. In statistics and measurement theory, this seems pretty straightforward: you record paired values and then conduct a statistical test, such as computing the Pearson correlation coefficient (PC), to see whether there is any relationship (preferably linear) between the two variables. However, you should be cautious with this approach. It only quantifies the strength of association the two variables exhibit with each other. Causal deduction, the ultimate goal of the scientific approach to understanding the world, is not guaranteed by PC analysis: correlation does not imply causation. The confounding effect emerges when the data show a seemingly strong relationship although, in reality, there is no relationship at all! The apparent relationship is just an illusion governed by a third factor lurking in our observations, one we have completely neglected. This is the confounding effect. It is pretty important to take any confounding factors into account when designing an experiment in order to reach a valid conclusion.
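To make the first step concrete, here is a minimal sketch of computing the Pearson coefficient for a set of paired measurements with NumPy (`np.corrcoef` returns the Pearson correlation matrix); the sample values are made up for illustration:

```python
import numpy as np

# Paired measurements of two variables (hypothetical sample data).
x = np.array([1.2, 2.3, 3.1, 4.0, 5.2, 6.1, 7.3, 8.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8])

# Pearson correlation coefficient between x and y.
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # close to 1: a strong linear association
```

A coefficient near 1 (or -1) only tells us the association is strong, which is exactly where the caution above applies: it says nothing by itself about causation.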

How does it work?

The confounding effect emerges from an extraneous variable that is correlated with both the independent and dependent variables. This correlation also reflects some kind of causal link. By looking only at the two variables at hand, you are actually seeing the confounding variable's relationship with each of them, projected onto the pair. Schematically, it looks like the figure below:

Confounding effect

The variable Z is connected to the two variables X and Y that we have measured. If we ignore the confounding effect, we might conclude that X and Y are connected when in fact they are not. This illusion, which leads us to a wrong conclusion, is a consequence of the extraneous variable Z projecting its effect onto both variables. Neglecting Z's effect can mislead us when deriving a causal link between two sets of measurements.
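The mechanism in the figure is easy to simulate. In the sketch below (synthetic data, coefficients chosen arbitrarily), X and Y each depend only on the confounder Z, never on each other, yet their measured correlation is strong:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Z is the lurking confounder; X and Y each depend on Z, not on each other.
z = rng.normal(size=n)
x = 1.5 * z + rng.normal(size=n)
y = 2.0 * z + rng.normal(size=n)

# X and Y appear strongly correlated even though neither causes the other.
r_xy = np.corrcoef(x, y)[0, 1]
print(round(r_xy, 2))  # well above zero: a purely spurious association
```

Everything the naive analysis "sees" between X and Y is inherited from Z; there is no direct X-to-Y term anywhere in the data-generating code.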

As a real-world example, imagine we are asked to find out whether there is any relationship between chocolate consumption and Nobel prize winners! We start by collecting samples from different countries around the world. Say we find a high correlation between chocolate consumption and the number of Nobel laureates. Should we settle on the conclusion that chocolate consumption greatly improves your chances of becoming a Nobel laureate? Absolutely not!

Where did we go wrong? Isn't high correlation an indication of a (linear) relationship? Of course it is, but that does not necessarily mean the relationship is genuine rather than spurious. After all, these are just numbers, and interpretation depends on the test context and the logic of the problem. If we cannot find any sensible meaning in the relationship, we shouldn't declare a result, since the ultimate goal of statistical analysis is to uncover the causal relationship between two events.

Chocolate and Nobel prize

So, how did we end up in this situation? There might be another factor (or factors) that drives this relationship and affects both variables. Think of this factor as the wealth of nations. The wealthier a nation, the more research funding is allocated to scientific discovery, and hence the more Nobel laureates it produces. And of course, the wealthier a nation, the more likely its people can afford chocolate bars (chocolate is pretty expensive and is not part of the essential grocery basket). This is how correlation gets mistaken for causation!

OK, now what? Having learned more about the confounding factor, is there any way to work around this nasty effect? The answer is yes, and it is presented in what follows.

Many approaches have been proposed to tackle the impact of confounding factors. Here are some of them:

  • Restriction: In this approach, we focus on the subset of the samples with a fixed value of the suspected confounding factor (note that we must first identify the confounding factor). By fixing the confounding factor, we break its connection to the two variables; remember that variance carries information (and hence connection). The plus side of this approach is that it is easy to implement. However, it shrinks the sample a great deal and may fail to account for other confounding factors.
  • Matching: Note that this approach applies when we have two data sets: one from a controlled experiment and one from an uncontrolled one. We match each sample in one group with a counterpart in the other group that has the same value (or range) of the confounding factor. By doing this, we ensure that the variation between the two groups results from variation in the non-confounding factors. The upside is that it allows you to include more subjects than restriction. However, since you need to find a counterpart for every value of the confounding factor, it can be cumbersome to implement, and it may still fail to account for other confounding factors.
  • Statistical tool: In this approach, we exploit a statistical tool called regression to mitigate the effect of the confounding factor. We include the possible confounders as control variables in the regression model; in this way, the impact of the confounding variable is controlled for. In other words, you remove the confounding factor's contribution and limit your attention to the residuals to see whether the effect remains. Again, this approach does not guarantee that you have taken all confounding factors into account.
  • Randomization: Another approach is to randomly assign samples to the two groups. For example, if you have controlled and uncontrolled groups, pool all the samples and then randomly partition them into two separate groups. This breaks any connection the confounding factors might have with group membership. Since these factors do not differ by group assignment, they cannot correlate with your independent variable and thus cannot confound your study. In other words, randomization ensures that both groups have the same confounding factors on average. This is the most effective approach to combating confounding, as it accounts for all confounding factors, including ones you never identified.
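Two of the approaches above, restriction and regression adjustment, can be sketched on the same synthetic confounded data as before (all variable names and coefficients here are illustrative assumptions, not a prescribed recipe):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000
z = rng.normal(size=n)              # the confounder
x = 1.5 * z + rng.normal(size=n)
y = 2.0 * z + rng.normal(size=n)    # note: no direct X -> Y effect exists

# Naive analysis: a strong but spurious correlation.
r_naive = np.corrcoef(x, y)[0, 1]

# Restriction: keep only samples where Z is (almost) constant.
mask = np.abs(z) < 0.1
r_restricted = np.corrcoef(x[mask], y[mask])[0, 1]

# Statistical tool: regress X and Y on Z and correlate the residuals,
# which is equivalent to adding Z as a control variable in a regression.
x_res = x - np.polyval(np.polyfit(z, x, 1), z)
y_res = y - np.polyval(np.polyfit(z, y, 1), z)
r_adjusted = np.corrcoef(x_res, y_res)[0, 1]

print(round(r_naive, 2))       # strongly positive
print(round(r_restricted, 2))  # near zero
print(round(r_adjusted, 2))    # near zero
```

Once Z is held fixed or regressed out, the apparent X-Y relationship essentially vanishes, which is exactly the behavior these two techniques are designed to expose. Note also the cost of restriction visible in the code: the mask discards the vast majority of the sample.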


In this short article, we briefly introduced the confounding factor as one of the bias errors in statistical inference. We explained how failing to consider a confounding factor causes misleading results that call the scientific validity of a study into question. We also presented some effective ways of mitigating the confounding effect.



Vahid Naghshin

Technology enthusiast, Futuristic, Telecommunications, Machine learning and AI savvy, work at Dolby Inc.