We hear the terms “correlation” and “causation” a lot, but what do they actually mean?
Correlation: describes how two variables change in relation to each other. When one variable increases, the other may increase, decrease or remain the same. For example, when it rains more, people tend to buy more umbrellas.
Causation: implies that one variable causes another variable to change. For example, we can confidently conclude that more rain causes more people to acquire umbrellas.
In this post I am going to explore the meaning of these terms further and explain a way of deciding how they relate, using a real-world example.
Survey completion rate example
Our organization helps other organizations to carry out surveys. They use our platform to design, send and analyze the results of surveys. There are different channels that our users can use for their surveys: SMS, USSD, IVR, Android app and web. SMS is, however, the most used channel.
Our main goal is to make our customers happy by giving them a platform that enables them to understand their customers more. One thing that we really wanted to do was to come up with factors that lead to more people completing surveys. This would enable our customers to hear more from their customers.
Our first task was to find how various factors relate to the completion rate. The completion rate of a survey is the percentage of people who complete the survey after being invited to take part in it. We came up with different factors that we thought could affect the completion rate of surveys. Here are the features:
- post_incentive: The incentive (a small amount of money or airtime) offered after completing the survey
- invite_day_of_month: The date of the month a respondent was invited to the survey
- invite_day_of_the_week: The day of the week a respondent was asked to take part in the survey
- invite_hour: The hour of the day the respondent was invited to the survey
- num_questions: The number of questions in the survey
- reminded: whether the respondent was reminded to complete the survey or not
- channel: The channel through which the survey was conducted: SMS, USSD, IVR, web, or Android app. SMS is the most popular channel and accounts for over 90% of surveys
- completion_rate: The completion rate, the percentage of those invited to the survey who completed it
We used surveys from the beginning of 2017 to August of 2017 to look for correlations between the factors above. The correlations between the factors are shown in the table below. Since the focus was on how the completion_rate relates to the other factors, I will concentrate on those relationships.
The rows of the table are arranged in descending order of the correlation between completion_rate and the other factors. Looking at the table, invite_hour, with a positive correlation of 0.25, is the factor with the strongest correlation with the completion rate. It is followed by reminded, while invite_day_of_month is the most negatively correlated with the completion_rate. The correlation between any other pair of factors can also be read from the table; for example, the correlation between num_questions and reminded is 0.05.
The bigger the magnitude of the correlation, the stronger the relationship. A positive correlation indicates that when one factor increases, the other tends to increase as well. For a negative correlation, the relationship is inverse: when one increases, the other tends to decrease.
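To make the sign and magnitude concrete, here is a minimal Python sketch of the Pearson correlation coefficient, one common way of measuring correlation. The `pearson` helper and the numbers are made up for illustration; they are not our survey data, and this is not necessarily the exact method used to produce the table.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of co-deviations and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, invented for this example only.
num_questions = [3, 5, 8, 10, 15, 20]
completion_rate = [82, 80, 71, 65, 50, 41]  # percent

r = pearson(num_questions, completion_rate)
print(f"correlation: {r:.2f}")  # negative: longer surveys, lower completion
```

The result always lies between -1 and 1. A value near 0 means there is little linear relationship, which is why a coefficient such as 0.25 should be read as a fairly weak one.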
The findings above can lead to wrong conclusions if one is not careful. For example, one might conclude that the invite hour, with a correlation of 0.25, has the highest influence on the completion_rate of a survey. As a result, you might start looking for the right time to send out surveys in the hope of getting more of them completed, and conclude that some invite hour is the optimum time to send out a survey. But that would be to hold to the (incorrect) idea that correlation implies causation. A high correlation may mean that one factor causes the other, that the factors cause each other, that both are caused by a separate third factor, or even that the correlation is a coincidence. This can be observed in the figure below.
We can therefore see that correlation does not always imply causation. With careful investigation however, it is possible to determine if a specific correlation really implies that one variable causes the other.
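The third-factor case can be illustrated with a small simulation. This is only a sketch on synthetic data: the variables `hidden`, `a` and `b` are invented, and `pearson` is a small helper written for this example. A hidden factor drives both `a` and `b`, so they are correlated even though neither causes the other. When we set `a` ourselves, independently of the hidden factor, the correlation disappears.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


random.seed(0)  # fixed seed so the run is repeatable
n = 5000

# A hidden third factor drives both observed variables.
hidden = [random.gauss(0, 1) for _ in range(n)]
a = [h + random.gauss(0, 1) for h in hidden]
b = [h + random.gauss(0, 1) for h in hidden]
observed_r = pearson(a, b)

# Intervention: we assign A ourselves, ignoring the hidden factor.
# B is unchanged because, in this simulation, A never influenced B.
a_assigned = [random.gauss(0, 1) for _ in range(n)]
intervened_r = pearson(a_assigned, b)

print(f"observed correlation:   {observed_r:.2f}")
print(f"after assigning A:      {intervened_r:.2f}")
```

The observed correlation is clearly positive, while the correlation after the intervention is close to zero. Observation alone could not have told these two situations apart.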
How can we verify that a correlation implies causation?
1. Use statistically sound techniques to determine the relationship
Ensure that you use statistically legitimate methods to find the correlation. These include:
- use variables that correctly quantify the relationship
- check for outliers, which can distort the correlation coefficient
- ensure the sample is an appropriate representation of the population
- use a correlation coefficient that is appropriate for the measurement scales of the variables
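The points about outliers and the choice of coefficient can be sketched as follows. The data is invented, and `pearson`, `ranks` and `spearman` are small helpers written for this example rather than a library API. The sketch compares the Pearson coefficient, which works on the raw values, with the Spearman rank coefficient, which works on their ranks and is therefore less sensitive to extreme values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Nine points on a perfectly decreasing line (hypothetical data).
x_clean = list(range(1, 10))
y_clean = [10 - x for x in x_clean]

# The same points plus one extreme outlier at (30, 30).
x_outlier = x_clean + [30]
y_outlier = y_clean + [30]

print(f"Pearson, clean data:    {pearson(x_clean, y_clean):.2f}")
print(f"Pearson, with outlier:  {pearson(x_outlier, y_outlier):.2f}")
print(f"Spearman, with outlier: {spearman(x_outlier, y_outlier):.2f}")
```

A single extreme point is enough to flip the sign of the Pearson coefficient here, while the Spearman coefficient stays negative. Which coefficient is appropriate still depends on the variables: for a categorical factor such as channel, neither is a natural fit without first deciding how to encode it.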
2. Explain the relationships found
- check that exposure always precedes the outcome: if A is supposed to cause B, A must always occur before B
- check if the relationship ties in with other existing theories
- check if the proposed relationship is similar to other relationships in related fields
- check that no other factor can explain the relationship. In the case above, a more plausible explanation for the headaches could be drinking rather than sleeping with shoes on
3. Validate the relationships
- The explanations found above should be tested to determine if they are true or false. The common methods of testing are experiments and checking for consistency of the relationship. An experiment usually requires a model of the relationship, a testable hypothesis based on the model, incorporation of variance control measures, collection of suitable metrics for the relationship, and an appropriate analysis. Experiments done several times should lead to consistent conclusions.
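As a rough sketch of what such an experiment could look like for the invite hour, the code below randomly assigns respondents to two invite times and then compares the completion rates with a two-proportion z-test. Everything here is an assumption made for illustration: the function names, the two invite times and the counts are invented, and the z-test is just one common choice of analysis.

```python
import math
import random


def assign_groups(respondent_ids, seed=0):
    """Randomly split respondents into two groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(respondent_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def two_proportion_z_test(completed_a, invited_a, completed_b, invited_b):
    """Return (z, two-sided p-value) for the difference in completion rates."""
    rate_a = completed_a / invited_a
    rate_b = completed_b / invited_b
    pooled = (completed_a + completed_b) / (invited_a + invited_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / invited_a + 1 / invited_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Random assignment is the variance control: it balances hidden factors
# across the two groups on average.
morning_group, evening_group = assign_groups(range(2000))

# Hypothetical outcome counts, invented for this example.
morning_completed = 300  # invited in the morning
evening_completed = 360  # invited in the evening

z, p_value = two_proportion_z_test(
    morning_completed, len(morning_group),
    evening_completed, len(evening_group),
)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Because the invite time is assigned at random, a difference that is too large to be explained by chance can be attributed to the invite time itself. Repeating the experiment on other surveys and periods would then check whether the conclusion is consistent.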
We have not yet carried out these tests on our completion rate correlations. So we don’t yet know, for example, whether particular invite hours cause higher completion rates.
We need to be careful before concluding that a particular relationship implies causation. It is generally better not to have a conclusion than to land on an incorrect one which might lead to wrong actions being taken!