The Art of Establishing Cause: A Deep Dive into Causality

Shivani Shekhawat
Published in AI Skunks
11 min read · Apr 23, 2023

In statistics, researchers keep returning to one question: “Why is something happening?” Causal inference is the family of statistical methods that tries to answer it. Standard approaches such as regression tell us how changes in one variable are associated with changes in another; causal inference asks whether changes in one variable actually cause the changes in the other.

Suppose there are two variables, X and Y. Standard methods measure how strongly X and Y are associated. Causal inference asks whether a change in Y produces a change in X, so that we can explain changes in X in terms of changes in Y.

Let’s look at another example:

Deforestation and loss of biodiversity are causally linked through a chain of ecological consequences that result from the large-scale removal of trees. Deforestation is the cause, while loss of biodiversity, which refers to the reduction in the variety of plant and animal species in an ecosystem, is the effect.
The causal relationship between deforestation and loss of biodiversity can be explained through the direct and indirect impacts of forest clearing. Direct impacts include the destruction of natural habitats, which can lead to the decline or extinction of species that rely on those habitats for survival. Additionally, deforestation disrupts the ecosystem’s balance, which can result in a cascade of effects on food chains, predator-prey relationships, and other interspecies interactions. Indirect impacts involve changes to the broader environment, such as alterations in hydrological cycles, soil erosion, and increased carbon dioxide levels in the atmosphere, all of which can further contribute to habitat degradation and biodiversity loss. Collectively, these consequences highlight the causal connection between deforestation and the decline in biodiversity, emphasizing the need for sustainable forest management practices to preserve the health and diversity of ecosystems.

Why Is Causal Inference Important?

Causal inference is widely used to improve the experience of customers on a platform. Its insights help identify problems that customers face or that arise inside the organization. For example, suppose a product introduces a new feature and customers complain because its purpose is unclear or they are confused about how to use it. If we know that confusion is the cause, we can improve the communication about how to use the feature instead of sending more news updates about it or dropping the feature altogether.

Causal inference produces information that helps improve the user experience, and it supports business decisions by showing the impact each decision has on the business.

If we can understand the relationship between two intangible variables such as employee satisfaction and business metrics, we will be able to use such information to prioritize tasks and aim for new features and tools. Also, these inferences can help in understanding the short-term and long-term impact of any new decision or program.

Causal inference lets us answer causal questions from observational data, especially in situations where running an experiment is not possible or feasible. For example, suppose we start a campaign in which users of our product can mail in their queries and complaints, and we want to measure the impact of the campaign on the business. Causal inference helps answer this type of question, which can also lead to better user experiences on any platform.

The Causal Inference Workflow

Challenges in Causal Inference

Establishing causality is often challenging due to several factors. First, correlation does not necessarily imply causation; a relationship between two variables might be coincidental or a result of confounding factors. Second, reverse causation might be at play, where the presumed effect is the actual cause. Lastly, establishing causality often requires conducting controlled experiments, which can be expensive, time-consuming, or even unethical in some cases.

Methods for Causal Inference

  1. Randomized Controlled Trials (RCTs): RCTs are considered the gold standard for establishing causality in many fields, particularly in medicine. In an RCT, subjects are randomly assigned to a treatment group or a control group, so that on average the groups differ only in the treatment and systematic differences in outcomes can be attributed to it.
  2. Natural Experiments: When randomization is not feasible, researchers can exploit natural experiments, which occur when external factors or events create treatment and control groups by chance. By comparing the outcomes of these groups, researchers can draw causal inferences without directly manipulating the variables of interest.
  3. Instrumental Variables: Instrumental variables are used to address confounding and endogeneity issues in observational data. An instrumental variable is an external factor that is related to the independent variable, affects the dependent variable only through it, and is unrelated to the error term or unobserved factors in a regression model. This allows researchers to isolate the causal effect of the independent variable on the dependent variable.
  4. Difference-in-Differences: This method compares the changes in outcomes between a treatment group and a control group before and after an intervention, under the assumption that the two groups would have followed parallel trends in the absence of the intervention.
  5. Propensity Score Matching: This technique involves matching treated and control units based on their propensity to receive treatment, as predicted by observed characteristics. By comparing outcomes between matched pairs, researchers can estimate the causal effect of the treatment, provided the observed characteristics capture all confounders.
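
The difference-in-differences estimate in the list above is simple arithmetic on four group means. Here is a minimal sketch on synthetic data. All numbers are made up for illustration, and the `did_estimate` helper is hypothetical rather than part of any library:

```python
import random
from statistics import mean


def did_estimate(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """(change in the treated group) minus (change in the control group)."""
    return (mean(treat_post) - mean(treat_pre)) - (mean(ctrl_post) - mean(ctrl_pre))


rng = random.Random(0)
n = 5000
true_effect = 2.0   # assumed effect of the intervention
common_trend = 1.5  # time trend shared by both groups (parallel trends)

# The treated group starts from a higher baseline; DiD removes this fixed gap.
treat_pre = [10.0 + rng.gauss(0, 1) for _ in range(n)]
ctrl_pre = [7.0 + rng.gauss(0, 1) for _ in range(n)]
treat_post = [10.0 + common_trend + true_effect + rng.gauss(0, 1) for _ in range(n)]
ctrl_post = [7.0 + common_trend + rng.gauss(0, 1) for _ in range(n)]

# Comparing the groups only after the intervention mixes the baseline gap
# with the effect; DiD subtracts the gap and the shared trend.
naive = mean(treat_post) - mean(ctrl_post)
did = did_estimate(treat_pre, treat_post, ctrl_pre, ctrl_post)

print(f"naive post-period difference: {naive:.2f}")
print(f"difference-in-differences:    {did:.2f}")
```

The sketch builds the parallel-trends assumption into the data. If the two groups had different trends, the same arithmetic would return a biased number without any warning.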

Causal inference combines many methods, so it helps beginners to group them. A natural split is by the kind of data available:

  • Causal inference with experimental data
  • Causal inference with observational data
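
To see why the two kinds of data need different treatment, here is a small simulation sketch. It assumes a made-up data-generating process in which a hidden `health` variable affects the outcome, and the names `simulate` and `TRUE_EFFECT` are hypothetical. With randomized assignment the naive difference in means recovers the true effect; with self-selected treatment it does not.

```python
import random
from statistics import mean

rng = random.Random(42)
TRUE_EFFECT = 1.0  # assumed causal effect of the treatment on the outcome


def simulate(randomized, n=20000):
    """Return the naive difference in mean outcome, treated minus untreated."""
    treated, control = [], []
    for _ in range(n):
        health = rng.gauss(0, 1)  # confounder: always affects the outcome
        if randomized:
            # Experimental data: a coin flip, independent of health
            t = rng.random() < 0.5
        else:
            # Observational data: healthier units tend to select into treatment
            t = health + rng.gauss(0, 1) > 0
        y = TRUE_EFFECT * t + 2.0 * health + rng.gauss(0, 1)
        (treated if t else control).append(y)
    return mean(treated) - mean(control)


experimental = simulate(randomized=True)
observational = simulate(randomized=False)

print(f"true effect:                 {TRUE_EFFECT:.2f}")
print(f"naive estimate, randomized:  {experimental:.2f}")
print(f"naive estimate, self-select: {observational:.2f}")
```

In the observational case the treated group is healthier to begin with, so the naive comparison overstates the effect. The methods for observational data listed above exist to correct for this kind of imbalance.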

Correlation vs Causation

Correlation and causation are two commonly confused concepts in statistics and data analysis. Correlation refers to a statistical association between two variables, while causation implies a direct cause-and-effect relationship. It’s crucial to differentiate between the two to avoid misconceptions and make informed decisions. This section discusses the differences between correlation and causation and provides some examples to clarify the concepts.

Correlation is a statistical measure that expresses the extent to which two variables are related. A positive correlation indicates that as one variable increases, the other variable also increases. Conversely, a negative correlation means that as one variable increases, the other decreases. However, correlation does not imply causation.

Example 1: Ice cream sales and drowning incidents

During the summer months, there is a positive correlation between ice cream sales and the number of drowning incidents. As ice cream sales increase, so do the number of drowning incidents. However, this does not mean that ice cream sales cause drowning incidents. Instead, a lurking variable, such as the hot weather, can be a common cause of both increasing ice cream sales and people going swimming, which leads to a higher risk of drowning.
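
The ice cream example can be reproduced with a short simulation. This is a sketch on synthetic data: the coefficients are invented, and the `pearson` helper is written out by hand so the block needs only the standard library. Temperature drives both series, and neither affects the other.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


rng = random.Random(1)
temps, ice_cream, drownings = [], [], []
for _ in range(20000):
    t = rng.uniform(10, 35)                       # daily temperature, the common cause
    temps.append(t)
    ice_cream.append(5.0 * t + rng.gauss(0, 10))  # sales depend on temperature only
    drownings.append(0.2 * t + rng.gauss(0, 1))   # incidents depend on temperature only

overall = pearson(ice_cream, drownings)

# Hold the lurking variable roughly fixed: keep only days between 24 and 25 degrees.
band = [(i, d) for t, i, d in zip(temps, ice_cream, drownings) if 24 <= t < 25]
within_band = pearson([i for i, _ in band], [d for _, d in band])

print(f"correlation over all days:         {overall:.2f}")
print(f"correlation at fixed temperature:  {within_band:.2f}")
```

The correlation is strong across all days and close to zero once temperature is held roughly constant, which is what we expect when the association comes entirely from a common cause.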

Have a look at these spurious correlations: https://www.tylervigen.com/spurious-correlations

Causation is a cause-and-effect relationship between two variables. It means that one variable directly influences or causes the change in the other variable. Establishing causation requires more rigorous testing, such as experimental or longitudinal studies, to confirm the causal relationship.

Example 2: Smoking and lung cancer

Numerous studies have shown a strong causal relationship between smoking and lung cancer. These studies have demonstrated that smoking increases the risk of developing lung cancer, and quitting smoking reduces that risk. In this case, there is a clear cause-and-effect relationship: smoking causes an increased risk of lung cancer.

Common misconceptions and pitfalls

It’s essential to be cautious when interpreting correlations, as they can be misleading.

Some pitfalls include:

  • Confusing correlation with causation: Just because two variables are correlated does not mean that one causes the other.
  • Omitted variable bias: A lurking variable may be responsible for the observed correlation between two variables.
  • Reverse causation: The causal relationship may be in the opposite direction, with the effect causing the cause instead of the other way around.

Working with the Dataset

In this article, we explore causality using the ‘Hotel Bookings’ dataset from Kaggle. We start with exploratory data analysis (EDA) and then clean the data. Almost 95% of the values in the ‘company’ column are missing, so we drop that column.

import seaborn as sns
import matplotlib.pyplot as plt

# Assumes `dataset` is the hotel bookings DataFrame loaded earlier,
# e.g. with pandas.read_csv (the loading step is not shown in the article).

# Cancellations by hotel type
plt.figure()
sns.countplot(x=dataset['hotel'], hue=dataset['is_canceled'], palette='mako')
plt.savefig('Type_of_hotel-is_canceled_1.png')

City hotel bookings are cancelled more often than resort bookings.

# Cancellations by previous cancellations
plt.figure()
sns.countplot(x=dataset['previous_cancellations'], hue=dataset['is_canceled'], palette="Set1")
plt.savefig("prev_cancellation_is_cancelled_2.png")

Guests who have cancelled before are more likely to cancel again.

# Cancellations by repeated-guest status
plt.figure()
sns.countplot(x=dataset['is_repeated_guest'], hue=dataset['is_canceled'], palette='rocket')
plt.savefig("repeated_guest_is_cancelled_3.png")

Repeated guests cancel less often, plausibly because a guest who returns already likes the hotel or resort.

# Cancellations by deposit type
plt.figure()
sns.countplot(x=dataset['deposit_type'], hue=dataset['is_canceled'], palette='Set1')
plt.savefig("deposit_type_is_canceled_5.png")

This plot is surprising: when the deposit type is non-refundable, most bookings are cancelled. Intuitively, one would expect a non-refundable booking to be less likely to be cancelled.

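import matplotlib.pyplot as plt
import seaborn as sns

# Assumption: c_month and nc_month are DataFrames indexed by arrival month that hold
# the number of cancelled and non-cancelled bookings in an 'is_canceled' column;
# the article does not show how they are built.
f, axes = plt.subplots(1, 2, figsize=(15,8), sharex=True, sharey=True)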

sns.barplot(x=c_month.index, y=c_month['is_canceled'], ax=axes[0], color='darkblue')
sns.barplot(x=nc_month.index, y=nc_month['is_canceled'], ax=axes[1], color='darkblue')

axes[0].tick_params(axis='x', rotation=45)
axes[1].tick_params(axis='x', rotation=45)

axes[0].set_title("Booking canceled")
axes[1].set_title("Booking not canceled")

plt.savefig("is_cancelled_Acctomonth_7.png")

Plotting the Correlation

import matplotlib.pyplot as plt
import seaborn as sns

# Split the columns into categorical (object dtype) and numerical features
categorical_features = []
numerical_features = []

for col in dataset.columns:
    if dataset[col].dtype != 'object':
        numerical_features.append(col)
    else:
        categorical_features.append(col)
print(categorical_features)

# Heatmap of the pairwise correlations between the numerical features
plt.figure(figsize=(12, 7))
sns.heatmap(dataset[numerical_features].corr(), linewidths=2, linecolor='black', annot=True)
plt.savefig("heatmap_6.png")

Modelling with DoWhy

DoWhy is a Python library for causal inference. It combines statistical estimation methods with graphical models, and it asks you to state your causal assumptions explicitly as a causal graph before it estimates anything.

Picture yourself as a data detective working through a maze of variables, biases, and confounders. DoWhy hands you the magnifying glass: you encode your assumptions in a causal graph, the library identifies which causal effect can be estimated under those assumptions, and it then estimates that effect and helps you test how sensitive the result is to the assumptions.

For data scientists and researchers, it is a practical way to move an analysis from mere correlations toward causal conclusions.

import numpy as np

# Encode the string labels as 0/1. This assumes the columns were relabeled to
# strings during EDA, in a step the article does not show.
dataset['is_canceled'] = np.where(dataset['is_canceled'] == 'Booking not canceled', 0, 1)
dataset['is_repeated_guest'] = np.where(dataset['is_repeated_guest'] == "First time guest", 0, 1)
dataset['previous_cancellations'] = np.where(dataset['previous_cancellations'] == "No previous cancellation", 0, 1)

# Treatment and outcome as booleans ('different_room_assigned' is a derived column)
dataset['different_room_assigned'] = dataset['different_room_assigned'].replace(1, True)
dataset['different_room_assigned'] = dataset['different_room_assigned'].replace(0, False)
dataset['is_canceled'] = dataset['is_canceled'].replace(1, True)
dataset['is_canceled'] = dataset['is_canceled'].replace(0, False)

!pip install dowhy
!pip install causalgraphicalmodels
from graphviz import Source
from IPython.display import Image, display
import dowhy
from dowhy import CausalModel

from causalgraphicalmodels import CausalGraphicalModel

# Causal graph: each edge (a, b) encodes the assumption that a causes b.
# The node is named 'previous_cancellations' to match the dataset column.
graph = CausalGraphicalModel(
    nodes=['is_canceled', 'lead_time', 'unobserved_confounder', 'total_of_special_requests',
           'meal', 'country', 'market_segment',
           'is_repeated_guest',
           'previous_bookings_not_canceled', 'booking_changes', 'previous_cancellations',
           'required_car_parking_spaces',
           'days_in_waiting_list',
           'total_guests',
           'total_days', 'different_room_assigned', 'agent'],
    edges=[
        ('market_segment', 'lead_time'),
        ('lead_time', 'is_canceled'),
        ('country', 'lead_time'),
        ('different_room_assigned', 'is_canceled'),
        ('unobserved_confounder', 'is_canceled'),
        ('unobserved_confounder', 'lead_time'),
        ('unobserved_confounder', 'different_room_assigned'),
        ('country', 'meal'),
        ('lead_time', 'days_in_waiting_list'),
        ('days_in_waiting_list', 'is_canceled'),
        ('previous_bookings_not_canceled', 'is_canceled'),
        ('previous_bookings_not_canceled', 'is_repeated_guest'),
        ('is_repeated_guest', 'is_canceled'),
        ('total_days', 'is_canceled'),
        ('total_days', 'agent'),
        ('total_guests', 'is_canceled'),
        ('previous_cancellations', 'is_canceled'),
        ('previous_cancellations', 'is_repeated_guest'),
        ('required_car_parking_spaces', 'is_canceled'),
        ('total_guests', 'required_car_parking_spaces'),
        ('total_days', 'required_car_parking_spaces'),
        ('total_of_special_requests', 'is_canceled'),
        ('booking_changes', 'different_room_assigned'),
        ('booking_changes', 'is_canceled'),
    ]
)
G = graph.draw()
G
import statsmodels
import dowhy

# Assumption: `causal_graph` is the DOT source of the graph drawn above;
# the article does not show where it is defined.
causal_graph = G.source

model = dowhy.CausalModel(
    data=dataset,
    graph=causal_graph.replace("\n", " "),
    treatment="different_room_assigned",
    outcome='is_canceled')
# Identify the causal effect
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)
dataset['different_room_assigned'].value_counts()

Causal ML graphs, also known as causal diagrams or directed acyclic graphs (DAGs), are fascinating graphical representations that help us better understand the causal relationships between variables. Here are some interesting facts about causal ML graphs:

  1. DAGs: Causal ML graphs are directed acyclic graphs, which means that they consist of nodes (representing variables) connected by directed edges (arrows) without any cycles. This structure prevents feedback loops and ensures a clear flow of causality.
  2. Confounder identification: These graphs are extremely useful for identifying confounding variables, which are variables that influence both the cause and the effect, thus creating a spurious relationship between them.
  3. Visualization of assumptions: Causal ML graphs allow researchers to visualize and communicate their causal assumptions explicitly. This clarity helps in assessing the validity of the assumptions and in identifying potential biases in the analysis.
  4. Testable implications: By using causal ML graphs, researchers can identify testable implications, such as conditional independencies, which can help validate the causal assumptions and estimate the causal effects.
  5. Intervention analysis: Causal ML graphs enable researchers to perform “what-if” analyses by simulating interventions on variables, which helps estimate the causal effect of one variable on another under different conditions.
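
Two of the points above, acyclicity and confounder identification, can be checked mechanically from an edge list. The sketch below uses only the Python standard library on a subset of the hotel-booking edges. The helper names are hypothetical, and "common cause" here is a simplified notion (an ancestor of the treatment that also reaches the outcome without passing through the treatment), not a full implementation of the backdoor criterion that DoWhy applies.

```python
from graphlib import CycleError, TopologicalSorter

# Subset of the hotel-booking graph, written as (cause, effect) pairs
edges = [
    ("market_segment", "lead_time"),
    ("lead_time", "is_canceled"),
    ("unobserved_confounder", "is_canceled"),
    ("unobserved_confounder", "different_room_assigned"),
    ("booking_changes", "different_room_assigned"),
    ("booking_changes", "is_canceled"),
    ("different_room_assigned", "is_canceled"),
]


def parents_map(edge_list):
    """Map every node to the set of its direct causes (parents)."""
    parents = {}
    for cause, effect in edge_list:
        parents.setdefault(cause, set())
        parents.setdefault(effect, set()).add(cause)
    return parents


def is_acyclic(edge_list):
    """True if the directed graph has no cycle, i.e. it is a DAG."""
    try:
        # TopologicalSorter expects {node: predecessors}; iterating raises on a cycle
        list(TopologicalSorter(parents_map(edge_list)).static_order())
        return True
    except CycleError:
        return False


def ancestors(node, parents):
    """All nodes with a directed path into `node`."""
    seen, stack = set(), list(parents.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def common_causes(treatment, outcome, edge_list):
    """Ancestors of the treatment that reach the outcome without going through it."""
    all_parents = parents_map(edge_list)
    # Drop the treatment's outgoing edges so paths through it do not count
    without_treatment = parents_map([(c, e) for c, e in edge_list if c != treatment])
    return ancestors(treatment, all_parents) & ancestors(outcome, without_treatment)


print(is_acyclic(edges))
print(sorted(common_causes("different_room_assigned", "is_canceled", edges)))
```

For this subset, the variables flagged as common causes are the ones that point into both the treatment and the outcome, and these are the variables an analysis would need to adjust for or reason about.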

Conclusion

Causality is an intricate and fascinating concept that permeates our understanding of the world. While challenges and limitations exist, the pursuit of causal relationships remains crucial in advancing human knowledge and shaping our decisions. As we continue to explore the tapestry of cause and effect

License

All code in this notebook is available as open source through the MIT license.

All text and images are free to use under the Creative Commons Attribution 3.0 license. https://creativecommons.org/licenses/by/3.0/us/

These licenses let people distribute, remix, tweak, and build upon the work, even commercially, as long as they give credit for the original creation.

Copyright 2023 AI Skunks https://github.com/aiskunks

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Shivani Shekhawat
AI Skunks

MSIS Graduate Student at Northeastern University | Exploring the world of Data