Crash Course in Causality: Smoking Leads to Death?

Published in AI Skunks · 8 min read · Apr 23, 2023

Shalini Shree, Nik Brown

In today’s data-driven world, the ability to extract meaningful insights from complex datasets has become a vital skill. Causal inference, a cornerstone of modern data science, is a powerful tool that empowers researchers and decision-makers to understand the underlying relationships between variables and determine the impact of interventions. At its core, causal inference helps us answer the fundamental question: “What would happen if we did X?” This intriguing branch of statistics transcends mere correlation, enabling us to establish cause-and-effect relationships that are paramount for informed decision-making in fields ranging from healthcare and economics to public policy and marketing.

This article will utilize a sample dataset that includes data on deaths caused by smoking among different age groups and genders in France. The objective is to explore whether there is a causal relationship between smoking and death or if there is only a correlation between the two variables.

What is causality?

Causality refers to the relationship between cause and effect. In other words, it refers to the idea that one event (the cause) can directly or indirectly result in another event (the effect). Causal relationships are important in many fields, including science, economics, and social sciences.

In causality, we use the treatment and outcome variables to investigate whether a causal relationship exists between them. The treatment variable is the variable we want to investigate as a potential cause of the outcome variable. The outcome variable is the variable we want to investigate as the effect of the treatment variable.

We start by defining a hypothesis about the causal relationship between the treatment and outcome variables. We then collect data on both variables and any potential confounding variables. We then use statistical methods, such as regression analysis or experimental design, to estimate the causal effect of the treatment variable on the outcome variable, while controlling for the potential confounding variables. The goal is to determine whether the treatment variable has a significant effect on the outcome variable, and if so, the size and direction of the effect. If a significant causal relationship is found between the treatment and outcome variables, this can provide evidence for a potential cause-and-effect relationship between them.

A confounding variable is a variable that affects both the independent variable (treatment) and the dependent variable (outcome). Because it influences both, it distorts the observed relationship between treatment and outcome, which makes the true causal effect hard to determine and can lead to incorrect conclusions about causality.
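
To make the distortion concrete, here is a small simulation sketch. Everything in it is invented for illustration: a binary confounder z, a binary treatment t, an outcome y, and a true treatment effect fixed at 1.0. It compares the naive treated-versus-control difference with a difference computed within each level of z.

```python
import random
from statistics import mean

random.seed(0)

# Synthetic population (not real data). z is a binary confounder: it raises
# both the chance of being treated and the outcome itself.
TRUE_EFFECT = 1.0
rows = []
for _ in range(20_000):
    z = random.random() < 0.5
    t = random.random() < (0.8 if z else 0.2)
    y = TRUE_EFFECT * t + 3.0 * z + random.gauss(0, 1)
    rows.append((z, t, y))


def diff_in_means(data):
    """Mean outcome of treated rows minus mean outcome of control rows."""
    treated = [y for _, t, y in data if t]
    control = [y for _, t, y in data if not t]
    return mean(treated) - mean(control)


# Naive comparison: ignores z, so it mixes the effect of t with the effect of z.
naive = diff_in_means(rows)

# Adjusted comparison: compute the difference inside each level of z, then
# average the two differences weighted by the size of each stratum.
strata = ([r for r in rows if r[0]], [r for r in rows if not r[0]])
adjusted = sum(diff_in_means(s) * len(s) / len(rows) for s in strata)

print(f"naive difference:    {naive:.2f}")
print(f"adjusted difference: {adjusted:.2f}")
```

The naive difference lands well above the true effect because treated units are more likely to have z, while the stratified difference stays close to 1.0.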

Why is causal inference important?

Causal inference is important because it allows us to understand the relationship between variables and determine whether one variable is actually causing changes in another. This information is crucial in many fields, including medicine, public health, economics, and social sciences, as it can help us make informed decisions and create effective interventions.

Furthermore, causal inference is important because it allows us to move beyond simply observing correlations between variables and to make statements about the actual causal relationship between them. This helps us avoid making incorrect assumptions about the relationship between variables, which can lead to ineffective interventions and wasted resources.

For example, in medicine, causal inference can help determine whether a particular treatment is effective in improving patient outcomes, or whether a certain risk factor is contributing to the development of a disease. In public health, causal inference can help identify the factors that are contributing to the spread of a disease and guide the development of interventions to control its spread.

What are the differences and similarities between causality and correlation?

Causality and correlation are both concepts that describe the relationship between variables, but they differ in important ways.

Correlation is a statistical measure that describes the strength and direction of the relationship between two variables. The most common version, the Pearson correlation coefficient, measures how closely the two variables follow a straight-line relationship. It ranges from -1 to +1, where -1 indicates a perfect negative linear relationship, 0 indicates no linear relationship, and +1 indicates a perfect positive linear relationship. Correlation does not imply causation: just because two variables are correlated, it does not necessarily mean that changes in one variable cause changes in the other.
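
As a quick illustration, the Pearson correlation coefficient can be computed by hand. The function name pearson_r and the toy numbers below are made up for this sketch. The third case shows a variable that is fully determined by x and still has a correlation of zero, because the relationship is not linear.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


x = [-2, -1, 0, 1, 2]

r_up = pearson_r(x, [2 * v + 1 for v in x])   # points on an increasing line
r_down = pearson_r(x, [-3 * v for v in x])    # points on a decreasing line
r_curve = pearson_r(x, [v ** 2 for v in x])   # y depends on x, but not linearly

print(round(r_up, 6), round(r_down, 6), round(r_curve, 6))
```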

Causality, on the other hand, refers to the relationship between cause and effect. It implies that changes in one variable cause changes in another variable. Causality can be inferred through various methods, including randomized experiments, natural experiments, and observational studies using causal inference techniques. Establishing causality requires more evidence than establishing correlation, as it involves ruling out alternative explanations for the observed relationship.
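
The value of a randomized experiment can be sketched with a short simulation. The setup is hypothetical: a hidden trait z raises the outcome, the true treatment effect is 1.0, and treatment is assigned either by self-selection that depends on z or by a coin flip.

```python
import random
from statistics import mean

random.seed(1)
TRUE_EFFECT = 1.0


def simulate(randomized, n=20_000):
    """Return the treated-minus-control difference in mean outcome."""
    treated, control = [], []
    for _ in range(n):
        z = random.random() < 0.5                      # hidden trait that raises the outcome
        if randomized:
            t = random.random() < 0.5                  # coin flip, unrelated to z
        else:
            t = random.random() < (0.8 if z else 0.2)  # self-selection driven by z
        y = TRUE_EFFECT * t + 3.0 * z + random.gauss(0, 1)
        (treated if t else control).append(y)
    return mean(treated) - mean(control)


observational = simulate(randomized=False)
experimental = simulate(randomized=True)

print(f"observational difference: {observational:.2f}")
print(f"randomized difference:    {experimental:.2f}")
```

Because the coin flip breaks the link between z and the treatment, the randomized difference sits near the true effect without any adjustment, while the observational one does not.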

Let's take a look at our example dataset

What is the DoWhy library?

DoWhy is a Python library for causal inference that provides a unified interface for estimating causal effects using different methods. It is designed to facilitate causal analysis and automate the steps involved in causal inference, including the identification of causal relationships, the estimation of causal effects, and the testing of assumptions.

DoWhy allows us to easily specify causal models using a high-level language that is based on the graphical model notation. It then automatically identifies the causal effect to be estimated and generates a corresponding statistical model. It supports a wide range of methods for causal inference, including regression, matching, weighting, and instrumental variables.

The main advantage of using DoWhy is that it allows researchers and data scientists to perform causal inference without having to manually specify and estimate causal models, which can be a time-consuming and error-prone process. By automating many of the steps involved in causal inference, DoWhy can help to make causal analysis more accessible to a wider range of users and enable more accurate and reliable causal inference.

Let us use the DoWhy library to calculate the causal effect for our dataset.

We create a causal model using the DoWhy library. The model uses the dataset smoking_francedata_copy, with concept_id_french as the treatment variable and total_yr_deaths_FRANCE as the outcome variable. The variable level_of_cause is included as a common cause of the treatment and outcome variables. No instrumental variable is included in this model. The resulting model object can be used to estimate the causal effect of the treatment on the outcome using various methods provided by the DoWhy library.

We then identify the causal effect using the identify_effect method of the CausalModel class from the DoWhy library. The proceed_when_unidentifiable=True argument tells the method to continue without stopping to ask for confirmation even when unobserved confounders might be present, which in effect assumes that any unobserved confounding can be ignored. This is convenient for exploring the data, but it is an assumption rather than a guarantee. Once the causal effect has been identified, we estimate it using the estimate_effect method with the identified estimand and the backdoor.linear_regression method. This method estimates the causal effect using linear regression with the backdoor adjustment technique.
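
The steps described above can be sketched as follows. This is an illustration rather than the exact notebook code: the DataFrame is synthetic (random numbers with a true effect of 2.0 built in), it only borrows the column names from the text, and it will not reproduce the figures reported for the France data. The helper name estimate_effect_with_dowhy is made up for this sketch, and the guarded import simply lets the script load where DoWhy is not installed.

```python
import numpy as np
import pandas as pd

try:
    from dowhy import CausalModel
except ImportError:
    CausalModel = None  # DoWhy is not installed; `pip install dowhy` to run the sketch

# Synthetic stand-in data (NOT the France dataset). The common cause drives
# both the treatment and the outcome; the true effect of the treatment is 2.0.
rng = np.random.default_rng(0)
n = 2_000
level_of_cause = rng.normal(size=n)
concept_id_french = 0.5 * level_of_cause + rng.normal(size=n)
total_yr_deaths = 2.0 * concept_id_french + 3.0 * level_of_cause + rng.normal(size=n)

synthetic_df = pd.DataFrame({
    "concept_id_french": concept_id_french,
    "level_of_cause": level_of_cause,
    "total_yr_deaths_FRANCE": total_yr_deaths,
})


def estimate_effect_with_dowhy(df):
    """Build the model, identify the effect, and estimate it by linear regression."""
    model = CausalModel(
        data=df,
        treatment="concept_id_french",
        outcome="total_yr_deaths_FRANCE",
        common_causes=["level_of_cause"],
    )
    estimand = model.identify_effect(proceed_when_unidentifiable=True)
    estimate = model.estimate_effect(
        estimand, method_name="backdoor.linear_regression"
    )
    return estimate.value


if CausalModel is not None:
    print(estimate_effect_with_dowhy(synthetic_df))
```

On this synthetic data the printed value should land near the built-in effect of 2.0, since level_of_cause is the only confounder and it is included as a common cause.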

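The backdoor criterion states that, in order to identify the causal effect of a treatment on an outcome, all backdoor paths between the treatment and the outcome must be blocked. A backdoor path is a non-causal path between the treatment and the outcome that begins with an arrow pointing into the treatment, typically because it runs through a common cause (confounder) of the two. Conditioning on such a confounder, here level_of_cause, blocks the path.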
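
For the three-variable graph described in this article, the backdoor paths can be listed with a few lines of plain Python. The edge list below is an assumption that encodes the model as the text describes it (level_of_cause as a common cause of treatment and outcome), and backdoor_paths is a hypothetical helper written for this sketch, not a DoWhy function.

```python
# Directed edges (cause, effect) of the graph described in the text.
EDGES = [
    ("level_of_cause", "concept_id_french"),
    ("level_of_cause", "total_yr_deaths_FRANCE"),
    ("concept_id_french", "total_yr_deaths_FRANCE"),
]


def backdoor_paths(edges, treatment, outcome):
    """List simple paths from treatment to outcome, ignoring edge direction,
    whose first edge points INTO the treatment."""
    neighbors = {}
    for cause, effect in edges:
        neighbors.setdefault(cause, set()).add(effect)
        neighbors.setdefault(effect, set()).add(cause)
    parents_of_treatment = {cause for cause, effect in edges if effect == treatment}

    paths = []

    def walk(path):
        node = path[-1]
        if node == outcome:
            paths.append(path)
            return
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in path:  # keep paths simple (no repeated nodes)
                walk(path + [nxt])

    for parent in sorted(parents_of_treatment):
        walk([treatment, parent])
    return paths


found = backdoor_paths(EDGES, "concept_id_french", "total_yr_deaths_FRANCE")
for path in found:
    print(" - ".join(path))
```

Here the only backdoor path runs through level_of_cause, which sits on it as a common cause, so adjusting for level_of_cause blocks it. The helper only lists candidate paths; deciding whether an arbitrary path is blocked also requires handling colliders, which this sketch leaves out.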

This output is the result of a causal inference analysis using the backdoor linear regression method to estimate the causal effect of the concept of smoking (concept_id_french) on the total number of deaths in France (total_yr_deaths_FRANCE).

The identified estimand indicates that the analysis is estimating the non-parametric average treatment effect (ATE), which is the difference between the mean outcome the population would have if everyone received the treatment and the mean outcome it would have if no one did.
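
In symbols, and assuming for simplicity a binary treatment T, an outcome Y, and a set of observed common causes W (here level_of_cause) that satisfies the backdoor criterion, the ATE and the backdoor adjustment that identifies it can be written as below. The do(·) notation marks an intervention that sets the treatment, as opposed to passively observing it.

```latex
\begin{align}
\mathrm{ATE} &= \mathbb{E}[\,Y \mid do(T=1)\,] - \mathbb{E}[\,Y \mid do(T=0)\,] \\
\mathbb{E}[\,Y \mid do(T=t)\,] &= \sum_{w} \mathbb{E}[\,Y \mid T=t,\, W=w\,]\; P(W=w)
\end{align}
```

The second line is what turns a causal quantity into something that can be computed from observed data: the outcome is averaged within each level of W, and those averages are then weighted by how common each level is.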

The realized estimand shows the model specification used to estimate the causal effect, which includes the covariates concept_id_french and level_of_cause in a linear regression model (total_yr_deaths_FRANCE~concept_id_french+level_of_cause).
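
A formula of the form outcome~treatment+common_cause describes an ordinary least squares fit with an intercept, and for a one-unit change in the treatment the estimate corresponds to the coefficient on the treatment column. As a rough sketch on synthetic data (the values are invented and are not the France dataset; the short names t, w, and y are placeholders for the treatment, the common cause, and the outcome), the same kind of fit can be done with NumPy:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Synthetic stand-ins: w plays the role of level_of_cause, t of
# concept_id_french, y of total_yr_deaths_FRANCE.
# The true coefficient on t is set to 2.0.
w = rng.normal(size=n)
t = 0.8 * w + rng.normal(size=n)
y = 2.0 * t + 3.0 * w + rng.normal(size=n)


def ols(columns, target):
    """Least-squares coefficients for target ~ 1 + columns (intercept first)."""
    design = np.column_stack([np.ones(len(target)), *columns])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


coef_without_w = ols([t], y)[1]   # y ~ t      (common cause left out)
coef_with_w = ols([t, w], y)[1]   # y ~ t + w  (common cause included)

print(f"y ~ t     : coefficient on t = {coef_without_w:.2f}")
print(f"y ~ t + w : coefficient on t = {coef_with_w:.2f}")
```

Leaving w out of the regression inflates the coefficient on t, while including it brings the coefficient back close to 2.0. That is the role level_of_cause plays in the formula above.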

The target units are reported as ATE, meaning the effect is averaged over all units in the data rather than over a subgroup such as only the treated. The estimate therefore represents the average causal effect of smoking on the total number of deaths in France across all individuals in the population.

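The estimate value is 11379.99999999997, or about 11,380 once floating-point noise is rounded away. Read at face value, this means that smoking is estimated to cause an average increase of approximately 11,380 deaths per year in France. That reading holds only if the model's assumptions hold, in particular that level_of_cause accounts for all confounding between the treatment and the outcome.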

MIT License

All code in this notebook is available as open source through the MIT license.

All text and images are free to use under the Creative Commons Attribution 3.0 license. https://creativecommons.org/licenses/by/3.0/us/

These licenses let people distribute, remix, tweak, and build upon the work, even commercially, as long as they give credit for the original creation.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.