Beyond Correlation: Causal Inference

Henok Tilaye
7 min read · Jul 2, 2022


Causal Graph from features in the Breast Cancer dataset

Literature review

What is causal inference?

Causal inference is the process of answering “why” questions: determining the reasons a variable takes the value it does, or how it responds when manipulated. This relation is far stronger than correlation. Correlation is merely the co-evolution of variables over time; for causality, co-evolution alone is not enough.

Let’s take the following example: a school wants to increase its students’ exam grades, and since we are in the age of technology, it decides to provide each student with a tablet computer. Just looking at the grades of students with and without tablets, we might see an increase. But that doesn’t mean the increase is caused by the tablets the students were given.

There is one more variable we are not considering here: schools that can afford to buy every student a tablet are also financially strong enough to provide all the required facilities and hire the best teachers. This is called a confounding variable: a third variable that affects the first two (students’ grades and tablet ownership). To make this experiment legitimate, we need to hold this and all other confounding variables constant.

Methodologies

Generally speaking, we can classify methodologies related to causal inference into three categories:

  • Experiment/Field study/Randomized-Controlled Trials: Actively divide subjects into control and treatment groups randomly to evaluate the causal link between treatment and outcome of interest. Randomness is key to ensure trustworthy results.
  • Quasi-Experiment: Causal inference based on observational data. Since any randomness in the data is incidental rather than controlled, researchers need to choose methodologies carefully, each resting on specific assumptions.
  • Natural Experiment: Empirical studies that assign subjects to treatment and control groups by taking advantage of natural events, like lottery draws, month of birth, immigration law, etc. Since the randomness mainly comes from natural events that have already happened, researchers can analyze the causal link with observational data.

Overview of the data source and formats

The data provided is from a researcher on Breast Cancer Diagnosis and Prognosis Via Linear Programming. I downloaded the dataset from Kaggle. The features in the columns are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

Attribute Information:

1. ID number

2. Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

a. radius (mean of distances from center to points on the perimeter)

b. texture (standard deviation of gray-scale values)

c. perimeter

d. area

e. smoothness (local variation in radius lengths)

f. compactness (perimeter² / area − 1.0)

g. concavity (severity of concave portions of the contour)

h. concave points (number of concave portions of the contour)

i. symmetry

j. fractal dimension (“coastline approximation” − 1)

The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius. All feature values are recorded with four significant digits.
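Incidentally, this is the same Wisconsin Diagnostic dataset that ships with scikit-learn, so a quick way to load and inspect the 30 features (assuming scikit-learn is installed, rather than using the Kaggle CSV):

```python
from sklearn.datasets import load_breast_cancer

# Load the Wisconsin Diagnostic Breast Cancer data as a pandas DataFrame:
# 569 images, 30 real-valued features each, plus the "target" column.
data = load_breast_cancer(as_frame=True)
df = data.frame

print(df.shape)                      # (569, 31)
print(list(data.feature_names[:3]))  # ['mean radius', 'mean texture', 'mean perimeter']
```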

The data had a column consisting entirely of null values. It also has an ID column that doesn’t contribute valuable information to the problem at hand, so I removed these two columns. Afterwards, there were no null values in the data. The next thing I did was to look at the distribution of the diagnosis in the dataset.
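The cleaning step can be sketched with pandas. The column names (`id`, `Unnamed: 32`) follow the Kaggle CSV layout and are assumptions here, with a toy frame standing in for the real file:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the Kaggle CSV layout (column names are assumptions)
df = pd.DataFrame({
    "id": [842302, 842517],
    "diagnosis": ["M", "B"],
    "radius_mean": [17.99, 20.57],
    "Unnamed: 32": [np.nan, np.nan],   # the all-null column
})

# Drop the all-null column and the uninformative ID column
df = df.dropna(axis=1, how="all").drop(columns=["id"])
assert df.isna().sum().sum() == 0      # no nulls remain

# Class distribution of the diagnosis
print(df["diagnosis"].value_counts().to_dict())
```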

Next up was bivariate and multivariate analysis of the different variables. For the bivariate analysis, I used pair plots; since the number of columns was so large, I had to create the plots in groups. For the multivariate analysis, I calculated the correlations of the variables.

But since we are not interested in which variable is correlated with which, I removed one variable from every pair with an absolute correlation greater than 0.90. Another reason for doing this is to reduce the number of columns for computational efficiency. And by keeping one variable from each correlated pair, I ensure that not much information is lost.
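A minimal sketch of that pruning step (the function name is mine, not from the original code):

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.90):
    """Keep one column from every pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Upper triangle of the correlation matrix, so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Tiny demo: a and b are near-duplicates, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=100)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=100),
    "c": rng.normal(size=100),
})
print(drop_highly_correlated(demo).columns.tolist())  # ['a', 'c']
```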

This is the correlation heat map for the variables I am left with.

Up to now, I have removed columns that don’t contribute much to the analysis, and I am left with 16 of the original 32 columns. From here on, I will use this version of the dataset.

Techniques used to perform causal inference

Common frameworks for causal inference include the causal pie model (component-cause), Pearl’s structural causal model (causal diagram + do-calculus), structural equation modeling, and Rubin causal model (potential-outcome), which are often used in areas such as social sciences and epidemiology.

The causal pie model is used in epidemiology to figure out the causes of different diseases. As its name suggests, it uses pie diagrams: each complete pie represents a theoretical causal mechanism sufficient to produce the outcome, called a sufficient cause. The slices that make up a pie are the contributing factors, known as component causes. A component cause that appears in every pie is called a necessary cause, as the outcome cannot occur without it.

Pearl’s structural causal model uses a causal diagram to visualize causal relationships. The diagram is a directed acyclic graph (DAG) whose vertices are events (causes/effects) and whose directed edges depict the causal relationships between them. Such a graph can be drawn by hand using expert knowledge of the phenomenon under discussion, and there are also algorithms for learning it from the available data.
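To make the idea concrete, the tablet example from earlier can be encoded as such a DAG; I use networkx here purely for illustration (the project itself uses CausalNex):

```python
import networkx as nx

# The tablet example as a causal diagram: school funding confounds
# both tablet ownership and exam grades.
g = nx.DiGraph()
g.add_edges_from([
    ("school_funding", "tablets"),
    ("school_funding", "grades"),
    ("tablets", "grades"),
])

assert nx.is_directed_acyclic_graph(g)  # a causal diagram must be a DAG
print(sorted(g.predecessors("grades")))  # direct causes of grades
```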

For this project, I used a causal graph to visualize the causal relationships between the variables described above. I used the CausalNex library to learn the graph structure from the data. Had a domain expert been available, I could have created the graph manually based on their opinion; since there is none for now, I generated it from the data.

I first split the data into training and testing sets. Then I used versions of the training set to learn the causal-graph structure. I started with all the columns to get a baseline graph. Then I used slices of the training data to check the stability of the baseline graph: take gradually increasing portions of the training data, generate a graph from each portion, and check its similarity with the baseline graph. Similarity is calculated as the Jaccard index, or Intersection over Union (IoU), of the graph edges. The following is the baseline causal-graph.

Baseline Causal-Graph with all columns of training data.

The portion sizes I used were 40% and 70%. Calculating the IoU between each pair of graphs gave the following.

IoU between the causal-graphs
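The edge-set Jaccard index is simple to compute; here is a small helper (the names are mine, not from the original code):

```python
def edge_iou(graph_a_edges, graph_b_edges):
    """Jaccard index (IoU) of two graphs' directed edge sets."""
    a, b = set(graph_a_edges), set(graph_b_edges)
    if not a and not b:
        return 1.0  # two empty graphs are identical
    return len(a & b) / len(a | b)

# Hypothetical edge lists for illustration
baseline = [("radius", "diagnosis"), ("texture", "diagnosis"), ("area", "radius")]
variant  = [("radius", "diagnosis"), ("texture", "diagnosis")]
print(edge_iou(baseline, variant))  # 2 shared edges / 3 total = 0.666...
```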

The above values are not that good. My suspicion is that there are too many columns. So the next step is to keep only the columns that had a direct relation to the “diagnosis” column in all three causal-graphs; these are shown below.

Columns I kept

The graphs generated with only the above columns were cleaner, and they had better IoU values among themselves, though there is still room for improvement. Below is the new baseline generated using the above columns.

New-Baseline causal-graph with a subset of the original columns
IoU values of the causal-graphs with the subset data

The next step was to train a Bayesian Network model using these graphs. But before that, I discretised my columns using the automated discretiser available in the CausalNex library. It creates bins, assigns each value in a column to one, and returns only the index of that bin. For a better explanation, read this blog.
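CausalNex ships its own discretisers; the underlying idea (replace each value by the index of its bin) can be illustrated with pandas’ `qcut` as a stand-in:

```python
import pandas as pd

# A handful of hypothetical radius_mean values
radius = pd.Series([11.4, 13.0, 14.1, 17.99, 20.57, 25.0])

# Quantile binning into 3 buckets: each value becomes its bin index
bins = pd.qcut(radius, q=3, labels=False)
print(bins.tolist())  # [0, 0, 1, 1, 2, 2]
```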

After training the Bayesian models on these new, smaller graphs, I evaluated their performance on the test dataset created earlier. The metric results were very close, which confirms the earlier point about the graphs’ stability.

Metrics for the model with the new baseline causal-graph
Metrics for the model with the new causal-graph using 40% of the data

Conclusion

This project was my first introduction to the world of causality, and as such, I might not have all the concepts right. Feel free to drop me a comment about anything that is not clear; I will update the blog if my understanding changes. You can find all the code on my GitHub.
