Crash course in Causality Worked Examples — Pima Indians Diabetics Dataset Causal Inference using OLS estimates

Cibaca Khandelwal
AI Skunks
Published in
4 min readApr 28, 2023

In the Pima Indians Diabetes Database, the columns correspond to the following variables:

  • Pregnancies: number of times pregnant
  • Glucose: plasma glucose concentration at 2 hours in an oral glucose tolerance test
  • BloodPressure: diastolic blood pressure (mm Hg)
  • SkinThickness: triceps skin fold thickness (mm)
  • Insulin: 2-Hour serum insulin (mu U/ml)
  • BMI: body mass index (weight in kg/(height in m)^2)
  • DiabetesPedigreeFunction: diabetes pedigree function (a function which scores likelihood of diabetes based on family history)
  • Age: age (years)
  • Outcome: diabetes outcome (0 or 1)

To use the Y, D, X notation with this dataset, we need to choose a treatment variable, an outcome variable, and a set of covariates. For example, we could choose:

  • Treatment variable D: a binary variable indicating whether a patient is prescribed a medication for diabetes.
  • Outcome variable Y: a binary variable indicating whether a patient experiences a diabetes-related complication (such as hospitalization or amputation).
  • Covariates X: a set of variables that might influence both the treatment assignment and the outcome, such as age, BMI, and family history.

To represent these variables in the Y, D, X notation, we would set:

  • Y: the Outcome column
  • D: a new column that we create based on the Insulin column (for example, we might choose to treat patients with insulin levels above a certain threshold).
  • X: a matrix containing columns for the Pregnancies, Glucose, BloodPressure, SkinThickness, BMI, DiabetesPedigreeFunction, and Age variables.

Example of Casual Inference on PIMA Indian Diabetic Dataset

  1. Install the casual inference library
! pip install causalinference

2. Import libraries, Load and clean the Dataset

import pandas as pd
import numpy as np
from causalinference import CausalModel
# Load the dataseturl = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'df = pd.read_csv(url, header=None)df.columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']

3. Lets see how the data looks like now

df.head()

4. Divide the data into treatment and control group

TREATMENT = 'Insulin'
OUTCOME = 'Outcome'
df.groupby(TREATMENT)[OUTCOME].describe()

5. DefineCasual Model

# Create the treatment variable D based on insulin levels
D = (df['Insulin'] > 100).astype(int)

# Set the outcome variable Y to the 'Outcome' column
Y = df['Outcome']

# Set the covariates X to all columns except for 'Outcome' and 'Insulin'
X = df.drop(['Outcome', 'Insulin'], axis=1)

# Create the causal model and estimate the treatment effect
causal_model = CausalModel(Y, D, X)

6. Run Casual Model

The CausalModel class uses OLS regression to estimate the treatment effect. OLS is a common approach to estimate the causal effect of a treatment in a linear regression framework.

In particular, the est_via_ols() method of the CausalModel class fits a linear regression model of the form:

Y = b0 + b1*D + b2*X2 + ... + bn*Xn + e

where Y is the outcome variable, D is the treatment variable, and X2, ..., Xn are the covariates. The coefficients b1 estimate the treatment effect, and the remaining coefficients estimate the effects of the covariates on the outcome.

The est_via_ols() method uses this linear regression model to estimate the treatment effect by comparing the mean outcomes of the treatment and control groups, after controlling for the covariates. The resulting treatment effect estimate is a weighted average of the coefficients on the treatment variable in the regression model.

It is important to note that OLS regression assumes linearity, independence, and homoscedasticity of the error terms, among other assumptions. Therefore, the validity of the causal inference results depends on the accuracy of these assumptions. If these assumptions are not met, alternative methods, such as propensity score matching or instrumental variables regression, may be more appropriate.

causal_model.est_via_ols(adj=1)
print(causal_model.summary_stats)
print(causal_model.estimates)

In this example, we create the treatment variable D based on whether a patient's Insulin level is above 100. We set the outcome variable Y to the Outcome column and the covariates X to all other columns except for Insulin and Outcome. We then create a CausalModel object and estimate the treatment effect using ordinary least squares (OLS) regression. Finally, we print the summary statistics and treatment effect estimates.

Note that this is just one possible way to implement a causal inference model using the Pima Indians Diabetes Database. The specific choice of variables and their roles in the causal model will depend on the research question of interest.

Conclusion

The output of the causal model will include various summary statistics, such as the means and standard deviations of the treatment and control groups, as well as the treatment effect estimate and its standard error.

The treatment effect estimate represents the causal effect of the insulin levels on the diabetes outcome. A positive treatment effect indicates that higher insulin levels are associated with an increased risk of diabetes, while a negative treatment effect indicates the opposite.

Based on the output of the causal model, we can draw conclusions about the direction and significance of the treatment effect. However, it is important to note that causal inference is always subject to various assumptions and limitations, and therefore the conclusions should be interpreted with caution and in the context of the research question and data at hand.

--

--

Cibaca Khandelwal
AI Skunks

Tech enthusiast at the nexus of Cloud ☁️, Software 💻, and Machine Learning 🤖, shaping innovation through code and algorithms.