Cardiovascular Risk Factors — Gaining Insight from a Kaggle Dataset With Python

Olusola Agbana
9 min read · Dec 21, 2023


Since my previous (and first) attempt at exploratory data analysis, I have gone through other people's work on that same dataset, and all I have to say is this: I have a long way to go. However, I am undeterred, as I have realized that it takes practice and consistency to master any craft.

This time, I have decided to work on a dataset from my own field of study, healthcare: specifically, cardiovascular disease. I believe my domain knowledge will help me produce a better analysis and draw interesting, meaningful insights from the dataset. Happy reading.

Background

The importance of artificial intelligence (AI) in healthcare cannot be overemphasized. AI methods such as machine learning (ML), deep learning and natural language processing (NLP) have been applied in research to develop algorithms for managing health conditions and to build models for classifying diseases and predicting disease outcomes.

One very practical example is the development of cardiovascular risk assessment tools. Several scientific organizations and research groups have developed such tools to estimate a patient's risk of developing cardiovascular disease based on known risk factors.

Before an accurate model can be developed, exploratory data analysis must be performed: it helps prepare the data, understand it better, test the model's basic assumptions, and gauge the strength of the relationships between the variables.

For this study, I am performing an exploratory data analysis on a Kaggle dataset on cardiovascular disease, with the aim of discovering meaningful insights and, hopefully, building a model on the dataset in the near future.

Problem Statement

Studies have identified several risk factors associated with cardiovascular disease, for example hypertension, smoking and obesity. However, to develop a meaningful model, the strength of the association between these variables must be determined, and this varies across populations due to genetic and environmental factors. Hence, population-specific studies must be performed to build models that can accurately predict disease for a given population.

Although this is a real dataset, this is practice work, and little is known about the conditions under which the data was collected or the cut-offs used to classify the various variables.

Objectives

  1. To understand the variables in this dataset and their statistical properties.
  2. To determine the prevalence of hypertension, classify it by severity and by type (systolic, diastolic or both), and examine its association with the presence of cardiovascular disease.
  3. To determine the prevalence of cardiovascular disease and identify other factors associated with its presence.

Data Cleaning

First, I imported the tools needed for the analysis and loaded the dataset.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
df = pd.read_csv('health_data.csv')
df.info()
The dataset had 14 variables and 70,000 observations.

From the info output, it was obvious there was no missing data. However, I had to see how many unique values each categorical variable had, visualize the distribution of some of the variables to decide which statistical tools to use for cleaning, and finally drop some columns that were not useful for my study objectives.

df.nunique()
The categorical variables and their numbers of unique values were: gender (2), cholesterol level (3), glucose level (3), smoking (2), alcohol intake (2), level of physical activity (2) and cardiovascular disease (2). Except for gender, which was coded male or female, the variables with 2 unique values coded for yes/no, while those with 3 unique values coded for level (normal, high and very high).
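To confirm exactly how these categories are coded, one could also print the unique values of each column; a quick sketch (column names as they appear in the dataset):

for col in ['gender', 'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio']:
    print(col, ':', sorted(df[col].unique()))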

I dropped the first two columns as they were not helpful in my analysis and seemed to be index numbers.

df.drop(['Unnamed: 0', 'id'], axis=1, inplace=True)

From the statistical description and some visualization (histogram with normal curve) of the dataset, I could observe that the numerical variables were mostly normally distributed. However, I noticed some outliers and incorrect inputs.

Distribution of weight: approximately normal, though skewed with outliers
df.describe().T
The minimum systolic and diastolic blood pressures were negative values, which is impossible, and the maximum values ran into tens of thousands of mmHg. Height and weight had similar problems. I also noticed that age seemed to be recorded in days, which is harder to relate to.
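The plotting code behind the histogram-with-normal-curve check is not shown above; a minimal sketch of how it might look for weight, overlaying a fitted normal density with scipy.stats.norm (my assumption, not necessarily how the original figure was produced):

from scipy import stats

plt.figure(figsize=(8, 4))
df['weight'].hist(bins=50, density=True, grid=False)
mu, sigma = df['weight'].mean(), df['weight'].std()
x = np.linspace(df['weight'].min(), df['weight'].max(), 200)
plt.plot(x, stats.norm.pdf(x, mu, sigma), 'r-', label='fitted normal curve')
plt.xlabel('weight')
plt.ylabel('density')
plt.legend()
plt.show()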

Feature Engineering

My domain knowledge in healthcare helped greatly in dealing with outliers and incorrect inputs.

To eliminate outliers, I used the mean and standard deviation, since most variables were normally distributed. However, I used quantiles to get rid of incorrect inputs and outliers for systolic and diastolic blood pressure, as their distributions were greatly skewed.

ht_std = df.height.std()
wt_std = df.weight.std()
ht_mean = df.height.mean()
wt_mean = df.weight.mean()

ht_lo = ht_mean - 3*ht_std
ht_hi = ht_mean + 3*ht_std

wt_lo = wt_mean - 3*wt_std
wt_hi = wt_mean + 3*wt_std

sys_lo = df.ap_hi.quantile(0.01)
sys_hi = df.ap_hi.quantile(0.99)
dia_lo = df.ap_lo.quantile(0.02)
dia_hi = df.ap_lo.quantile(0.98)

# helper to flag rows that fall strictly within the given borders
def outlier(data, column, lower_border, upper_border):
    result = (data[column] > lower_border) & (data[column] < upper_border)
    return result

ht_df = outlier(df, 'height', ht_lo, ht_hi)
wt_df = outlier(df, 'weight', wt_lo, wt_hi)
sys_df = outlier(df, 'ap_hi', sys_lo, sys_hi)
dia_df = outlier(df, 'ap_lo', dia_lo, dia_hi)

This time I made heavy use of functions, which was very helpful, as I did not have to rewrite the same lines of code multiple times. This is something I failed to do in my last attempt at EDA and picked up from going through the work of other data analysts.

# Combine the masks and keep only the rows that satisfy all the conditions above
df_logic = ht_df & wt_df & sys_df & dia_df
df = df[df_logic]

Now the data looks much better, and the outliers are gone, at least to a reasonable extent.
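As a quick sanity check (not part of the original write-up), the combined mask can also report how many rows the filter dropped:

# df_logic was built on the full dataframe, so counting its values shows
# how many rows survived the filter and how many were flagged
print(f'Rows kept: {df_logic.sum()}, rows removed: {(~df_logic).sum()}')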

Then I went ahead and created some new variables based on the old ones: age in years, age categories (in bins of ten years), body mass index (BMI), BMI categories (underweight, normal, overweight and obese), presence of hypertension (0 = no, 1 = yes), and lastly hypertension type and hypertension severity.

# Add a column converting age in days to years, which is easier to relate to
df['age_y'] = df.age//365.25

# a new column for age categories in bins of 10 years
df['age_recoded'] = pd.cut(df.age_y,
                           bins=[20, 30, 40, 50, 60, 70],
                           labels=[1, 2, 3, 4, 5])

# bmi
df['bmi'] = df.weight/(df.height/100)**2

#BMI categories
df['bmi_cat'] = pd.cut(x=df.bmi,
                       bins=[0, 18.5, 25.0, 30.0, 60.0],
                       labels=['underweight', 'normal', 'overweight', 'obese'])

# presence of hypertension
hyp = (df.ap_hi >= 140) | (df.ap_lo >= 90)  # i.e. systolic >= 140 mmHg or diastolic >= 90 mmHg
df['hypertension'] = hyp.astype(int)  # convert True/False to 1/0

# Hypertension type
dia_hyp = ~(df.ap_hi>=140) & (df.ap_lo>=90)
sys_hyp = (df.ap_hi>=140) & ~(df.ap_lo>=90)
both = (df.ap_hi>=140) & (df.ap_lo>=90)
df['hypertension_type'] = np.nan
df.loc[dia_hyp, 'hypertension_type'] = 'isolated diastolic'
df.loc[sys_hyp, 'hypertension_type'] = 'isolated systolic'
df.loc[both, 'hypertension_type'] = 'both'

# Hypertension severity (grade1 and grade2 are defined first, since
# pre_hypertension excludes them)
grade1 = ((df.ap_hi >= 140) | (df.ap_lo >= 90)) & ~((df.ap_hi >= 160) | (df.ap_lo >= 100))
grade2 = (df.ap_hi >= 160) | (df.ap_lo >= 100)
pre_hypertension = ((df.ap_hi >= 130) | (df.ap_lo >= 85)) & ~grade1 & ~grade2
df['hypertension_severity'] = np.nan
df.loc[pre_hypertension, 'hypertension_severity'] = 0
df.loc[grade1, 'hypertension_severity'] = 1
df.loc[grade2, 'hypertension_severity'] = 2
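As an aside, the same three-way classification could be written as a single np.select call (same thresholds, purely a stylistic alternative to the .loc assignments above):

# conditions are checked in order; rows matching none of them stay NaN
df['hypertension_severity'] = np.select(
    [grade2, grade1, pre_hypertension],
    [2, 1, 0],
    default=np.nan)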
# Let's take a look at the new columns:

df[['age', 'age_recoded', 'bmi', 'bmi_cat', 'hypertension', 'hypertension_type', 'hypertension_severity']][df.ap_hi>=130].sample(10)
# I added a condition of systolic BP >= 130 mmHg to limit the number of missing values (NaNs) in the hypertension type and severity columns among the sampled rows.

Now that our data is ready for exploratory data analysis, let’s dive in:

Univariate Analysis

For numerical variables such as age (in years), BMI, and diastolic and systolic BP, I will determine statistical properties such as the mean ± standard deviation (SD), as measures of central tendency and dispersion, as well as the minimum and maximum values. I will also visualize them using histograms and boxplots.

For categorical variables, I will determine frequencies and proportions. Visualization will be with the aid of pie charts and/or bar charts, as appropriate.

Numerical Variables

num_cols = ['age_y', 'bmi', 'ap_hi', 'ap_lo']
col_dict = {'age_y': 'Age in Years',
            'bmi': 'Body Mass Index',
            'ap_hi': 'Systolic BP',
            'ap_lo': 'Diastolic BP'}

for col in num_cols:
    print(col_dict[col], ':')
    print('Min:', df[col].min(),
          '\nMax:', df[col].max(),
          '\nMean ± SD:', round(df[col].mean(), 2), '±', round(df[col].std(), 2))
    plt.figure(figsize=(15, 4))
    plt.subplot(1, 2, 1)
    df[col].hist(grid=False)
    plt.ylabel('count')
    plt.subplot(1, 2, 2)
    sns.boxplot(x=df[col])
    plt.show()
Age in Years:
Min: 29.0, Max: 64.0, Mean ± SD: 52.89 ± 6.73

Body Mass Index:
Min: 12.25, Max: 58.02, Mean ± SD: 27.31 ± 4.82
The BMI has a lot of outliers at the upper end, including some morbidly obese individuals.

Systolic BP:
Min: 93.0, Max: 179.0, Mean ± SD: 126.66 ± 14.29
Systolic BP is greatly skewed; the median equals the 25th percentile.

Diastolic BP:
Min: 61.0, Max: 109.0, Mean ± SD: 81.78 ± 7.71
Diastolic BP is greatly skewed; the median equals the 25th percentile.

Categorical Variables

Most of the observations were of middle-aged individuals.
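The code behind the categorical frequency charts is not shown above; a minimal sketch of how the counts, proportions and bar charts might be produced (column names as defined earlier):

cat_cols = ['age_recoded', 'gender', 'bmi_cat', 'cholesterol', 'gluc', 'smoke', 'alco', 'active']
for col in cat_cols:
    freq = df[col].value_counts()
    prop = df[col].value_counts(normalize=True).round(3)
    print(pd.concat([freq, prop], axis=1, keys=['count', 'proportion']), '\n')
    freq.plot.bar(title=col, color='lightcoral')
    plt.ylabel('count')
    plt.show()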

Determining the prevalence of hypertension and cardiovascular disease:

print('Prevalence of Hypertension:', round(df.hypertension.mean(), 2))
print('Prevalence of Cardiovascular Diseases:', round(df.cardio.mean(), 2))
Prevalence of Hypertension: 0.34
Prevalence of Cardiovascular Diseases: 0.5

The dataset actually contained an equal number of people with and without cardiovascular diseases.

Bivariate Analysis

For the bivariate analysis, I used the chi-square test to determine the significance of the relationship between my independent variables and the dependent variable (cardiovascular disease). The independent variables include age, gender, cholesterol level, glucose level, alcohol intake, smoking, level of physical activity and hypertension. The significance threshold was set at a p-value of < 0.05. I added a visual aid: bar charts showing the proportion of observations with cardiovascular disease in the various categories of each variable.
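The plotting and chi-square snippets below reference a dictionary named index_dict that is not defined in the code shown here; it is presumably just a mapping from column names to display labels, along these lines (the labels are my guesses):

index_dict = {'age_recoded': 'Age Category',
              'gender': 'Gender',
              'bmi_cat': 'BMI Category',
              'cholesterol': 'Cholesterol Level',
              'gluc': 'Glucose Level',
              'smoke': 'Smoking',
              'alco': 'Alcohol Intake',
              'active': 'Physical Activity',
              'hypertension': 'Hypertension',
              'hypertension_type': 'Hypertension Type',
              'hypertension_severity': 'Hypertension Severity'}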

fig, axes = plt.subplots(6, 2, figsize=(30, 70))
fig.suptitle('Bivariate Analysis for Categorical Variables', fontsize=50)
sns.set(font_scale=2)
x, y = 0, 0

for variable in ['age_recoded', 'gender', 'bmi_cat', 'cholesterol', 'gluc', 'smoke', 'alco',
                 'active', 'hypertension', 'hypertension_type', 'hypertension_severity']:
    # proportion of observations with cardiovascular disease in each category of the variable
    df.groupby(variable)['cardio'].mean().sort_values(ascending=False).plot.bar(
        ax=axes[y][x], fontsize=25, color='lightcoral')
    axes[y][x].set_title(f'{index_dict[variable]} Vs Cardiovascular Disease', fontsize=30)
    # fill each subplot row left to right, then move down to the next row
    if x == 0:
        x = 1
    else:
        x = 0
        y += 1

axes[1][0].tick_params(labelrotation=15);
axes[4][1].tick_params(labelrotation=15);
plt.subplots_adjust(hspace=0.5)
plt.subplots_adjust(wspace=0.25)
sns.despine()
Bar charts showing the proportion of individuals with cardiovascular disease in the different variable categories
from scipy import stats

# function to return the observed counts (crosstab of the variable against cardio)
def chi_square(row, col, data):
    cross_tab = pd.crosstab(data[row], data[col])
    value, p_value, dof, expected_count = stats.chi2_contingency(cross_tab)
    return cross_tab

# function to return the variable label, p-value and chi-square statistic
def chi_square1(row, col, data):
    cross_tab = pd.crosstab(data[row], data[col])
    value, p_value, dof, expected_count = stats.chi2_contingency(cross_tab)
    if p_value < 0.01:
        return (index_dict[row], '< 0.01', value, '\n\n')
    else:
        return (index_dict[row], round(p_value, 2), value, '\n\n')

# function to return the observed counts normalized by row (proportions)
def chi_square2(row, col, data):
    cross_tab = pd.crosstab(data[row], data[col], normalize='index')
    value, p_value, dof, expected_count = stats.chi2_contingency(cross_tab)
    return cross_tab

for row in ['age_recoded', 'gender', 'bmi_cat', 'cholesterol', 'gluc', 'smoke', 'alco',
            'active', 'hypertension', 'hypertension_type', 'hypertension_severity']:
    print(chi_square(row, 'cardio', df), '\n',
          chi_square2(row, 'cardio', df),
          chi_square1(row, 'cardio', df))
The outputs were compiled into a table using Microsoft Word. Of the 11 variables tested, only gender was not significantly associated with the presence or absence of cardiovascular disease.
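An alternative to transcribing the printed output by hand would be to collect the chi-square statistic and p-value for each variable into a small summary DataFrame; a sketch reusing the same crosstabs (variable list and index_dict as above):

results = []
for row in ['age_recoded', 'gender', 'bmi_cat', 'cholesterol', 'gluc', 'smoke', 'alco',
            'active', 'hypertension', 'hypertension_type', 'hypertension_severity']:
    cross_tab = pd.crosstab(df[row], df['cardio'])
    value, p_value, dof, expected = stats.chi2_contingency(cross_tab)
    results.append({'variable': index_dict[row],
                    'chi_square': round(value, 2),
                    'p_value': round(p_value, 4)})
summary = pd.DataFrame(results).sort_values('p_value')
print(summary)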

I also went ahead to determine the correlation coefficients for the statistically significant relationships; these were visualized with the aid of a heatmap.
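The heatmap code itself is not included above; a minimal sketch of how such a correlation matrix and heatmap might be produced (Pearson correlation on the numerically coded columns, with my own choice of columns):

corr_cols = ['age_y', 'bmi', 'ap_hi', 'ap_lo', 'cholesterol', 'gluc',
             'smoke', 'alco', 'active', 'hypertension', 'cardio']
corr_matrix = df[corr_cols].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.title('Correlation Between Risk Factors and Cardiovascular Disease')
plt.show()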

There is a moderate positive correlation between hypertension and cardiovascular disease, and systolic blood pressure clearly contributes more to the risk than diastolic blood pressure.

Age and cholesterol level each show a weak correlation with cardiovascular disease, and body mass index shows a very weak correlation.

The other variables showed very weak to negligible correlations. Interestingly, the correlation was negative for alcohol intake and smoking, even though the relationship is almost negligible.

Summary

  1. Most of the continuous variables were normally distributed, except for diastolic and systolic blood pressure, which were significantly skewed.
  2. Most of the observations in the dataset were of middle-aged individuals.
  3. The prevalence of hypertension was about 34%, and hypertension was the variable most strongly associated with cardiovascular disease.
  4. The dataset was balanced, containing equal numbers of individuals with and without cardiovascular disease.
  5. All the independent variables except gender showed a statistically significant (p-value < 0.05) association with cardiovascular disease.
  6. Of the statistically significant associations, only hypertension showed a moderate degree of correlation with cardiovascular disease, with systolic blood pressure contributing more to the risk than diastolic blood pressure.
  7. Cholesterol level, age and BMI showed weak correlations with cardiovascular disease; the other variables showed very weak to negligible correlations.

Source of the Dataset

https://www.kaggle.com/datasets/akshatshaw7/cardiovascular-disease-dataset


Olusola Agbana

Medical student 🩺| Data science enthusiast | I love it when health and statistics come together. I am also interested in Public Health.