Performing Exploratory Data Analysis on Stroke Dataset via Python.

Published in

Geek Culture

Mar 13, 2021

Introduction

A stroke occurs when a blood vessel in the brain ruptures and bleeds, or when there’s a blockage in the blood supply to the brain. The rupture or blockage prevents blood and oxygen from reaching the brain’s tissues.

This post aims to serve the following purposes:

Pinpoint the risk factors for stroke.
Describe techniques to discover whether a variable is a risk factor.
Visualize the findings.
Share insights.

This dataset (link here) is used to predict whether a patient is likely to get a stroke based on the input parameters like gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.

Data Analysis begins…

Step 1

Importing the necessary libraries:

import pandas as pd # For data manipulation
import seaborn as sns # For visualization
import matplotlib.pyplot as plt # For visualization
import warnings # For suppressing warnings
warnings.filterwarnings("ignore")

Step 2

Loading data into a pandas dataframe and printing out its sample.

# Loading data into a pandas dataframe and printing out its sample.
strokes_data = pd.read_csv("healthcare-dataset-stroke-data.csv")
strokes_data.head(10)

Just by looking at the sample of the dataset, we can figure out the columns and the type of data that they contain.

Observation:

The id column is a unique identifier.
The dataset contains both categorical and numerical columns.

Categorical columns:

gender: Gender of the patient.
hypertension: whether the patient suffers from hypertension (1) or not (0).
heart_disease: whether the patient suffers from heart disease (1) or not (0).
ever_married: marital status of the patient if married (Yes) else (No).
work_type: The type of occupation of the patient.
Residence_type: The type of residence of the patient.
smoking_status: The smoking habits of the patient (if known).

Numerical columns:

age: Age of the Patient
avg_glucose_level: Average Glucose Level of the patient.
bmi: body mass index of the patient.

Output Column:

stroke: Whether the patient is likely to get a stroke (1) or not (0).
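The split between categorical and numerical columns described above can also be checked programmatically. What follows is a minimal sketch, assuming a small made-up frame (`toy`) that mimics a few columns of the dataset; the values are illustrative and not taken from the real file.

```python
import pandas as pd

# Toy rows that mimic a few of the dataset's columns (values are made up).
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "hypertension": [0, 1, 0, 0],
    "age": [67.0, 61.0, 80.0, 49.0],
    "avg_glucose_level": [228.69, 202.21, 105.92, 171.23],
    "stroke": [1, 1, 0, 0],
})

# Text columns are not numeric, so they are treated as categorical.
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Numeric columns that only ever hold 0 or 1 are flags, i.e. also categorical.
numeric_cols = toy.select_dtypes(include="number").columns
flag_cols = [c for c in numeric_cols if set(toy[c].unique()) <= {0, 1}]

# Whatever is left is genuinely numerical.
continuous_cols = [c for c in numeric_cols if c not in flag_cols]

print(text_cols, flag_cols, continuous_cols)
```

The 0/1 check is a heuristic: it fits this dataset's flag columns, but a numeric column with only two distinct values is not always a category.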

Step 3

Get an idea of the size of the dataset by printing its shape.

# Printing the shape of data to know its size.
strokes_data.shape # 5110 rows and 12 columns

Step 4

Generate the descriptive statistics

We can apply the “describe” command to generate the descriptive statistics of the dataset.

strokes_data.describe()

Figure 2:- Descriptive statistics of the dataset.

Observation:

There are some null values in the bmi column, since its count does not match the total number of rows.
The average age of the patients in the given dataset is 43.
The means of the hypertension and heart_disease columns are far below 0.5. Because these are 0/1 columns, the mean is the fraction of patients with the condition, so only a small share of the patients in the dataset suffer from either.
The average glucose level is around 100, which can be considered healthy as well.
The stroke column has 1 as its max value and 0 as its min, 25th, 50th, and 75th percentile values. That means at least 75% of the values in the column are 0.
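Two of these readings rest on simple properties of `describe()`: the mean of a 0/1 column is the share of rows equal to 1, and the count excludes nulls. A small sketch on made-up values (not the real data) shows both.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "hypertension": [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
    "bmi": [36.6, None, 32.5, 34.4, 24.0, 29.0, None, 22.8, 27.4, 30.1],
})

# For a 0/1 column, the mean is the fraction of rows equal to 1.
share_hypertension = toy["hypertension"].mean()

# describe() counts only non-null values, so a count below len(toy) reveals gaps.
bmi_count = toy["bmi"].describe()["count"]
missing_bmi = toy["bmi"].isna().sum()

print(share_hypertension, bmi_count, missing_bmi)
```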

Step 4

Taking care of NA values

Since there are some NA values in the bmi column, we need to take care of them. One strategy is to simply drop the rows that contain the null values. However, before that, let’s see whether any of those rows has a stroke status of 1.

Figure 3:- Count of rows that have stroke as 0 and 1 respectively when bmi is null.
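The code for this check is not shown in the post. One way it could be done, sketched here on a made-up frame rather than the real data, is to filter the rows where bmi is null and count the stroke values among them.

```python
import pandas as pd

# Made-up rows: three of them have no bmi recorded.
toy = pd.DataFrame({
    "bmi": [36.6, None, 32.5, None, 24.0, None],
    "stroke": [1, 1, 0, 0, 0, 0],
})

# Keep only the rows where bmi is missing, then count each stroke value.
null_bmi_counts = toy.loc[toy["bmi"].isna(), "stroke"].value_counts()
print(null_bmi_counts)
```

On the real dataframe the same expression would be applied to `strokes_data` instead of `toy`.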

Since there are some rows in the dataset where bmi is null but stroke is 1, we will not remove those rows but rather impute the missing bmi values with the mean of the column.

# Impute only the bmi column; mean() skips the null values.
mean_value = strokes_data['bmi'].mean()
strokes_data['bmi'] = strokes_data['bmi'].fillna(mean_value)
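One thing to watch when imputing: calling `fillna` on a whole dataframe with a single value fills the gaps in every column, not just bmi. In this dataset only bmi has nulls, so the outcome is the same either way, but filling the column explicitly is the safer habit. A small sketch with made-up values, where the gap in `avg_glucose_level` is hypothetical and exists only to show the side effect:

```python
import pandas as pd

toy = pd.DataFrame({
    "bmi": [20.0, None, 30.0],
    "avg_glucose_level": [100.0, None, 90.0],  # hypothetical gap for illustration
})
mean_bmi = toy["bmi"].mean()  # nulls are skipped, so this is 25.0

# Filling the whole frame also writes the bmi mean into the glucose column.
filled_everything = toy.fillna(mean_bmi)

# Filling only the bmi column leaves the other column untouched.
filled_bmi_only = toy.copy()
filled_bmi_only["bmi"] = filled_bmi_only["bmi"].fillna(mean_bmi)

print(filled_everything)
print(filled_bmi_only)
```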

Step 5

Visualizing the frequency of output column.

We are going to use a count plot for this. A count plot shows the counts of observations in each categorical bin using bars.

A count plot can be thought of as a histogram across a categorical, instead of quantitative, variable.

# Count of Patients that suffer from stroke along with those that did not.
sns.set(rc={'figure.figsize':(18,10)})
# Pass the column by keyword; recent seaborn versions no longer accept it positionally.
seaborn_plot = sns.countplot(x='stroke', data=strokes_data)
seaborn_plot.set_xlabel("Stroke",fontsize=20)
seaborn_plot.set_ylabel("Count of Patients",fontsize=20)

Observation:

There are 2 outcomes in this dataset: 0 and 1 for the likelihood of getting a stroke.
This is an imbalanced dataset: far fewer patients are likely to get a stroke than patients who are not.
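The imbalance can also be put into numbers rather than read off the bars. A minimal sketch on a made-up series with one positive case in twenty; the real proportions would come from `strokes_data['stroke']`.

```python
import pandas as pd

# Made-up outcome column: 19 patients without a stroke, 1 with.
stroke = pd.Series([0] * 19 + [1], name="stroke")

counts = stroke.value_counts()                # absolute counts per class
shares = stroke.value_counts(normalize=True)  # same counts as fractions of the total

print(counts)
print(shares)
```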

Step 6

Studying the variables individually and gauging their impact on the output column.

print(strokes_data['ever_married'].value_counts())
sns.countplot(x='ever_married', hue='stroke', data=strokes_data)

Figure 5:- Count of patients based on their marital status.

Observation:

Most of the patients in the given dataset are married.
Marital status by itself is not a significant factor in predicting the likelihood of a stroke.
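Because the classes are so imbalanced, raw counts can make it hard to compare groups of different sizes. One possible complement to the count plot is the share of stroke cases within each category. A sketch on made-up rows (not the real data):

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "ever_married": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "stroke":       [1,     0,     0,     1,     0,    0,    1,    0],
})

# The mean of the 0/1 stroke column per group is the stroke rate in that group.
stroke_rate = toy.groupby("ever_married")["stroke"].mean()
print(stroke_rate)
```

The same pattern works for any of the categorical columns by swapping the column name passed to `groupby`.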

print(strokes_data['gender'].value_counts())
sns.set(rc={'figure.figsize':(18,10)})
seaborn_plot = sns.countplot(x='gender', hue='stroke', data=strokes_data)
seaborn_plot.set_xlabel("gender",fontsize=20)
seaborn_plot.set_ylabel("Count of Patients",fontsize=20)

Figure 6:- Count of patients based on their gender status.

Observation:

The number of female patients is higher than the number of male patients in both cases.
The gender variable by itself is not enough to predict the likelihood of getting a stroke.

print(strokes_data['heart_disease'].value_counts())
sns.set(rc={'figure.figsize':(18,10)})
seaborn_plot = sns.countplot(x='heart_disease', hue='stroke', data=strokes_data)
seaborn_plot.set_xlabel("heart_disease",fontsize=20)
seaborn_plot.set_ylabel("Count of Patients",fontsize=20)

Figure 7:- Count of patients based on whether they suffer from heart_disease.

Observation:

The number of patients suffering from heart disease is fairly low when compared with patients who don’t suffer from one.

print(strokes_data['work_type'].value_counts())
sns.countplot(x='work_type', hue='stroke', data=strokes_data)

Figure 8:- Count of patients based on their profession.

Observation:

We don’t have much data on patients who never worked, and the ones we do have never suffered from a stroke.
The children category also does not contain any patient who suffered from a stroke.
The above findings do not imply that patients in these categories can’t suffer from a stroke; they simply mean that we have a small dataset. That said, it is extremely rare for children to suffer from a stroke.
The work_type variable by itself would give us an underfit model.
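To see how thin some categories are, a cross-tabulation with totals puts the group sizes next to the outcome counts. This is a sketch on made-up rows; the category labels are placeholders and the counts say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "work_type": ["Private", "Private", "Private",
                  "Never_worked", "children", "children"],
    "stroke":    [1, 0, 0, 0, 0, 0],
})

# Rows: work type, columns: stroke outcome, plus row and column totals ("All").
table = pd.crosstab(toy["work_type"], toy["stroke"], margins=True)
print(table)
```

A category whose "All" total is very small cannot tell us much, whichever way its stroke counts fall.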

print(strokes_data['Residence_type'].value_counts())
sns.set(rc={'figure.figsize':(18,10)})
seaborn_plot = sns.countplot(x='Residence_type', hue='stroke', data=strokes_data)
seaborn_plot.set_xlabel("Residence_type",fontsize=20)
seaborn_plot.set_ylabel("Count of Patients",fontsize=20)

Figure 9:- Count of patients based on their residence.

Observation:

A comparable number of patients live in Urban and Rural regions.
The Residence_type column is also not good enough by itself.

print(strokes_data['smoking_status'].value_counts())
sns.countplot(x='smoking_status', hue='stroke', data=strokes_data)

Figure 10:- Count of patients based on their smoking habits.

Observation:

smoking_status is also not a good indicator by itself of whether the patient is likely to get a stroke.

print(strokes_data['hypertension'].value_counts())
sns.set(rc={'figure.figsize':(18,10)})
seaborn_plot = sns.countplot(x='hypertension', hue='stroke', data=strokes_data)
seaborn_plot.set_xlabel("hypertension",fontsize=20)
seaborn_plot.set_ylabel("Count of Patients",fontsize=20)

Figure 11:- Count of patients based on whether they have hypertension.

Observation:

Hypertension by itself is also not a good enough variable.

Now that we have analyzed all the categorical variables, let’s look at the correlation of numeric variables.

We can do so with the help of a heatmap. A heatmap is a two-dimensional graphical representation of data where the individual values that are contained in a matrix are represented as colors. The seaborn python package allows the creation of annotated heatmaps.

fig, ax = plt.subplots(figsize=(7, 7))
heatmap = sns.heatmap(strokes_data[['age', 'avg_glucose_level', 'bmi']].corr(), vmax=1, annot=True, ax=ax)
heatmap.set_title('Correlation Heatmap')

Figure 12:- Heatmap of numerical variables present in the dataset.

Observation:

There is some positive correlation between age and bmi.
Overall, all 3 variables are correlated positively with each other. However, age and bmi show a stronger correlation.
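The numbers annotated on the heatmap come from `DataFrame.corr()`, which computes pairwise Pearson correlation by default. A sketch on made-up values shows the shape of that matrix; the coefficients here are not the ones from the real dataset.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "bmi": [21.0, 24.0, 26.0, 27.0, 30.0],
    "avg_glucose_level": [90.0, 85.0, 110.0, 95.0, 120.0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = toy.corr()
print(corr.round(2))
```

The matrix is symmetric with ones on the diagonal, which is why only one triangle of the heatmap carries new information.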

We are going to graph a scatter plot among the numeric variables and use color-coding to find if a combination of the used variables has any impact on the output class.

sns.set_style("whitegrid")
sns.FacetGrid(strokes_data, hue="stroke", height=5).map(plt.scatter, "age", "avg_glucose_level").add_legend()
plt.title('Age vs avg_glucose_level')
plt.show()

Figure 13:- Scatterplot of Age and avg_glucose_level

Observation:

Older age is associated with a higher likelihood of getting a stroke.
We can see that almost all the yellow spots appear after age 40. However, it is worth noting that a lot of the blue spots also appear after 40.
The age variable will be useful when creating a model.

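Seaborn’s distplot shows a histogram with a density curve drawn over it, and it can be shown in many variations. We use seaborn in combination with matplotlib, the Python plotting module. A distplot plots a univariate distribution of observations. Note that distplot is deprecated in recent seaborn releases in favor of histplot and displot, so the calls that follow depend on a version that still ships it.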

sns.FacetGrid(strokes_data, hue="stroke", height = 8).map(sns.distplot, "age").add_legend()
plt.title("Distplot for patients' age")
plt.show()

Observation:

Generally, higher age increases the likelihood of a stroke.
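One way to back this reading with numbers is to bin age into bands and compute the stroke rate per band. This is a sketch on made-up rows, and the cut points (40 and 60) are an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "age":    [5, 18, 25, 39, 45, 58, 63, 70, 77, 82],
    "stroke": [0, 0,  0,  0,  0,  1,  0,  1,  1,  1],
})

# Bin ages into bands; each interval is closed on the right, e.g. (40, 60].
bands = pd.cut(toy["age"], bins=[0, 40, 60, 100], labels=["0-40", "41-60", "61+"])

# Share of stroke cases within each age band.
rate_by_band = toy.groupby(bands, observed=True)["stroke"].mean()
print(rate_by_band)
```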

sns.FacetGrid(strokes_data, hue="stroke", height = 8).map(sns.distplot, "bmi").add_legend()
plt.title("Distplot for patients' bmi")
plt.show()

Observation:

The bmi column by itself cannot be used to predict the likelihood of a stroke.

sns.FacetGrid(strokes_data, hue="stroke", height = 8).map(sns.distplot, "avg_glucose_level").add_legend()
plt.title("Distplot for avg_glucose level")
plt.show()

Figure 16:- Distplot for avg_glucose_level of patient

Observation:

The avg_glucose_level column by itself cannot be used to predict the likelihood of a stroke.

A box and whisker plot is a way of summarizing a set of data measured on an interval scale. It is often used in exploratory data analysis. This type of graph is used to show the shape of the distribution, its central value, and its variability.

NOTE: In the plot below, a technique based on the inter-quartile range (IQR) is used to place the whiskers, so they do not necessarily correspond to the min and max values. Points beyond the whiskers are drawn individually as outliers.
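As a sketch of how that rule works, assume the common convention, which is also seaborn's default (`whis=1.5`), that whiskers reach at most 1.5 times the IQR beyond the box. The ages below are made up, with one deliberately extreme value.

```python
import pandas as pd

# Made-up ages; 2 is deliberately far from the rest.
ages = pd.Series([2, 40, 45, 50, 55, 60, 65])

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that are still inside the fences.
inside = ages[(ages >= lower_fence) & (ages <= upper_fence)]
lower_whisker, upper_whisker = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual outlier point.
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(q1, q3, lower_whisker, upper_whisker, outliers.tolist())
```

Here the lower whisker stops at 40 rather than at the minimum of 2, which is the situation the note describes.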

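sns.set(rc={'figure.figsize':(18,10)})
# Keep the returned Axes so the labels are applied to this plot, not an earlier one.
seaborn_plot = sns.boxplot(x='stroke', y='age', data=strokes_data)
seaborn_plot.set_xlabel("stroke",fontsize=20)
seaborn_plot.set_ylabel("age",fontsize=20)
plt.show()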

Observation:

The 75th percentile of age for patients who did not suffer from a stroke is roughly equal to the 25th percentile for patients who did.
For patients who are likely to get a stroke, the gap between the 50th and 75th percentiles is smaller than the gap between the 25th and 50th percentiles.
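The quartiles drawn in the box plot can also be computed directly per group. In this sketch the rows are made up and were chosen to mirror the pattern described above, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "stroke": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "age":    [10, 30, 45, 60, 70, 50, 60, 70, 75, 80],
})

# One row per stroke value, one column per quantile (0.25, 0.5, 0.75).
quartiles = toy.groupby("stroke")["age"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```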

A violin plot is a method of plotting numeric data. It is similar to a box plot, except that it also shows the probability density of the data at different values, usually smoothed by a kernel density estimator and drawn as a rotated density curve on each side.

Typically a violin plot will include all the data that is in a box plot: a marker for the median of the data; a box or marker indicating the interquartile range; and possibly all sample points, if the number of samples is not too high.

# Violin Plot for Age and Stroke
sns.set(rc={'figure.figsize':(18,10)})
seaborn_plot = sns.violinplot(x='stroke',y='age', data=strokes_data)
seaborn_plot.set_xlabel("Stroke",fontsize=20)
seaborn_plot.set_ylabel("Age of Patient",fontsize=20)