Understanding Heart Stroke with Machine Learning

Predicting Stroke outcomes with Machine Learning

Published in

The Deep Hub

4 min read · Feb 7, 2024

Introduction

Is there hope for a healthier and longer life with Machine Learning? Absolutely! Given enough good-quality data, Data Science tools can be used to predict health outcomes. The issue we’re delving into here is stroke, the second leading cause of death worldwide. We’ll break down the factors associated with stroke and analyze each one in detail.

Context

According to the World Health Organization (WHO)[2], stroke is the second leading cause of death worldwide, responsible for approximately 11% of total deaths. A stroke occurs when blood flow to an area of the brain is cut off, leading to brain cell damage and potentially life-altering consequences. Early detection and prevention are crucial in managing stroke risk and improving patient outcomes.

Data Attributes

Before the analysis, it helps to understand the attributes. The dataset[1] contains patient records with factors that may influence the likelihood of a stroke: age, gender, hypertension, heart disease, average glucose level, body mass index (BMI), and smoking status. The target column records whether the patient has had a stroke.

Analysis of Data

To gain insights from the dataset, we’ll first explore the distribution of age among patients using a histogram. This will help us understand the age demographics of the patients in the dataset.

Distribution of Age with respect to Stroke

The risk of stroke rises with age, and the increase becomes more pronounced from about age 50.
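
The histogram's message can also be checked numerically. The sketch below is a minimal illustration, assuming pandas and a handful of invented records (the `toy` frame is hypothetical, not the Kaggle data): it groups patients into decade-wide age bins and reports the share of stroke cases in each.

```python
import pandas as pd

# Invented records for illustration only; the column names "age" and
# "stroke" are assumed to match the dataset described above.
toy = pd.DataFrame({
    "age":    [12, 25, 34, 41, 48, 52, 57, 63, 68, 71, 76, 80],
    "stroke": [0,  0,  0,  0,  0,  0,  1,  0,  1,  0,  1,  1],
})

# Decade-wide bins that include the left edge, e.g. [50, 60).
bins = list(range(0, 91, 10))
toy["age_bin"] = pd.cut(toy["age"], bins=bins, right=False)

# The mean of a 0/1 column is the share of stroke cases in each bin.
stroke_rate_by_age = (
    toy.groupby("age_bin", observed=True)["stroke"]
       .agg(patients="size", stroke_rate="mean")
)
print(stroke_rate_by_age)
```

On the real data, the same grouping would show whether the stroke rate actually climbs after 50 or whether the histogram mostly reflects how many patients fall in each age band.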

We’ll also create a correlation matrix to identify relationships between different variables, such as age, hypertension, and stroke.
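
As a rough sketch of that step, assuming pandas and the same kind of invented toy records (not the article's actual data), the correlation matrix is a single call to `DataFrame.corr()`, which computes pairwise Pearson correlations between numeric columns.

```python
import pandas as pd

# Invented records for illustration only.
toy = pd.DataFrame({
    "age":          [12, 25, 34, 41, 48, 52, 57, 63, 68, 71, 76, 80],
    "hypertension": [0,  0,  0,  0,  1,  0,  1,  0,  1,  1,  0,  1],
    "stroke":       [0,  0,  0,  0,  0,  0,  1,  0,  1,  0,  1,  1],
})

# Pairwise Pearson correlation between every pair of numeric columns.
corr = toy.corr()
print(corr.round(2))

# Rank the other features by their correlation with the target.
print(corr["stroke"].drop("stroke").sort_values(ascending=False))
```

Correlation only captures linear association, so a weak value does not rule out a relationship, and a strong one does not establish cause.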

Additionally, we’ll use a pairplot to visualize relationships between multiple features at once.

Work Categorization

According to the bar chart, the ‘Private’ work type accounts for the largest proportion of stroke cases, ahead of the ‘Self-employed’, ‘Govt_job’, ‘Children’, and ‘Never_worked’ categories. Because this compares shares of all stroke cases rather than the stroke rate within each category, it does not by itself show that private-sector workers face a higher individual risk.
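
To make the difference between "share of stroke cases" and "stroke rate within a group" concrete, here is a small sketch assuming pandas. The counts are invented for illustration and are not the dataset's figures; the column names `work_type` and `stroke` are assumed.

```python
import pandas as pd

# Invented counts for illustration only; they are not the dataset's figures.
toy = pd.DataFrame({
    "work_type": ["Private"] * 10 + ["Self-employed"] * 4 + ["Govt_job"] * 4,
    "stroke":    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0] + [1, 0, 0, 0] + [0, 0, 0, 0],
})

summary = toy.groupby("work_type")["stroke"].agg(patients="size", strokes="sum")

# Share of all stroke cases that fall in each work type (what a count chart shows).
summary["share_of_all_strokes"] = summary["strokes"] / summary["strokes"].sum()

# Stroke rate inside each work type (what "likelihood" usually means).
summary["stroke_rate_in_group"] = summary["strokes"] / summary["patients"]
print(summary)
```

In this toy example the largest group holds most of the stroke cases while a smaller group has the higher rate, which is why both columns are worth reporting.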

Engineering Data

The data contains 4861 rows labeled no stroke (0) and 249 rows labeled stroke (1), so the classes are heavily imbalanced: only about 5% of patients had a stroke. We’ll oversample the minority class so that the model is not dominated by the no-stroke majority.

We’ll encode categorical variables and use the RandomOverSampler to balance the dataset, ensuring that both stroke and non-stroke cases are adequately represented.

After oversampling, the number of stroke and non-stroke rows is equal.
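
The encoding and balancing steps can be sketched as follows. This is a minimal illustration, not the article's code: it uses `sklearn.utils.resample` as a stand-in for `RandomOverSampler`, and the feature values are synthetic. Only the class counts (4861 and 249) come from the text above.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

rng = np.random.default_rng(0)
n_no_stroke, n_stroke = 4861, 249  # class counts quoted in the article
n_rows = n_no_stroke + n_stroke

# Synthetic stand-in features; only the class counts are taken from the article.
df = pd.DataFrame({
    "age": rng.integers(1, 90, size=n_rows),
    "smoking_status": rng.choice(
        ["never smoked", "smokes", "formerly smoked"], size=n_rows
    ),
    "stroke": [0] * n_no_stroke + [1] * n_stroke,
})

# One-hot encode the categorical column so the model receives numeric inputs.
encoded = pd.get_dummies(df, columns=["smoking_status"])

majority = encoded[encoded["stroke"] == 0]
minority = encoded[encoded["stroke"] == 1]

# Draw minority rows with replacement until both classes are the same size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)

# Recombine and shuffle the rows.
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=0)
print(balanced["stroke"].value_counts())
```

One design choice to weigh: if the data is split into training and test sets, oversampling only the training portion keeps duplicated stroke rows out of the test set.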

Model Selection and Training

Random Forest was chosen for its ability to handle complex datasets with multiple features and its robust performance in classification tasks, making it well-suited for predicting stroke risk based on various patient factors.

To predict stroke risk, we’ll use the Random Forest classifier and perform a grid search to find the optimal hyperparameters.
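
A grid search over a Random Forest might look like the sketch below. It assumes scikit-learn and runs on synthetic data from `make_classification`. The parameter grid, the 80/20 split, and the 3-fold cross-validation are illustrative choices, not the settings used for the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the encoded patient table.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hypothetical grid; every combination is scored with 3-fold cross-validation.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# best_estimator_ is refit on the full training split with the best parameters.
y_pred = search.best_estimator_.predict(X_test)
print("Best parameters:", search.best_params_)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```

The held-out test split is touched only once, after the search has finished, so the reported accuracy is not tuned against it.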

Random Forest Classifier (with Best Parameters) Accuracy: 0.9928020565552699

Confusion Matrix (Random Forest Classifier with Best Parameters):
[[961  14]
 [  0 970]]

Breaking down the values

961 instances were correctly classified as no stroke.
14 instances were incorrectly classified as having a stroke when they didn’t.
0 instances were incorrectly classified as not having a stroke when they actually did.
970 instances were correctly classified as having a stroke.

In the classification report, the F1-score for both the ‘0’ and ‘1’ classes is very high (close to 0.99), indicating a good balance between precision and recall.
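
These figures can be recomputed directly from the confusion matrix above. The short worked example below uses only the four counts from the matrix, read with rows as true classes and columns as predicted classes.

```python
# Counts from the confusion matrix: rows are true classes, columns are predictions.
tn, fp = 961, 14   # true class 0 (no stroke)
fn, tp = 0, 970    # true class 1 (stroke)

total = tn + fp + fn + tp
accuracy = (tn + tp) / total


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Class 1 (stroke): how many predicted strokes were real, and how many real strokes were found.
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

# Class 0 (no stroke): the same quantities with the roles of the classes swapped.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

f1_1 = f1(precision_1, recall_1)
f1_0 = f1(precision_0, recall_0)

print(f"accuracy = {accuracy:.4f}")
print(f"class 0: precision = {precision_0:.4f}, recall = {recall_0:.4f}, F1 = {f1_0:.4f}")
print(f"class 1: precision = {precision_1:.4f}, recall = {recall_1:.4f}, F1 = {f1_1:.4f}")
```

The 14 false positives are the only errors, so they lower recall for class 0 and precision for class 1, while recall for the stroke class stays at 1.0.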

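Overall, the tuned Random Forest classifier achieved high accuracy on both positive and negative cases. Two caveats apply. First, the test set contains 970 stroke cases while the original data has only 249, so the evaluation was run on oversampled data; copies of the same stroke row can land in both the training and test splits, which tends to inflate these scores. Second, the cost of errors is not symmetric: in stroke prediction a false negative (a missed stroke) is usually more serious than a false positive, so recall on the stroke class deserves particular attention.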

Conclusion

With the help of Machine Learning algorithms, the likelihood of health issues like stroke can be estimated, and precautionary measures can be taken accordingly. This small dataset showcases the potential of machine learning in improving health outcomes; imagine the impact at a larger scale, with more data and more advanced algorithms.

ML can revolutionize healthcare, leading to longer and healthier lives for human beings. By leveraging technology to detect health risks early and implementing preventive measures, we can empower individuals to take control of their health and live happier, more fulfilling lives.

References

[1] Kaggle dataset: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

[2] WHO: https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death