STROKE PREDICTION ANALYSIS (ML PROJECT)

Mory Handy
17 min read · Jul 8, 2023

FINAL PROJECT OF SAMPLING BY PACMANN — STROKE PREDICTION ANALYSIS

  1. Case

According to WHO, stroke is the second leading cause of global death, responsible for approximately 11% of total deaths worldwide. The negative impact of stroke increases the need for innovative approaches to diagnosis and management of the disease. Opportunities to improve the accuracy of stroke diagnosis and potential risk assessment can be explored through analysis of patients’ medical data.

This report presents an analysis of the relationships between various factors that may influence the likelihood of a person developing a stroke. The input features include gender, age, history of hypertension and heart disease, marital status, type of occupation, type of residence, average blood glucose level, BMI, and smoking status. Several classification models are used, including KNN, Logistic Regression, Support Vector Machine, Decision Tree, and Random Forest. The final output of this study is an analysis of which features are correlated with an increased risk of stroke, together with a model that can predict an individual's likelihood of developing a stroke. It is hoped that this research will assist medical professionals in determining appropriate treatment for patients.

2. Dataset & Others

I. Dataset Variable

a. id: unique identifier

b. gender: “Male”, “Female”, or “Other”

c. age: age of the patient

d. hypertension:

  • “No” if the patient doesn’t have hypertension (Normal: systolic blood pressure < 120 mmHg and diastolic < 80 mmHg; Pre-hypertension: systolic 120–139 mmHg and diastolic 80–89 mmHg)
  • “Yes” if the patient has hypertension (Hypertension stage 1: systolic 140–159 mmHg and diastolic 90–99 mmHg; Hypertension stage 2: systolic > 160 mmHg and diastolic > 100 mmHg)

e. heart_disease:

  • “No” if the patient has never had heart disease,
  • “Yes” if the patient has had heart disease

f. ever_married: “No” or “Yes”

g. work_type: “children”, “Govt_job”, “Never_worked”, “Private”, or “Self-employed”

h. Residence_type: “Rural” or “Urban”

i. avg_glucose_level: average glucose level in blood (mg/dL)

j. bmi: body mass index (categories: Underweight, Normal limit, Overweight, Pre-obese, Obese I, Obese II)

k. smoking_status: “formerly smoked”, “never smoked”, “smokes”, or “Unknown”*

l. stroke: 1 if the patient had a stroke or 0 if not

Note: “Unknown” in smoking_status means that the information is unavailable for this patient

II. Description

In general, the dataset has 12 features/columns and 5,256 observations. After checking, 146 duplicate observations were found and dropped, leaving 5,110 observations. These were divided into 3,832 training observations and 1,278 test observations, with 5-fold cross-validation used for validation. Next, data preprocessing was performed by splitting the input and output: the “stroke” column was separated as the output variable y and the remaining columns as the input X. Each stage was implemented as a function so it could be applied repeatedly.

Next, the unique values of each feature were checked: gender: 3, age: 104, hypertension: 2, heart_disease: 2, ever_married: 2, work_type: 5, Residence_type: 2, avg_glucose_level: 3,979, bmi: 418, smoking_status: 4. The id feature was dropped because it is a unique identifier that carries no predictive information. The gender feature was compressed into 2 unique values, with the “Other” category merged into “Female”, the mode of the column. A train-test split was then carried out so that the test set acts as unseen future data and overfitting to the training data can be detected. The test size was set to 0.25 with stratification because the data is imbalanced.
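A minimal sketch of these steps with pandas and scikit-learn (the file name and random seed are assumptions, not taken from the original notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file name; the actual path used in the original notebook is not stated.
df = pd.read_csv("healthcare-dataset-stroke-data.csv")

# Drop the duplicate observations and the non-informative id column
df = df.drop_duplicates().drop(columns=["id"])

# Collapse the rare "Other" gender into the mode of the column ("Female")
df["gender"] = df["gender"].replace("Other", "Female")

# Separate the input features X from the output variable y ("stroke")
X = df.drop(columns=["stroke"])
y = df["stroke"]

# Stratified 75/25 split so that the class imbalance is preserved in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
```

The 3,832 / 1,278 split reported above follows from applying this 75/25 ratio to the 5,110 deduplicated observations.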

III. Exploratory Data Analysis

After splitting the data, Exploratory Data Analysis was conducted. Skewness was checked for the numeric features: age (-0.129150), avg_glucose_level (1.583383), and bmi (1.018919). The training data contain 187 stroke patients versus 3,645 non-stroke patients, so the data are categorized as imbalanced. The dominant gender is Female (2,266 individuals, or 59.13% of the training sample) compared to Male (1,566 individuals). Distribution plots of age, bmi, and avg_glucose_level were made to inspect their probability density functions and to compare the stroke and non-stroke categories.

Meanwhile, for the other categorical features (hypertension, heart disease, marital status, work type, residence type, smoking status, and gender), plotting was done as shown in the following figure.

Next, checking for outliers on numerical variables was performed and the following boxplot was obtained.

Several outliers were found in avg_glucose_level for non-stroke patients, and stroke patients tend to have higher avg_glucose_level than non-stroke patients. For bmi, the pdf and boxplot show a mean and median that are close to each other. For age, the means and medians differ noticeably between stroke and non-stroke patients. Hypothesis testing was conducted on the numerical variables to check the equality of means between stroke and non-stroke patients. The results show that the means of avg_glucose_level and age differ significantly between the two groups, whereas the difference in the mean of bmi is not statistically significant.
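The exact test used is not stated; a minimal sketch with a two-sample Welch t-test from SciPy on the training data could look like this:

```python
from scipy import stats

# Two-sample t-test per numerical feature; Welch's variant (equal_var=False)
# is an assumption, since the article does not state the exact test used.
for col in ["age", "avg_glucose_level", "bmi"]:
    stroke_vals = X_train.loc[y_train == 1, col].dropna()
    non_stroke_vals = X_train.loc[y_train == 0, col].dropna()
    t_stat, p_value = stats.ttest_ind(stroke_vals, non_stroke_vals, equal_var=False)
    print(f"{col}: t = {t_stat:.2f}, p = {p_value:.4f}")
```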

IV. Data Imputation (Preprocessing)

At this stage, features are divided into numerical and categorical. Three features have NaN values: work_type, Residence_type, and bmi. For bmi, which is numerical, imputation is done using SimpleImputer from scikit-learn with the median strategy. For the categorical work_type and Residence_type features, NaN values are filled with an “Empty” category.

Once there are no more missing values, the categorical variables are preprocessed using One Hot Encoding and Label Encoding. OHE is applied to features with more than 2 unique values (work_type, smoking_status), while LE is applied to features with 2 unique values (gender, hypertension, heart_disease, ever_married, Residence_type).

Then, the features are merged back using concat. After ensuring that all features have been converted to numbers, the variables are standardized, because some of the machine learning methods used are distance-based or regularized and therefore sensitive to features with wide value ranges, such as KNN and Logistic Regression with Ridge and Lasso penalties. Standardization is done using StandardScaler from scikit-learn. The resulting standardized dataset is shown below.
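A minimal sketch of the imputation, encoding, and scaling steps with pandas and scikit-learn; in practice the imputer, encoders, and scaler should be fit on the training data only and then reused on the test data, which this sketch compresses for brevity:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler

def preprocess(X):
    X = X.copy()

    # Numerical imputation: fill missing bmi values with the median
    X[["bmi"]] = SimpleImputer(strategy="median").fit_transform(X[["bmi"]])

    # Categorical imputation: missing work_type / Residence_type become "Empty"
    X[["work_type", "Residence_type"]] = X[["work_type", "Residence_type"]].fillna("Empty")

    # One Hot Encoding for features with more than two categories
    X = pd.get_dummies(X, columns=["work_type", "smoking_status"])

    # Label Encoding for binary features
    for col in ["gender", "hypertension", "heart_disease", "ever_married", "Residence_type"]:
        X[col] = LabelEncoder().fit_transform(X[col].astype(str))

    # Standardize everything for the distance-based / regularized models
    scaled = StandardScaler().fit_transform(X)
    return pd.DataFrame(scaled, columns=X.columns, index=X.index)

X_train_clean = preprocess(X_train)
```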

V. Sampling

The stroke and non-stroke data are imbalanced, so to experiment with model performance, the training samples were rebalanced using Random Under Sampling (374 observations), Random Over Sampling (7,290 observations), and SMOTE oversampling (7,290 observations).
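A sketch of how these three configurations can be produced; the imbalanced-learn library and the random seed are assumptions:

```python
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Each sampler returns a rebalanced copy of the training data;
# the random seed is an assumption for reproducibility.
X_rus, y_rus = RandomUnderSampler(random_state=42).fit_resample(X_train_clean, y_train)
X_ros, y_ros = RandomOverSampler(random_state=42).fit_resample(X_train_clean, y_train)
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X_train_clean, y_train)

# Roughly 374, 7,290, and 7,290 observations respectively, as reported above
print(len(y_rus), len(y_ros), len(y_smote))
```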

4. Method

I. K-Nearest Neighbor (KNN)

This algorithm estimates the value at a point using its nearest neighbors: the k observations that are closest in distance to a query point, say X. The method can be used for both regression and classification. For regression, the mean of the neighbors’ target values is used; for classification, the majority vote is used.

The steps involved in KNN are, first, select an observation point, for example on a Cartesian plane plotted for two features. Next, determine the value of k, the number of nearest neighbors of the observation point; for example, take k = 3. Then, take the mean of the neighbors’ values (regression) or their majority vote (classification). The nearest neighbors are determined by their distance to the observation point, which can be computed with several metrics, such as the Euclidean and Manhattan distances given by the following equations.
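In standard form, for two points x and x′ with p features:

d_{\text{Euclidean}}(x, x') = \sqrt{\sum_{j=1}^{p} (x_j - x'_j)^2}

d_{\text{Manhattan}}(x, x') = \sum_{j=1}^{p} |x_j - x'_j|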

The optimal value of k depends on the data, and cross-validation can be used to compare k values. A smaller k reduces bias but increases the potential for overfitting, since the model becomes more complex. A larger k decreases sensitivity to noise, making the model simpler but increasing its bias. This is the Bias-Variance Tradeoff.

The K-nearest neighbor model can become inefficient when the data are high-dimensional (large p). To address this, one can apply feature selection, use a more complex distance function, or switch to a parametric model such as Linear Regression or Logistic Regression.

II. Logistic Regression

This algorithm is derived from linear regression. A linear regression model can produce outputs below 0 or above 1, which cannot be interpreted as probabilities. Rather than using those raw values, we can bound the output of the linear model by an upper bound of 1 and a lower bound of 0. The resulting algorithm is called logistic regression.

The first step is assuming that the target output is continuous. Next, we can estimate linear regression using the following function:

Next, we can find the minimal value of w that minimizes the following function.

We can use the result of the regression function for classifying new data with the following configuration:

How do we get the best decision boundary? Using the linear regression function mentioned above, project each data point onto the line defined by w. Then, create a probability density function (pdf) for each class. The projected classes may overlap.

We need to find a line that can maximize the separation projection on each class. The optimal decision depends on the posterior probability of the class with the following formula.

We don’t know the value of P(x|y). We can estimate it using the following function.

Where,

g(x) is a function that transforms the linear function into probabilities.

We can use the old linear regression function and input it into the sigmoid function.

Decision boundary when probability = 0.5
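In standard notation, the sigmoid maps the linear score to a probability, and the decision boundary at probability 0.5 corresponds to a linear score of zero:

P(y = 1 \mid x) = g(w^{\top} x) = \frac{1}{1 + e^{-w^{\top} x}}

\hat{y} = 1 \;\text{ if }\; g(w^{\top} x) \ge 0.5 \;(\text{equivalently } w^{\top} x \ge 0), \quad \hat{y} = 0 \;\text{ otherwise}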

For logistic regression models, there is no closed-form solution like OLS. We can use Gradient Descent to obtain the solution for the equation. Regularization can be performed using L1 norm and L2 norm.

Here is the regularized logistic regression similar to linear regression:
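In standard form, with the negative log-likelihood as the loss and regularization strength \lambda:

J(w) = -\sum_{i=1}^{n} \left[ y_i \log g(w^{\top} x_i) + (1 - y_i) \log\left(1 - g(w^{\top} x_i)\right) \right] + \lambda \lVert w \rVert_2^2 \quad (\text{L2 / Ridge})

J(w) = -\sum_{i=1}^{n} \left[ y_i \log g(w^{\top} x_i) + (1 - y_i) \log\left(1 - g(w^{\top} x_i)\right) \right] + \lambda \lVert w \rVert_1 \quad (\text{L1 / Lasso})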

Regularization shrinks the weight coefficients of the model towards zero.

III. Support Vector Machine

This algorithm separates the data using a hyperplane. For one-dimensional data, the separating “hyperplane” is a point; for two-dimensional data, it is a line; for higher-dimensional data, it is a hyperplane that cannot be visualized. Infinitely many separating hyperplanes can be drawn, and the algorithm chooses the one that maximizes the margin between the classes.

Since the data are not completely separable, a soft margin classifier can be used, represented by slack values ξ. A value of ξ = 0 means the point is correctly classified and lies on or outside the margin, a value of 0 < ξ ≤ 1 means the point lies inside the margin but on the correct side of the hyperplane, and a value of ξ > 1 indicates that the point is misclassified.

The value of C is a hyperparameter that controls the margin: a small C gives a large margin, while a large C gives a small margin. If a linear boundary cannot separate the data effectively, a kernel function can be used to add new variables. Examples of kernel functions include the linear kernel, the d-th degree polynomial kernel, and the Radial Basis Function (RBF).
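In standard form, the soft margin classifier with slack variables \xi_i and penalty parameter C solves:

\min_{w, b, \xi} \;\; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i (w^{\top} x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0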

IV. Decision Tree

The Decision Tree algorithm divides the dataset into several box-shaped regions. It selects the feature and threshold that best split the target and repeats this step until the desired max_depth is reached, creating boundaries that classify data according to the regions created.

A Decision Tree has several components: decision/internal nodes, leaf nodes, branches, and a root node. The tree resembles an inverted tree with the root at the top. A decision node tests the value of one attribute or feature, the branches/edges leaving a node are labeled with the possible values of that attribute, and a leaf node holds an output value.

For regression, splits are found by computing the MSE (Mean Squared Error) at each candidate threshold in the feature space. The tree is built recursively, one branch at a time, and each split is chosen so that it produces the largest decrease in the cost function J. The algorithm stops when a stopping criterion is reached.

For classification problems, the question is how to select the best attribute to split the data. Options include choosing randomly, choosing the attribute with the fewest possible values, choosing the attribute with the most possible values, or choosing the attribute with the maximum information gain, which tends to keep the tree small.

The concept of entropy describes the level of impurity of a group: a highly impure group has a high degree of randomness, while a group with low impurity tends towards a single, homogeneous class. The value of entropy is obtained using the following equation:
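In its standard form, for a set S with class proportions p_c:

H(S) = -\sum_{c=1}^{C} p_c \log_2 p_c

For binary classification, H(S) = 0 for a pure node and H(S) = 1 for a 50/50 split.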

A set with high impurity is the one a split can learn the most from. Besides entropy, the Gini index and classification error can be used to measure impurity. A good attribute splits the data into ideal subsets that are all positive or all negative. The attribute to split on is the one that most reduces entropy given the previous splits; this reduction is called information gain = entropy(parent) − weighted average entropy(children). Adding branches (growing the tree deeper) increases training accuracy, up to 100% on the training data, but beyond some point the accuracy on the test data will decrease.

Decision trees are vulnerable to overfitting: without a stopping criterion during tree building, a tree can reach 100% accuracy on the training data. The solution is tree pruning. The steps are to grow a very complex tree, then select a tuning parameter alpha (a non-negative float). For each alpha value there is a subtree T that minimizes the penalized cost function, and the tree is pruned to the subtree that maximizes performance on the validation set.
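scikit-learn exposes this idea as cost-complexity pruning; a minimal sketch, assuming the preprocessed training data from earlier (note that the experiments below prune by limiting max_depth instead):

```python
from sklearn.tree import DecisionTreeClassifier

# Candidate alpha values from the cost-complexity pruning path of a fully grown tree
tree = DecisionTreeClassifier(random_state=42)
path = tree.cost_complexity_pruning_path(X_train_clean, y_train)

# Fit one pruned tree per alpha; the best alpha would then be chosen
# by evaluating each tree on a validation set.
pruned_trees = [
    DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train_clean, y_train)
    for alpha in path.ccp_alphas
]
```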

The advantages of the decision tree model are that it can handle high-dimensional data and is easy to interpret. However, it has limitations in modeling linear relationships and tends to have high variance, which increases the potential for errors and makes its results less reliable. Because of this, the model is well suited to ensemble methods such as boosting and bagging. It is considered unstable because a slight change in the data can result in a very different partition.

V. Random Forest (Ensemble Model)

This model is related to the ensemble model concept, where the idea is to create several models with their own errors. Using less correlated classifiers will reduce the variance and improve the model’s performance. These models are obtained from data through bootstrapping, a resampling method with replacement.

A single Decision Tree can be unstable and have high variance. To reduce the variance, B bootstrapped datasets are created and one tree is fit to each of them, producing B tree models. For classification, the majority vote across trees is used.

The number of trees B is not a critical variable. Using a high value of B does not make the model overfit. Practically, using a large value of B can decrease the error rate.

To further reduce the error, feature randomization can be applied within each tree, so that only a random subset of features is considered at each split; this decorrelates the trees.
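A minimal sketch with scikit-learn's RandomForestClassifier; the parameter values are illustrative assumptions rather than the settings used in the experiments:

```python
from sklearn.ensemble import RandomForestClassifier

# B bootstrapped trees with feature randomization at each split;
# n_estimators and max_features here are illustrative assumptions.
rf = RandomForestClassifier(
    n_estimators=500,      # number of bootstrapped trees B
    max_features="sqrt",   # random subset of features considered per split
    random_state=42,
)
rf.fit(X_train_clean, y_train)
```

Here max_features="sqrt" is a common choice for classification and implements the feature randomization described above.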

5. Experiment/Result/Discussion

Because the target is roughly 95% imbalanced towards non-stroke, accuracy is less suitable as the main metric for judging model performance in this stroke classification case. The business case here is better served by metrics that penalize False Negatives (FN).

If the model predicts that someone will not have a stroke when in reality they do, the consequences are more dangerous: preventive treatment cannot be given to that individual, and the medical intervention needed for recovery is delayed.

There are several metrics that can be used for classification models, including:

  1. Accuracy: (Total Correct Predictions) / (Total Observations)
  2. Precision: True Positive / (True Positive + False Positive)
  3. Recall: True Positive / (True Positive + False Negative)
  4. F1 Score: Harmonic mean between precision and recall
  5. ROC (Receiver Operating Characteristic) Curve: Plot of TPR (True Positive Rate) against FPR (False Positive Rate)
  6. AUC (Area Under Curve): The closer to 1, the better.
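These metrics can be computed with scikit-learn; a minimal sketch, assuming rf is a fitted model as above and X_test_clean / y_test are the test data after the same preprocessing (hypothetical names):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# X_test_clean is assumed to be the test set after the same preprocessing steps
y_pred = rf.predict(X_test_clean)                # hard class predictions
y_proba = rf.predict_proba(X_test_clean)[:, 1]   # probability of the stroke class

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_proba))
```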

From the explanation above, a good Recall value is needed to achieve good model performance. Next, for the experimental stage, fitting of the vanilla model (default model) to X_train is performed for all models discussed above. Model fitting is also performed for all datasets that undergo Random Under Sampling, Random Over Sampling, and SMOTE. The resulting models are as follows:

In addition, the Cross Validation process is carried out to determine the best parameters that can be used by each model to improve its performance. A value of k equal to 5 is used in Grid Search CV, meaning that the training data is split into 5 segments and Cross Validation is performed for each segment of data.

For the Logistic Regression model, the penalty is set to L1 (Lasso) or L2 (Ridge), while the regularization parameter C is searched over logspace(-5, 5, 20). For the K Nearest Neighbor model, n_neighbors (the number of neighbors) is set to 3, 5, 7, and 9 with uniform and distance weights. For the Decision Tree model, max_depth is set from 2 to 12 with the gini, entropy, and log loss criteria. For the Support Vector Machine model, the kernel parameter is set to linear, poly, and rbf with C over logspace(-4, 4, 20). Based on this experimentation, the best parameters obtained are as follows:
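A sketch of the Grid Search CV setup for the Logistic Regression model; the solver and the recall-based scoring are assumptions, since they are not stated above:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Grid matching the ranges described above; the liblinear solver (which supports
# both L1 and L2 penalties) and the recall scoring choice are assumptions.
param_grid = {
    "penalty": ["l1", "l2"],
    "C": np.logspace(-5, 5, 20),
}
grid = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    param_grid,
    cv=5,
    scoring="recall",
)
grid.fit(X_train_clean, y_train)
print(grid.best_params_, grid.best_score_)
```

With the default refit behavior, GridSearchCV refits the best configuration on the full training data, so grid.best_estimator_ can then be evaluated on the test set.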

Based on the accuracy metrics, it was found that the decision tree and random forest models have an accuracy score of 100%. This indicates the potential for overfitting of these models to the training data. From the experimental results, the accuracy, precision, recall, and f1 scores were obtained for the training and validation data of each experiment as follows:

For the Support Vector Machine and Random Forest models, Grid Search CV was not performed because of the high computational cost, which would have required a very long running time during experimentation.

Next, ROC and AUC plots were produced for the training and validation data configurations, and the following values were obtained:

Based on the experiments on the training and validation data, the best AUC score was obtained with the Random Under Sampling data. This is expected because fewer data points are involved, so accurate predictions are more likely. The Decision Tree and Random Forest models overfit, so tree pruning was performed by reducing the max depth to 11.

Next, the model fitting process was carried out on the test data to obtain the model performance. The model used on the test data is the model created on the training and validation data. The following are the results of the model performance after modeling on the test data:

Based on the experimental results, Logistic Regression has the best performance in terms of Recall (0.760) and AUC (0.839), so it is chosen as the best model even though its Accuracy is only 0.767. The next candidate is the Logistic Regression CV model with an Accuracy of 0.952, Recall of 0.480, and AUC of 0.812; however, this model is not used because it predominantly predicts non-stroke. The third candidate is the Random Forest model with an Accuracy of 0.911, Recall of 0.520, and AUC of 0.839. These results are depicted in the following Confusion Matrix:

From the experiment, it can be concluded that because the data are heavily imbalanced (>90%), the tree models overfit strongly to the training data. As a result, these models do not produce good predictions on the test data even though tree pruning was done during the Cross Validation process. The logistic regression model achieves better Recall and AUC values.

6. Conclusion / Recommendation

a. For further experimentation, Principal Component Analysis can be used to reduce the number of features, which can reduce the overfitting conditions of the model.

b. There is overfitting in the tree-based models (Decision Tree and Random Forest): accuracy, recall, and AUC reach 1 on the training data but drop significantly on the test data, so further tree pruning and hyperparameter optimization are needed for these models.

c. Based on the experimentation in this case, the best supervised machine learning algorithm is Logistic Regression, with a recall of 0.760, AUC of 0.839, and accuracy of 0.767 on the test data. On the training data, the recall was 0.77, the AUC 0.847, and the accuracy 0.746.

d. Accuracy is not used as the main metric to measure model performance because the data are more than 95% imbalanced; recall and AUC are used instead.

Thank you for taking the time to read this article; I hope you gained new insights.
