
Prediction of heart disease

From the age of 20 onward, heart disease becomes increasingly common among adults.

Jermaine Matthew
11 min read · Dec 5, 2022


Heart disease is the leading cause of death worldwide, responsible for about 18.2 million deaths per year. In the United States alone, roughly 659,000 people die from heart disease each year, accounting for about one in four deaths. An estimated 16.3 million Americans aged 20 and older have coronary heart disease. A person's risk of heart disease can be predicted by analyzing certain clinical factors.

Impact on business

Machine learning analyzes, extracts, and organizes information from large sets of raw data. It is currently used as a form of artificial intelligence for finding patterns in data and feeding the results into business decisions. The healthcare industry offers many such applications. A machine learning algorithm can be used to assess a patient's risk of heart disease: a predictive model can identify high-risk patients, support accurate diagnoses, and help formulate effective treatment plans. Patients' symptoms, records, and datasets can be used to predict heart disease and give doctors insights for creating treatment plans. It also lets patients detect early symptoms of severe heart disease, so they can seek medical attention before missing the best window for treatment.

Data Exploration

Data acquisition

The heart disease dataset comes from Kaggle. In the original data, the response is recorded on a 0–4 scale, where 0 indicates no heart disease and 1, 2, 3, and 4 indicate heart disease. In this particular dataset, the response variable is called 'target' and takes two nominal values: 0 (heart disease not present) and 1 (heart disease present).

To give you a brief overview of the dataset, here is a sample:

The total number of observations is 1025. Although the original dataset has more columns, only the 14 most significant ones, including the predicted attribute, are used here. The dataset contains no null values.
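As a rough sketch of this step (assuming the Kaggle CSV is saved locally as heart.csv, a hypothetical file name), the dataset can be loaded and checked with pandas:

```python
import pandas as pd

# Load the Kaggle heart disease dataset (file name assumed to be heart.csv)
df = pd.read_csv("heart.csv")

print(df.shape)           # expected: (1025, 14)
print(df.isnull().sum())  # no null values in any column
print(df.head())          # a quick sample of the 14 attributes
```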

An overview

As shown below, the statistics for all attributes are:

Most of the attributes are nominal, except for age, trestbps, chol, thalach, and oldpeak, which are continuous numerical values.

The following are detailed descriptions of each attribute:

A visual representation

Below is the class distribution of our response variable, target, where 1 indicates heart disease and 0 indicates no heart disease:

Based on the target variable, we see that 526 patients have heart disease and 499 do not. Our target variable does not show a significant class imbalance. Hence, the performance of a model is evaluated based on its accuracy and confusion matrix.

For a better understanding of the data, here are the distributions for all 14 attributes:

As we can see in the histograms above, the age distribution is slightly skewed to the left, indicating that the dataset contains a large number of older individuals. There are more males in the dataset than females. A right-skewed distribution exists for the “trestbps” (resting blood pressure), the “chol” (serum cholesterol), and the “oldpeak” (ST depression induced by exercise relative to rest). Five continuous numerical variables and nine categorical variables are included in this study.
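Histograms like the ones above can be produced directly from the loaded DataFrame; this is only a sketch of how such a figure could be generated, not the exact plotting code used here:

```python
import matplotlib.pyplot as plt

# Plot the distribution of every attribute, assuming df is the DataFrame loaded earlier
df.hist(figsize=(14, 12), bins=20)
plt.tight_layout()
plt.show()
```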

Although the correlation matrix shows no strong correlations between the independent variables, we still addressed the slight multicollinearity. The strongest positive correlation is between "slope" and "thalach", while the strongest negative correlation, -0.58, is between "oldpeak" and "slope". Among the other positively correlated pairs, "thalach" and "exang" have a correlation of 0.38. "age" and "thalach" are negatively correlated at -0.39, "cp" and "exang" at -0.4, and "thalach" and "oldpeak" are also negatively correlated. The strongest positive correlations with the target variable belong to "cp", "thalach", and "slope", at 0.43, 0.42, and 0.35 respectively. The strongest negative correlations with the target belong to "exang", "oldpeak", and "ca", at -0.44, -0.44, and -0.38 respectively.

To reduce multicollinearity, we drop "thalach" because of its high correlation with multiple attributes. We also drop "slope", since it is highly correlated with "oldpeak" and has a weaker correlation with the target variable than "cp" or "thalach".
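A minimal sketch of this step, assuming the DataFrame df from earlier, computes the correlation matrix and then drops the two attributes:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Inspect pairwise correlations between the attributes
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()

# Drop the two attributes flagged for multicollinearity
df = df.drop(columns=["thalach", "slope"])
```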

Analyzing and modeling

Since this is a classification problem, we ran the supervised models in Python and the unsupervised analysis in Weka:

Supervised:

  1. K-Nearest Neighbors
  2. Naive Bayes
  3. Logistic Regression
  4. Decision Tree

Unsupervised:

Association Rules (Apriori)

For the supervised models, we used Python's sklearn library to run the four classifiers and pre-process the data. Our evaluation metrics were the confusion matrix and the overall accuracy of each model, but we focused mainly on recall because of the high cost of misclassifying a patient who actually has heart disease as healthy. Recall is calculated as true positives / (true positives + false negatives). We split the data 70:30 between training and testing.
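A sketch of the split, assuming df still holds the cleaned data (the random_state value is an arbitrary choice for reproducibility, not taken from the original analysis):

```python
from sklearn.model_selection import train_test_split

# Separate the predictors from the response variable
X = df.drop(columns=["target"])
y = df["target"]

# 70:30 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Recall = true positives / (true positives + false negatives);
# sklearn.metrics.recall_score computes this from a fitted model's predictions
```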

To create a benchmark for the model results, we first ran each classifier on the unprocessed data. Next, we performed a grid search with 10-fold cross-validation to optimize model performance, discretizing the continuous variables and tuning the model hyperparameters. In the final step, we used grid search to reduce the data to the attributes that matter most and evaluated the models again: using sklearn's SelectKBest tool inside a pipeline, we selected the top k features for each algorithm and then ran each algorithm on the selected features.

To determine which factors were associated with heart disease, we ran association rule mining on the dataset using Weka.

Benchmark

As the results for the four classifiers after the train/test split show, the decision tree classifier performed best without any preprocessing. KNN performed the worst, especially on recall, which indicates a high number of misclassified patients: the confusion matrix shows that 33% of patients who actually had the disease were classified as not having it. Logistic regression and Naive Bayes yielded similar results. We then applied preprocessing techniques to further improve model performance.
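For reference, the benchmark step can be sketched by fitting each classifier with default settings on the raw split; the exact settings used here are not listed, so the defaults below are assumptions:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix

# Benchmark: run each classifier on the unprocessed data
models = {
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name)
    print("  accuracy:", accuracy_score(y_test, y_pred))
    print("  recall:  ", recall_score(y_test, y_pred))
    print("  confusion matrix:\n", confusion_matrix(y_test, y_pred))
```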

The pre-processing phase

Grid search evaluates model performance across different hyperparameter values and keeps the combination that yields the best performance.

For logistic regression, the hyperparameter to tune is C, the inverse of regularization strength. Regularization improves a model's performance on unseen data by preventing overfitting to the training data: it penalizes complex models by shrinking the coefficients of less significant variables towards zero. A high value of C gives less weight to the penalty, while a low value gives it more. Because the training data might not reflect real-world conditions, smaller values of C are generally preferable.

For the decision tree classifier, we optimize the tree's maximum depth. Limiting the depth stops the tree from splitting further once its nodes are sufficiently pure, which helps avoid overfitting. 'Entropy' was selected as the splitting criterion for this classifier.

Using grid search, we also optimize the number of neighbors k for K-Nearest Neighbors. Finally, we optimized the var_smoothing parameter for the Naive Bayes classifier: it adds a fraction of the largest feature variance to all feature variances, smoothing the Gaussian distribution curve and making the model more robust to outlier observations.
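A sketch of the grid search with 10-fold cross-validation follows; the candidate hyperparameter values are illustrative assumptions, since the exact grids are not listed in the post:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Candidate hyperparameter grids for each classifier (values are assumptions)
param_grids = {
    "Logistic Regression": (LogisticRegression(max_iter=1000),
                            {"C": [0.01, 0.1, 1, 10, 100]}),
    "Decision Tree": (DecisionTreeClassifier(criterion="entropy", random_state=42),
                      {"max_depth": [2, 3, 4, 5, 6, 7, 8]}),
    "KNN": (KNeighborsClassifier(),
            {"n_neighbors": list(range(1, 31))}),
    "Naive Bayes": (GaussianNB(),
                    {"var_smoothing": np.logspace(-9, 0, 10)}),
}

# 10-fold cross-validated grid search for each model
for name, (model, grid) in param_grids.items():
    search = GridSearchCV(model, grid, cv=10, scoring="accuracy")
    search.fit(X_train, y_train)
    print(name, search.best_params_, round(search.best_score_, 4))
```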

Discretization

We discretized "age", "trestbps", "chol", and "oldpeak" because they contain continuous values. Classification methods such as decision trees split on the attribute with the highest information gain, but a high number of unique values biases the information gain, so such attributes should be binned (discretized). After binning, for example, an age of 25 falls into a 20–25 bin rather than being treated as its own single category.

We discretized these attributes using Python's quantile-based discretization function (qcut). By default, qcut divides the instances roughly equally among the bins, so the bin intervals themselves differ in width: one bin might span a range of 20 while the next spans only 10, yet each holds approximately the same number of instances. All of these continuous attributes were divided into four bins, with the exception of "oldpeak", which has a small range of values. The bins were then converted to ordinal values with sklearn's LabelEncoder, so that, for example, the first age bin is replaced by 0, the next bin by 1, and so on.
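A minimal sketch of the discretization, assuming the same DataFrame; here qcut's labels=False returns the ordinal bin codes directly, which gives the same result as converting the interval bins with LabelEncoder:

```python
import pandas as pd

# Quantile-based discretization: qcut puts roughly equal numbers of instances
# into each bin, so the bin widths themselves can differ
continuous_cols = ["age", "trestbps", "chol", "oldpeak"]

for col in continuous_cols:
    # Four bins per attribute; labels=False yields ordinal codes 0, 1, 2, 3
    df[col] = pd.qcut(df[col], q=4, labels=False, duplicates="drop")
```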

Based on the discretized dataset and grid search, the following results were obtained:

Discretizing the dataset affected each model's accuracy differently. KNN's overall accuracy improved significantly from its benchmark of 70.24%, with a corresponding increase in precision, recall, and f1-score. Compared with their benchmarks, the logistic regression, Naive Bayes, and decision tree classifiers showed no significant improvement in accuracy. Although the decision tree classifier's precision improved to 1, its recall degraded. This preprocessing step therefore had the greatest impact on KNN.

Selection of features

SelectKBest selects the top k features based on a scoring method; here the scores are calculated with the chi2 function. A high chi2 statistic between a predictor and the response variable indicates dependence between the two, which makes that predictor significant.

To select the top k features for each model, we first need to choose k. We created a pipeline with the SelectKBest tool for each classifier and then used grid search over values of k from 1 to 6. This range was kept deliberately small so that a heart disease diagnosis would rely on as few features as possible; increasing the range of k produced no significant difference in the results across the classifiers, so 6 was used as the upper bound for the grid search.
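A sketch of the pipeline for one of the classifiers (KNN), with both k for SelectKBest and the classifier's own hyperparameter searched together; the parameter ranges are assumptions:

```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Pipeline: keep the top k features by chi2 score, then fit the classifier
pipe = Pipeline([
    ("select", SelectKBest(score_func=chi2)),
    ("knn", KNeighborsClassifier()),
])

param_grid = {
    "select__k": list(range(1, 7)),          # k from 1 to 6
    "knn__n_neighbors": list(range(1, 31)),  # tuned jointly with k
}

search = GridSearchCV(pipe, param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_)

# Inspect the chi2 scores of the features chosen by the best pipeline
selector = search.best_estimator_.named_steps["select"]
print(selector.scores_)
```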

In both pipeline and grid searches, all four classifiers produced the same number of features. Based on the top six chi2 scores for the four classifiers, the SelectKBest tool selected the following features:

The overall accuracy of KNN as well as recall and f1-score were improved by applying the feature selection preprocessing step. The F-score and precision for Naive Bayes were both improved slightly. There was a slight improvement in both accuracy and recall for the logistic regression. The accuracy, precision, and f1-score of Decision Tree classifiers decreased. Recall remained the same. The KNN model is thus the best performer.

To determine whether any attribute combinations are more frequently associated with heart disease, we ran Apriori in Weka. Even though the resulting association rules do not indicate causality, they can serve as a self-exam checklist: a person who shows more than one of these symptoms should be aware of their heart disease risk and go to the hospital for further testing.

Before running the association rules, we used J48's accuracy to decide how to discretize the data. Among many different set-ups, we chose to discretize the continuous numerical attributes into three equal-frequency bins.

Weka’s association rules are shown below, with confidence as the metric type, 0.9 as the minMetric, and 10 as the number of rules:

Looking at the rules whose right-hand side is "has heart disease", some attributes appear repeatedly in various combinations: "thalach" (maximum heart rate), "ca" (number of major vessels visible under fluoroscopy), "thal" (thalassemia), "exang" (exercise-induced angina), and "age". "ca=0" and "thal=2" appear in all nine rules that indicate heart disease, probably because of class imbalance: cases that meet these two conditions are significantly more common.
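The mining itself was done in Weka; as a rough Python equivalent (an assumption, not the workflow used here), the mlxtend library can run Apriori on one-hot encoded attribute=value pairs and filter rules by confidence:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encode attribute=value pairs so Apriori treats each pair as an item
items = pd.get_dummies(df.astype(str))

# Mine frequent itemsets, then keep rules with confidence >= 0.9
frequent = apriori(items, min_support=0.3, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.9)

# Focus on rules whose consequent is target=1 (heart disease present)
rules_hd = rules[rules["consequents"].apply(lambda s: "target_1" in s)]
print(rules_hd.sort_values("confidence", ascending=False).head(10))
```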

A person may have a higher risk of heart disease if he/she has the following symptoms, based on WEKA’s association rules.

  1. An elevated resting blood pressure of more than 161.5 mmHg
  2. No major vessels visible under fluoroscopy ("ca" = 0)
  3. A thalassemia ("thal") value of 2
  4. A fasting blood sugar that is not above 120 mg/dl
  5. Chest pain induced by exercise
  6. An age between 29 and 52

Conclusions

The following is a summary of each classifier’s accuracy score:

For all classifiers except the decision tree, feature selection was the best pre-processing step. The decision tree performed quite well at the benchmark and similarly after discretization, but feature selection hurt its performance; decision trees are sensitive to changes in the data, so removing attributes can either help or harm them. KNN's performance was affected most by preprocessing, especially discretization, compared with both the other preprocessing step and the benchmark. Naive Bayes and logistic regression remained fairly consistent across the two preprocessing steps. Overall, the K-Nearest Neighbors classifier performs best.

A model's confusion matrix must be evaluated alongside its overall accuracy. Except after the feature selection step, the models predicted more false positives than false negatives; such misclassifications carry high costs because they can negatively affect a patient's life.

Focusing on the key features of the test data was important for improving predictions. It is also interesting that the Naive Bayes and logistic regression models did not gain much precision or recall from preprocessing and feature selection.

Final thoughts

In this project, we applied the techniques we learned in course 273 to explore this heart disease dataset, determine which attributes are better for predicting heart disease, and improve the models' accuracy. We did not consider the cost of a wrong prediction, which is crucial for selecting the right model in a real-world scenario; in the future, attributes could be weighted according to cost-efficiency. A larger dataset could also help improve model accuracy.

Fluoroscopy, chest pain, exercise-induced angina, ST depression, age, and sex are good factors to look at to determine whether a person has a heart condition or not.

Using the KNN classifier model, doctors can improve their predictions, insurance companies can improve their cost-efficiency, and testing laboratories can improve their efficiency. Conversely, the Association rules are beneficial for companies and developers who want to sell heart-related products and services through digital interfaces.
