# Can Machine Learning Predict Heart Disease?

*Written by: Julian Bridy, Carmen Chen, Marvin Moran, and Cindy Wang*

The healthcare industry has long been an early adopter of machine learning for improving information management and patient care. From assisting doctors with diagnoses by analyzing X-ray images to helping hospitals handle big data, machine learning is transforming the way healthcare providers deliver patient care.

In this project, we aim to understand which biological variables can predict a patient’s likelihood of having heart disease. The dataset contains 11 independent variables used to predict the dependent variable, HeartDisease.

# Objective

*Patient Care:* To predict the likelihood of a patient developing heart disease

*Identify Main Indicators:* Understanding which indicators are most significant to predicting heart disease may help prevent heart disease

# Dataset

The data was made available on Kaggle. The dataset is a combination of five different datasets from various data repositories.

The 11 independent variables are: Age, Gender, Chest Pain Type (ChestPainType), Resting Blood Pressure (RestingBP), Cholesterol, Fasting Blood Sugar (FastingBS), Resting Electrocardiogram (RestingECG), Max Heart Rate (MaxHR), Exercise-Induced Angina (ExerciseAngina), ST depression induced by exercise relative to rest (Oldpeak), and the slope of the peak exercise ST segment (ST_Slope).

# Data Analysis

Before diving deep into the ML modeling, we took a look at individual variables to see if we could glean any critical insights. Using Pearson correlation in Python, we calculated and visualized a correlation heatmap. ExerciseAngina, Oldpeak, and ST_Slope showed the strongest correlations with heart disease, with coefficients ranging from roughly -0.56 to 0.49. Let’s dive into a few of these variables a little more.
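As a rough sketch of this step (the project’s exact code and data are not reproduced here), the heatmap values come from a Pearson correlation matrix; the tiny DataFrame below is a fabricated stand-in for the real dataset:

```python
# Minimal sketch of the Pearson correlation step, assuming a pandas DataFrame
# shaped like the heart dataset. The values below are fabricated stand-ins.
import pandas as pd

df = pd.DataFrame({
    "Oldpeak":      [0.0, 1.5, 2.3, 0.2, 1.1, 2.8, 0.0, 1.9],
    "MaxHR":        [172, 120, 108, 165, 130, 99, 180, 110],
    "HeartDisease": [0, 1, 1, 0, 1, 1, 0, 1],
})

# Pearson correlation matrix; in the project this was rendered as a heatmap
# (e.g., with seaborn: sns.heatmap(corr, annot=True)).
corr = df.corr(method="pearson")

# Correlation of each feature with the target only.
target_corr = corr["HeartDisease"].drop("HeartDisease")
print(target_corr)
```

On the real data, categorical columns such as ChestPainType would first need to be encoded numerically before `df.corr()` can include them.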

To visualize the data, the dataset variables were converted to nominal variables before using Tableau to analyze the individual variables.

As resting BP increases, so does the likelihood of heart disease. This is not a surprise, as lower resting BP is commonly associated with a healthy heart and lifestyle. Although resting BP is normally distributed, it is surprising that a decent number of people with normal resting BP still experience heart disease.

Exercise-induced angina is another variable that demonstrated some correlation: more people who experienced exercise-induced angina had heart disease than those without it. Angina happens when the heart demands more blood due to physical stress, and the pain can be so intense that some people mistake it for a heart attack. Looking at the data, this may mean that people with heart disease experience angina and chest pain because of clogged arteries or another condition that weakens the heart’s ability to pump blood.

Finally, Oldpeak measures ST depression induced by exercise relative to rest, rather than a literal peak on the ECG chart. An Oldpeak of zero corresponds to an “Up” ST slope, while larger values indicate a weakened heart, and those with elevated Oldpeak values were more likely to have heart disease. ST elevation is also an indicator of a heart attack, and as shown in the chart, almost all of the observations with an elevated ST segment have heart disease.

# ML Methodology

To select a model with high accuracy in predicting whether a patient has heart disease, we compared the accuracy of four models: KNN, decision tree, logistic regression, and a one-layer deep learning network.

## Model 1 — K-Nearest Neighbor (KNN)

The KNN classifier is a supervised learning technique used to classify a data point based on how its neighbors are classified. The input consists of the K closest training examples in the dataset, and the new point is classified according to a similarity measure. KNN is best utilized when labeled data is already available. This process allows us to predict whether a patient has heart disease based on its K nearest neighbors and how similar the new data point is to each of them.

The first step was to prepare the data for the KNN model by importing our sklearn package and then proceeding to split out test and train data. We chose a random state of 66.

*Process*

Step 1: We began by converting all of our categorical variables to numerical

Step 2: Tried n_neighbors values from 1 to 14 (Python range(1, 15)) to assess the accuracy of different n values

Step 3: Next we built the model and recorded the train and test accuracy
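The three steps above can be sketched as follows; the synthetic data from `make_classification` is a stand-in for the encoded heart dataset, so the exact accuracies will differ from those reported below:

```python
# Sketch of the KNN workflow (Steps 1-3). Synthetic data stands in for the
# categorical-to-numerical encoded heart dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=900, n_features=11, random_state=66)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=66)

train_acc, test_acc = [], []
for n in range(1, 15):  # try n_neighbors from 1 to 14
    knn = KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train)
    train_acc.append(knn.score(X_train, y_train))
    test_acc.append(knn.score(X_test, y_test))
```

Plotting `train_acc` and `test_acc` against n is what reveals the overfitting/underfitting trade-off discussed next: with n = 1 the training accuracy is a perfect 1.0 (each point is its own nearest neighbor), and it falls as n grows.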

As the n value increased, accuracy on the training data decreased while accuracy on the test data increased.

N = 13 is the optimal number of neighbors, as it is the point where we achieve good accuracy on both the training and the test data without overfitting or underfitting.

With N = 13, the model produced an accuracy of 0.73, a precision of 0.77, and a recall of 0.73. Since this model is used to predict heart disease, the Type II error (false negative) rate is an important metric to take into consideration. The KNN model produced 34 Type II errors out of 230 test observations, which is not ideal. Next, we will look at the decision tree classifier to see if we can develop a better model.

## Model 2: Decision Tree Classifier

Decision Trees are a supervised learning method used to create a model that predicts the value of a target variable — in our case whether individuals would have heart disease — by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation. This is best used when classifying and predicting.

*Process:*

Step 1: We defined our x variables to train on and our y variable to predict. The data was then split into training and testing sets with a 0.3 test size and a random state of 42

Step 2: Generated the decision tree model, trained it, and predicted the outcome for the test data.

*Below is the initial result*

The initial decision tree yielded an accuracy of 0.75, a precision of 0.84, and a recall of 0.72, which are decent numbers. But the model is flawed: the Type II error (false negative) count is relatively high at 46. Since we are trying to predict heart disease, our model should keep the number of false negatives as low as possible.

Step 3: Pruning the tree. Limiting the tree’s depth removes splits that fit noise in the training data, which reduces overfitting and should improve generalization to the test set.

Pruning the tree produced a more accurate model with an improved Type II error. We used a depth of 5; up to this point, the code in the screenshots above used a depth of 3, but a depth of 5 improves all the performance measures.
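A minimal sketch of the pruned tree, again on synthetic stand-in data; `max_depth=5` matches the depth described above, and the confusion matrix shows where the Type II (false-negative) count comes from:

```python
# Sketch of the pruned decision tree with a confusion matrix for reading off
# Type II errors. Synthetic data stands in for the heart dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = make_classification(n_samples=900, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

tree = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)
y_pred = tree.predict(X_test)

acc = accuracy_score(y_test, y_pred)
# sklearn's confusion matrix is [[tn, fp], [fn, tp]]; fn is the Type II count.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

# Basis of the feature-importance plot discussed below (sums to 1).
importances = tree.feature_importances_
```

Sorting `importances` against the column names is what produces the feature-importance ranking shown in the next figure.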

Pruning the tree increased accuracy from 76% to 84%, an eight-percentage-point improvement.

Finally, we visualized these features to drill down on those that would be most impactful (below).

Similar to the Tableau plots, ST_Slope, Oldpeak, and ChestPainType show the highest importance levels.

## Approach 3: Logistic Regression

Logistic regression is a supervised learning technique used to predict a dependent target variable. It is typically best suited to predicting a binary outcome, such as yes or no. Because the model assigns probabilities, it will also help us determine the most impactful independent variables.

*Process:*

Step 1: Imported the logistic regression classifier through our sklearn package

Step 2: We defined our x and y variables.

Step 3: Split our data set into train and test with 0.3 test size and random state of 42.

Step 4: Generated the Logistic Regression Classifier, trained the model, and then predicted the outcome for our test data
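The four steps above can be sketched as follows, again with synthetic stand-in data in place of the heart dataset:

```python
# Sketch of the logistic regression workflow (Steps 1-4). Synthetic data
# stands in for the heart dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=900, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
test_acc = logreg.score(X_test, y_test)

# The per-class probabilities mentioned above; each row sums to 1.
probs = logreg.predict_proba(X_test)
```

The fitted coefficients (`logreg.coef_`) are one way to gauge which independent variables pull the predicted probability up or down.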

The logistic regression model performed better than the previous two models. This is no surprise, since this type of model is known to be effective with a two-class outcome.

## Approach 4: Deep Learning

Deep learning is a type of machine learning based on artificial neural networks, in which multiple layers of processing extract progressively higher-level features from data by continuously re-adjusting the weights on the input variables. Although deep learning is often associated with unsupervised feature learning, here we use it as a supervised classifier.

*Process*

Step 1: We utilized our sklearn package to import our MLP Classifier

Step 2: We then trained our model and predicted outcomes, with initial accuracies of 0.77 and 0.79 on train and test, respectively

Step 3: To improve accuracy, we proceeded to rescale our data

Step 4: We imported our StandardScaler package from sklearn in our preprocessing effort.

Step 5: We proceeded to train and test the new scaled variables within the MLP model.
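The five steps can be sketched as follows, with synthetic stand-in data; one feature is given an exaggerated scale to illustrate why StandardScaler matters for an MLP:

```python
# Sketch of the MLP workflow (Steps 1-5): fit once on raw features, then again
# after StandardScaler rescaling. Synthetic data stands in for the dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=900, n_features=11, random_state=42)
X[:, 0] *= 1000  # exaggerate one feature's scale so rescaling matters
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# One hidden layer, matching the "1-layer deep learning" model in the text.
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=42)
raw_acc = mlp.fit(X_train, y_train).score(X_test, y_test)

# Fit the scaler on the training set only to avoid leaking test statistics.
scaler = StandardScaler().fit(X_train)
mlp_scaled = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500,
                           random_state=42)
scaled_acc = mlp_scaled.fit(scaler.transform(X_train), y_train).score(
    scaler.transform(X_test), y_test)
```

Comparing `raw_acc` and `scaled_acc` is the same before/after comparison reported in the results below.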

*Results*

- Our result improved accuracy to 0.905 and 0.909 on train and test, respectively
- This preprocessing effort produced a dramatic relative accuracy increase of nearly 17%!

*Deep Learning Correlation Plot*

Each square shows the correlation between the variables on each axis. Correlation ranges from -1 to +1; values closer to zero mean there is no linear trend between the two variables. The closer a correlation is to 1, the more positively correlated the variables are: as one increases, so does the other, and the closer to 1, the stronger the relationship.

Even though the algorithm consists of a single neural network layer, the model is more complex than the previous three. As we saw with the decision tree classifier, pre-processing techniques had a significant impact on this model’s performance as well: rescaling the data increased the accuracy from 77% to 91%.

# Key Findings

## General Findings

**False negatives are expensive.** If we predict that an individual does not have heart disease but they do, the result may be death from a heart attack. The Type II error rate is therefore critical for selecting the optimal model; all else equal, the model that produces the lowest Type II error is the best model.

**Variables:** 10 of the 11 independent variables are significant, and two different techniques identified slightly different top-three contributors. For the Pearson correlation visual, the top three contributors are ST_Slope, ExerciseAngina, and Oldpeak. For the feature importance visual, the top three contributors are ST_Slope, ChestPainType, and Oldpeak. The top three variables change because Pearson’s correlation captures *linear relationships* between the input and target variables, whereas decision tree feature importance identifies which variables are most influential when *differentiating classes*.

## Model-Specific Findings

Not all models performed the same relative to performance measures.

Of the four models, deep learning was the most effective in all performance measures for the type of prediction we are targeting. It had the best accuracy, precision, recall, and type II error.

# Challenges

The dataset used for this project was rather small at 906 observations. Although deep learning produced a good model, we would have liked to use a larger dataset to develop a better model.

The second challenge we faced was identifying the proper techniques (models) to meet our needs. We were looking for models equipped to handle the prediction of a binary variable (e.g., ‘Yes’ or ‘No’). In addition to the models mentioned in this paper, we also explored others, such as multiple regression and time series, but they did not make much sense for our type of data.

While measures like accuracy, precision, and recall are critical for model selection, we determined that because a Type II error in this type of prediction is costly (as in loss-of-life costly), it should become the ultimate measure to consider. Luckily, we found one model that meets the criteria: the highest accuracy, precision, and recall, and the lowest Type II error rate.

Lastly, given that we have a small dataset, splitting the data was a bit tricky. However, we found that a 70/30 train/test split worked best for most models.

# Conclusion

These types of models would have significant value in business settings such as clinics, hospitals, and similar institutions. For example, they can be used to identify patients who may be at risk of developing heart disease so that treatment can begin proactively rather than reactively.

The benefit? Well, saving lives!

Not all variables captured in the study contribute equally towards the prediction of heart disease. Here are two recommendations on how we can capture better data:

- Conduct dimensionality reduction exercises to focus only on valuable variables.
- Capture different variables to predict.

As the initial model iterations were built, it quickly became evident that not all models perform the same on the selected performance measures, and the right measures depend on the type of ML exercise. For us, the deciding principle was the cost of being wrong. To ensure the proper data is collected, the data collector should think carefully about the end goal of the ML exercise and pair it with the appropriate performance measures and variables.