How I built a Model to Predict Heart Disease
Heart disease is one of the most common health problems in the world today. According to the New York State Department of Health, about 695,000 people die of heart disease in the United States every year. That’s 1 in every 5 deaths, and it doesn’t even include the people it affects in other countries. People are losing their loved ones, and society is losing influential people, to this life-threatening disease. But what if we could do something to stop this?
This is where AI comes into play. People have been using Artificial Intelligence (AI) to fight against heart disease in some interesting ways. AI can analyze huge amounts of data quickly and find patterns that might be hard for humans to spot. For example, researchers are using AI to study medical records and identify factors that could increase the risk of heart disease. A lot of machine learning engineers have also been developing systems that can help doctors diagnose heart disease faster.
After researching this problem I decided to try to build a basic machine learning model that figures out if a patient has a healthy or defective heart.
Now let’s get into it!
Steps for Creating the Model
Importing the libraries and Setup
First, I downloaded the dataset as a CSV file. Each row of this file is one patient record, with a column for each input variable and a target value (1 or 0) that represents whether or not the person has heart disease. To build the model I need to import packages such as numpy, pandas, sklearn, and seaborn. I imported numpy so that I could turn the input data that a user gives into an array that the model can easily read. I imported pandas so I could load the CSV file and analyze the different pieces of the dataset. I imported seaborn so that I could visualize the dataset and spot trends in the data. From sklearn I imported the tools for splitting the data, scaling the features, training the model, and evaluating it.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
Let’s take a look at the two keywords being used:
from: This keyword names the library (or a module inside it) that you want to import something from. It can also point at a module in your own project folder.
import: This keyword names the specific module, function, or class that you want to bring into your code.
Collecting and Processing the Data
Before I make my model I need to make sure that the data I train the model on is clean and processed in a way that my model can easily understand it. To do so I had to first get information about the dataset. This included what it looked like, statistics, the shape, etc.
#Loading the CSV file data into a pandas DataFrame
heartData = pd.read_csv("/content/heart_disease_data.csv")
#printing the first 5 rows of the dataset
heartData.head()
#printing the last 5 rows of the dataset
heartData.tail()
#number of rows and columns in the dataset
heartData.shape
#more info about the data
heartData.info()
#cleaning data by checking for missing values
heartData.isnull().sum()
#statistics about the data
heartData.describe()
#checking distribution of the target variable
heartData['target'].value_counts()
Let’s consider each of these:
read_csv(): reads the CSV file into a pandas DataFrame, a table that can easily be processed and manipulated using other functions
head(): shows the first 5 rows of the dataset
tail(): shows the last 5 rows of the dataset
shape: an attribute (not a function, so it has no parentheses) that holds the number of rows and columns in the dataset
info(): shows the number of rows and columns, the column labels, column data types, memory usage, the range index, and the number of non-null values in each column.
isnull(): marks where there are null values in the dataset, and sum() then counts them per column. This is needed because missing values can make the model inaccurate or cause errors during training.
describe(): shows statistics about the data such as the count, mean, minimum, and maximum of each variable within the dataset
value_counts(): shows the count of each value of a variable. In this case the only values the variable target can have are 0 and 1. As a result, it shows the number of 0’s and the number of 1’s.
Yeah so I know it looks like a lot, but once you know what the keywords mean it’s pretty easy to understand.
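To see these calls in action without the real CSV, here is a minimal sketch on a tiny made-up table. The DataFrame name toy and all of its values are assumptions for illustration only; they are not rows from the heart disease dataset.

```python
import pandas as pd

# Tiny made-up table standing in for the real CSV (illustrative values only)
toy = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 44],
    "chol": [233, 250, None, 236, 354, 263],  # one missing value on purpose
    "target": [1, 1, 1, 0, 0, 1],
})

n_rows, n_cols = toy.shape                    # shape is an attribute, not a method
missing = toy.isnull().sum()                  # number of missing values per column
class_counts = toy["target"].value_counts()   # number of rows for each target value

print(n_rows, n_cols)
print(missing)
print(class_counts)
```

On the real dataset the same three lines answer the questions "how big is it?", "is anything missing?", and "how balanced are the classes?".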
Visualizing the Data
To properly understand the trends in the data I used seaborn to visualize it. Data visualization is basically making a visual representation of the dataset that you are using to train the model. You can visualize data with different types of graphs, plots, and charts. In this case I chose a boxplot. I focused on the variable age and looked at how its distribution differs between the two target values. In simple terms, I was able to compare the age range of people who have heart disease with the age range of people who don’t.
sns.boxplot(y=heartData["age"], x=heartData["target"])
Let’s take a look at the different arguments this line of code uses. Here we call the boxplot() function from the seaborn library. This creates a boxplot with one variable on the x axis and another on the y axis. I set the y axis to the age column of the heartData dataset and the x axis to the target column. As a result, it creates one box for each of the x values, which are 0 and 1. Each box shows the median, the quartiles, and the overall range of age for that target value (a boxplot does not show the mean by default).
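If you want the numbers behind a boxplot rather than the picture, a rough equivalent is to group by the target and ask pandas for the quartiles. This is only a sketch on made-up ages; the toy DataFrame is an assumption and does not come from the real dataset.

```python
import pandas as pd

# Made-up ages for two groups (illustrative values only)
toy = pd.DataFrame({
    "age": [40, 45, 50, 55, 60, 50, 55, 60, 65, 70],
    "target": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})

# describe() per group gives the same summary a boxplot draws:
# min, 25th percentile, median (50%), 75th percentile, and max
box_stats = toy.groupby("target")["age"].describe()[["min", "25%", "50%", "75%", "max"]]
print(box_stats)
```

Each row of box_stats corresponds to one box in the plot: the 25% and 75% columns are the edges of the box and the 50% column is the line inside it.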
Splitting the Features and Target
In order to be able to approach this problem as a binary classification problem(output is between two options) we have to split the features and the target variable. In other words we have to make sure that we split our data into input variables and the output values of 0 and 1. This is important so we can see how the input data affects the output data. In order to do so I made use of the function drop() in the pandas library.
X = heartData.drop(columns='target')
Y = heartData['target']
print(X)
print(Y)
In this code segment X is the original dataset without the target column. In other words it just contains the features(input variables) of the dataset. Y is just the target column of the original dataset, which is just a column with the values 0 and 1 indicating if a person has a healthy heart or unhealthy heart.
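Here is a small sketch of the same split on a made-up table, mainly to show that drop() returns a new table and leaves the original untouched. The toy DataFrame and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up table with two features and a target column (illustrative values only)
toy = pd.DataFrame({
    "age": [63, 37, 41],
    "chol": [233, 250, 204],
    "target": [1, 1, 0],
})

X_toy = toy.drop(columns="target")  # features only; toy itself is not modified
y_toy = toy["target"]               # just the target column

print(list(X_toy.columns))
print(y_toy.tolist())
```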
Splitting the Training and Testing Data
In order to properly make and test the model we have to split our dataset into training and testing data. The training data is what the model uses to learn how to identify whether a person has heart disease or is healthy. The testing data is what will be used to test the accuracy of the model on examples it has never seen. Imagine you’re the teacher at school and one of your students is the model. The training data would be the homework and ungraded assignments you give them to learn the content. The final exam would be the testing data, as it shows whether your student actually knows the content.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)
print(X.shape, X_train.shape, X_test.shape)
Let’s break down the arguments of this function:
We are passing 5 different arguments into this function.
X: X in this case is just the features of the dataset. This basically just contains all of the different features and variables that help us determine our target value.
Y: Y just represents the outputs of our dataset. In this case it is the target value 0 or 1.
test_size: This variable represents the proportion of the dataset to include in the test split. In simple terms it is the percentage of the dataset that will be used to test the model later. In this case we are using 20 percent of the dataset as test data.
stratify: This helps us do a technique called stratified sampling. In simple terms it ensures that the distribution of labels will be the same in the training and test sets as they are in the original dataset, limiting a bias towards healthy or unhealthy hearts.
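random_state: This number is the seed for the random shuffling that happens before the split. Using the same value every time means you get the same training and testing sets on every run, which makes the results reproducible.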
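To check what stratify actually does, here is a minimal sketch on made-up labels with a 60/40 split between the classes. The arrays X_demo and y_demo are assumptions for illustration and are not the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 60 labelled 1 and 40 labelled 0
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 60 + [0] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# With stratify, the share of 1's is the same in every piece
print(y_demo.mean(), y_tr.mean(), y_te.mean())

# Same random_state, same split: running it again gives identical test rows
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)
print(np.array_equal(X_te, X_te_again))
```

Without stratify the test set could end up with noticeably more of one class just by chance, which matters more the smaller the dataset is.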
Scaling the Features
Feature scaling is the process of normalizing the range of features in a dataset. Real-world datasets often contain features that are varying in degrees of magnitude, range, and units. In order to keep out bias towards one feature we need to perform feature scaling. To do so I used the StandardScaler from sklearn.preprocessing to scale the features.
sc_x = StandardScaler()
xtrain = sc_x.fit_transform(X_train)
xtest = sc_x.transform(X_test)
print(xtrain[0:10, :])
StandardScaler in machine learning is like a chef’s tool that helps ensure all the ingredients in a recipe are on a similar scale. Imagine you’re following a recipe where some ingredients are in grams, some in kilograms, and others in milliliters. StandardScaler helps you convert everything to a standard unit, like grams, so that each ingredient contributes equally to the overall flavor of your dish.
Let’s take a look at the functions used to scale the features:
fit_transform(): This function first computes the parameters needed for the transformation, which are the mean and standard deviation of each feature. It then transforms the training data so that each feature has a mean of 0 and a standard deviation of 1. This puts all features on a comparable scale so that no feature dominates just because its numbers are bigger. This function takes the training features as its input.
transform(): This function only performs the scaling, using the mean and standard deviation that were already computed by an earlier fit. It does not compute any new parameters.
Although these functions are pretty similar, it is important to understand the difference. Imagine you’re learning a new recipe, and part of the process involves preparing a special sauce. You first read the recipe (fitting), gather all the necessary ingredients, mix them together, and finally apply the sauce to the dish (transform). In this analogy, fit_transform() is like following the entire recipe from start to finish, including both preparation and application. Now, let’s say you’ve already mastered the sauce recipe and prepared a batch. Later, you decide to use the same sauce in a different dish. You wouldn’t need to go through the entire recipe again; you’d simply take your pre-made sauce and apply it to the new dish. In this analogy, transform() is like applying the already prepared sauce to a different dish without going through the full recipe. This is why the test data only gets transform(): the scaler should learn its mean and standard deviation from the training data alone, so that nothing about the test data leaks into training.
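The difference between the two calls is easy to see on a few made-up rows. In this sketch the arrays train_demo and test_demo are assumptions for illustration; the point is that the test row is scaled with the training mean and standard deviation, not its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features: column 0 could be age, column 1 could be cholesterol
train_demo = np.array([[50.0, 200.0], [60.0, 240.0], [70.0, 280.0]])
test_demo = np.array([[60.0, 300.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_demo)  # learns mean/std from train, then scales it
test_scaled = scaler.transform(test_demo)        # reuses the train mean/std, learns nothing new

print(scaler.mean_)                # per-feature means learned from the training rows
print(train_scaled.mean(axis=0))   # approximately 0 for each feature
print(train_scaled.std(axis=0))    # approximately 1 for each feature
print(test_scaled)                 # test row expressed in training-set units
```

The test row’s first feature equals the training mean, so it scales to 0, while its second feature is above the training mean and comes out positive.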
Handling Imbalanced Classes
In order for the model to not be biased towards one output more than another we have to handle imbalanced classes. In simple terms we have to make sure that the training dataset has a similar number of instances with target value 1 and target value 0. There are two ways we can go about doing this: undersampling or oversampling. In this case we have more values of 1 than values of 0. Since it is better practice to keep as much data as possible, we will use oversampling and add more examples of instances that have the target value of 0. In simple terms, we increase the number of 0’s to match the number of 1’s.
print("Before OverSampling, counts of label '1': {}".format(sum(Y_train == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(Y_train == 0)))
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state = 2)
#to_numpy() gives a plain array; Series.ravel() is deprecated in newer pandas versions
X_train_res, Y_train_res = sm.fit_resample(X_train, Y_train.to_numpy())
print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(Y_train_res.shape))
print("After OverSampling, counts of label '1': {}".format(sum(Y_train_res == 1)))
print("After OverSampling, counts of label '0': {}".format(sum(Y_train_res == 0)))
SMOTE(): SMOTE (Synthetic Minority Over-sampling Technique) is a way to oversample the minority class in the dataset. In other words it helps us generate data for the class that doesn’t have enough data points. SMOTE does this by creating new data points from already existing data points in the minority class. However, it doesn’t directly replicate the existing data points. Instead it picks an existing point, finds one of its nearest neighbors in the same class, and creates a new point somewhere on the line between the two.
fit_resample(): This function helps us resample the original unbalanced data into new balanced data. It helps us transform and resample within the minority class leading to equal counts of both target variables. In other words there are now an equal number of healthy and unhealthy hearts being used to train the model.
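To give a feel for how a synthetic point is made, here is a simplified sketch of the interpolation idea in plain numpy. It is not imblearn’s implementation: the helper synth_point is hypothetical, the points are made up, and it always uses the single nearest neighbor, whereas the real SMOTE picks randomly among several nearest neighbors.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up minority-class points with two features each (illustrative values only)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]])

def synth_point(points, rng):
    """Hypothetical helper: one new point between a random point and its nearest neighbour."""
    i = rng.integers(len(points))              # pick an existing point at random
    base = points[i]
    dists = np.linalg.norm(points - base, axis=1)
    dists[i] = np.inf                          # ignore the point itself
    neighbour = points[np.argmin(dists)]       # closest other minority point
    lam = rng.random()                         # random position along the segment, in [0, 1)
    return base + lam * (neighbour - base)

new_points = np.array([synth_point(minority, rng) for _ in range(4)])
print(new_points)
```

Because every new point lies between two real minority points, the synthetic data stays inside the region the minority class already occupies instead of being an exact copy of any one row.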
Training the Model
In order to train the model I have chosen to use logistic regression. I chose logistic regression because it is a simple, fast, and widely used starting point for binary classification problems. Binary classification is basically a problem that has you choose between two options. For example, deciding between going to a movie or not going to a movie is a choice between two options. Logistic regression predicts the likelihood of a binary outcome by assigning weights to input features, combining them into a single number, and transforming the result into a probability between 0 and 1 using a logistic function. The model then applies a decision threshold (0.5 by default in sklearn’s predict()) to categorize observations based on this probability. In simpler terms, it helps us estimate the chances of something happening or not happening based on given data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_res, Y_train_res)
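X_train_res holds the features (input variables) of the resampled training data, which the model uses to reach its prediction. Y_train_res holds the matching target values, which the model uses to figure out how the target changes with each variable. One thing to note: X_train_res was resampled from the unscaled X_train, so the scaled arrays xtrain and xtest from the scaling step are not actually used by this model. To train on scaled features, xtrain would have to be passed to fit_resample() and xtest to predict().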
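As a rough sketch of what logistic regression does with a single patient once it has weights, the snippet below combines two features, applies the logistic (sigmoid) function, and thresholds the result. The weights, bias, and feature values are made up for illustration; they are not taken from the trained model.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for two features
weights = np.array([0.8, -0.5])
bias = 0.1
patient = np.array([1.2, 0.4])          # made-up feature values for one patient

z = np.dot(weights, patient) + bias     # weighted sum of the features
probability = sigmoid(z)                # estimated probability of class 1
label = int(probability >= 0.5)         # decision threshold at 0.5

print(z, probability, label)
```

Training is the process of finding the weights and bias that make these probabilities match the known targets as closely as possible.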
Model Evaluation
Evaluating the model is a very important step as it lets us know how accurate the model is. There are many different ways to evaluate the model but the ones I chose were accuracy score and classification report. I chose classification report because it includes a bunch of different evaluation metrics including precision, recall, and F1 score.
predictions = model.predict(X_test)
print(classification_report(Y_test, predictions))
The code is basically running the model on the features of the testing data. After we get the predictions we use the classification report to see how accurate they are compared to the actual correct target values (Y_test). In other words we are testing the model’s predictions.
Now let’s take a look at each of the evaluation metrics that the classification report uses.
Classification Report: A report of different metrics that are used to evaluate the model. These include precision, recall, and F1 score.
Recall: Recall, also known as sensitivity or true positive rate, measures the ability of a model to correctly identify all relevant instances (true positives) out of the total actual positives (sum of true positives and false negatives).
In simple terms, recall answers the question: “Of all the actual positive cases, how many did the model correctly identify?” Imagine you’re trying to find your lost keys in a room. Recall is like having a flashlight that helps you see as much of the room as possible. A high recall means your flashlight is broad, ensuring you don’t miss any corners, but you might accidentally illuminate some non-important areas (false positives). Low recall, on the other hand, means your flashlight is narrow, and you may miss some places where your keys could be hiding (false negatives).
Precision: Precision measures the accuracy of the positive predictions made by the model. It assesses how many of the predicted positive instances are actually relevant (true positives) out of the total predicted positives (sum of true positives and false positives). In simple terms, precision answers the question: “Of all the predicted positive cases, how many were correctly predicted?” Precision is comparable to using a metal detector to find rare coins in a field; high precision means that when your detector beeps, it’s very likely a valuable coin (few false positives), while low precision means many of the beeps are false alarms, leading to wasted time digging up non-valuable items (more false positives).
F1 score: The F1 score is a combination of both precision and recall, providing a balance between the two. It is the harmonic mean of precision and recall. In simple terms, the F1 score considers both false positives and false negatives and aims to find a balance between precision and recall. It is particularly useful when the class distribution is imbalanced. The F1 score is like crafting the perfect recipe as a chef. It ensures that your dish (model) achieves both the right balance of ingredients for flavor (precision) and covers all the essential elements to make it complete (recall). Just as a well-balanced recipe creates a delicious meal, a high F1 score indicates a model that successfully balances precision and recall for effective performance.
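The three definitions above can be worked through by hand on a small made-up example and compared against sklearn. The labels in y_true_demo and y_pred_demo are assumptions for illustration, not predictions from the heart disease model.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up true labels and predictions (1 = positive class)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # predicted 1, actually 1
fp = sum(t == 0 and p == 1 for t, p in pairs)  # predicted 1, actually 0
fn = sum(t == 1 and p == 0 for t, p in pairs)  # predicted 0, actually 1

precision = tp / (tp + fp)                          # of predicted positives, how many were right
recall = tp / (tp + fn)                             # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
print(precision_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo),
      f1_score(y_true_demo, y_pred_demo))
```

Here the model finds most of the positives (recall 3 out of 4) but also raises two false alarms (precision 3 out of 5), and the F1 score lands between the two.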
Here is the classification report that I ended up printing out:
0 and 1 are the two options for the output of the model. 0 stands for healthy heart and 1 stands for unhealthy heart.
Now that I’ve talked about the classification report let’s go into the accuracy score metric.
The accuracy score of a model is basically the number of correct predictions the model made divided by the total number of predictions made. In simple terms it’s just the percentage of the testing data that the model got right. Think of accuracy score like a test you are taking at school. The number of questions you got correct divided by the total number of questions, as a percentage, will be your score on the test.
#Training data accuracy score
X_training_prediction = model.predict(X_train_res)
training_data_accuracy = accuracy_score(Y_train_res, X_training_prediction)
print('Accuracy on Training data : ', training_data_accuracy)
#Testing data accuracy score
X_testing_prediction = model.predict(X_test)
testing_data_accuracy = accuracy_score(Y_test, X_testing_prediction)
print('Accuracy on Testing data : ', testing_data_accuracy)
I ended up printing out the accuracy of the model on the testing data and training data which happened to be:
Accuracy on Training data : 0.8371212121212122
Accuracy on Testing data : 0.8360655737704918
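To make the definition of accuracy concrete, here is a minimal sketch that counts correct predictions by hand and compares the result with accuracy_score. The two label lists are made up for illustration and have nothing to do with the numbers printed above.

```python
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions
y_true_demo = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# Count the positions where the prediction matches the true label
correct = sum(t == p for t, p in zip(y_true_demo, y_pred_demo))
manual_accuracy = correct / len(y_true_demo)

# accuracy_score expects the true labels first, then the predictions
sklearn_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(correct, manual_accuracy, sklearn_accuracy)
```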
Analysis of the Results
The accuracy on the training data and the accuracy on the testing data were very similar, which suggests the model is not overfitting. Both of them were about 83–84%, which is a pretty decent accuracy considering the dataset I used. To increase the accuracy, I could try a larger dataset or a different algorithm.
In terms of the classification report, the precision score was higher for unhealthy hearts, while the recall was higher for healthy hearts. The F1 score was overall barely higher for unhealthy hearts than for healthy hearts. In all metrics the score ranged from 0.8 to 0.87, which means that the logistic regression model did a pretty good job on the binary classification problem it was assigned. However, the model performed very slightly better on unhealthy hearts, as the F1 score was barely lower for healthy hearts. One possible reason is that the healthy class was the one we had to oversample, so part of its training data was synthetic.
Main Takeaways
Although the model didn’t achieve a higher accuracy I feel like I did a good job of coding the model and testing it. I learned a lot of new data science knowledge and techniques to use to optimize the model. I also learned how to manipulate data in different ways. Stay tuned for my next article showing off my Replicate #2.
If you would like to check out the whole project the link is below:
https://github.com/nikhilk476/artificialIntelligence/tree/main