Heart Disease Classifier

Wejdan Aljadani · Published in Analytics Vidhya · Feb 20, 2020 · 5 min read

Heart disease is one of the most common conditions in healthcare, and it has a serious impact on a patient's life. In this post, we want to predict whether a patient has heart disease or not.

Before we start the steps of this project, we need to define the problem. Ours is a binary classification problem: we need to classify patients as having heart disease or not, and misclassifying such a critical case is costly. Based on these characteristics, I chose two machine learning models for classification (Logistic Regression and LinearSVC), and I decided to use three metrics to evaluate their performance: precision, recall, and the confusion matrix.

We have five steps to build this project as follows:

1.Data gathering and loading

2.Data Exploration

3.Data Cleaning and Preprocessing

4.Build the ML models

5.Results

1.Data gathering and loading

I used the heart disease dataset from Kaggle; you can find it by following this link.

This dataset has 14 columns, such as age, sex, chest pain type, and the target column indicating whether the patient has heart disease or not.

figure1: load dataset
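The loading step, in outline, looks something like the sketch below (the file name heart.csv is an assumption based on the usual Kaggle download):

```python
import pandas as pd

# Load the Kaggle heart disease dataset
# (file name "heart.csv" is assumed from the usual Kaggle download)
df = pd.read_csv("heart.csv")
print(df.shape)   # the standard Kaggle version has 303 rows and 14 columns
df.head()
```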

2.Data Exploration

In this step, I explored the dataset to obtain more information about it.

I started with the target variable, which is the column we use to classify the patients.

figure2: percentage of the patients for each class
figure3: Number of patients for each class
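A minimal sketch of how these counts and percentages could be produced, assuming the target column is named target with 1 = heart disease and 0 = no heart disease:

```python
import matplotlib.pyplot as plt

# Class balance of the target column (1 = heart disease, 0 = no heart disease)
counts = df["target"].value_counts()
percentages = df["target"].value_counts(normalize=True) * 100
print(counts)
print(percentages.round(2))

# Bar chart of the number of patients per class
counts.plot(kind="bar")
plt.xlabel("target")
plt.ylabel("number of patients")
plt.show()
```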

As shown above, the data is slightly imbalanced: about 54% of the patients have heart disease and about 45% do not.

3.Data Cleaning & Preprocessing

In this step, I checked whether the data contains missing values, checked for duplicated rows, examined the correlations between the variables, and, as the last step in this process, converted the categorical variables to dummy variables.

3.1. Check if we have missing values

figure4: Display the number of missing values for each column
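The check itself is a one-liner in pandas:

```python
# Number of missing values per column
print(df.isnull().sum())
```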

As shown above, we don't have any missing values, so we can move on to the next step.

3.2. Check if we have duplicated values

figure5: Display the number of duplicated values and delete it
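A small sketch of the duplicate check and removal:

```python
# Count duplicated rows and drop them
print("duplicated rows:", df.duplicated().sum())   # expected: 1
df = df.drop_duplicates()
print(df.shape)
```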

We have one duplicated row, and I decided to delete it.

3.3. Visualize the correlations between the variables

figure6: Heatmap of features
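A heatmap like this can be drawn with seaborn; a minimal sketch:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation heatmap of all 14 columns
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```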

The important column we care about is the target column. Based on the heatmap, the target has a relatively strong positive correlation with slope, thalach, and cp, and a relatively strong negative correlation with ca, exang, thal, and oldpeak, so I selected these variables to build our model.

figure7: features selecting
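A possible selection step, assuming the Kaggle column names (slope, thalach, cp, ca, exang, thal, oldpeak, target):

```python
# Keep only the columns that correlate most strongly with the target
selected_columns = ["slope", "thalach", "cp", "ca", "exang", "thal", "oldpeak", "target"]
df_selected = df[selected_columns]
df_selected.head()
```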

3.4. Convert the categorical variables to dummy variables

figure8: Convert the categorical variables to dummies variables

We have three categorical variables that we need to convert before we build the model, and we also notice that the ranges of the values vary widely, so we will need to scale them.
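A sketch of the conversion with pandas get_dummies, assuming cp, thal, and slope are the three categorical variables kept after feature selection (the scaling is handled later inside the pipelines):

```python
# Convert the categorical columns to dummy variables.
# cp, thal and slope are assumed to be the three categorical features kept after selection.
df_dummies = pd.get_dummies(df_selected, columns=["cp", "thal", "slope"])
df_dummies.head()
```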

4.Build the ML models

In this step, I decided to run two models (LogisticRegression and LinearSVC) with a grid-search technique for hyperparameter tuning.

In each pipeline, we scale the features using StandardScaler, and for each model I chose two hyperparameters to tune with the grid-search technique.

figure9: LogisticRegression Pipeline
figure10: LinearSVC pipeline
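A sketch of what the two pipelines could look like; the exact hyperparameter grids used in the post are not shown, so the values below are illustrative assumptions:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

# LogisticRegression pipeline: scale the features, then tune C and the penalty
logreg_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear")),
])
logreg_params = {"clf__C": [0.01, 0.1, 1, 10], "clf__penalty": ["l1", "l2"]}
logreg_grid = GridSearchCV(logreg_pipe, logreg_params, cv=5)

# LinearSVC pipeline: scale the features, then tune C and the loss function
svc_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LinearSVC(max_iter=10000)),
])
svc_params = {"clf__C": [0.01, 0.1, 1, 10], "clf__loss": ["hinge", "squared_hinge"]}
svc_grid = GridSearchCV(svc_pipe, svc_params, cv=5)
```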

After we build the two pipelines, we are ready to train the models.

figure11: train the model
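The training step can be wrapped in a small helper; train_model is a hypothetical name used here for illustration:

```python
def train_model(grid, X_train, y_train):
    """Fit a GridSearchCV pipeline on the training data and return it."""
    grid.fit(X_train, y_train)
    return grid
```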

Now, we evaluate the models using precision and recall.

figure12: Evaluate the model
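A sketch of the evaluation helper, using scikit-learn's precision_score and recall_score (evaluate_model is a hypothetical name):

```python
from sklearn.metrics import precision_score, recall_score

def evaluate_model(model, X_test, y_test):
    """Print precision and recall for a fitted model on the test set."""
    y_pred = model.predict(X_test)
    print("precision:", round(precision_score(y_test, y_pred), 3))
    print("recall:   ", round(recall_score(y_test, y_pred), 3))
```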

Finally, we run the pipelines and return the trained models.

In this step, I split the prepared dataset into training and testing sets, trained the models, evaluated them, and returned the trained models.

figure13: Run Pipelines
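Putting the pieces together, a run_pipelines helper (hypothetical name) might look like the sketch below, assuming an 80/20 train/test split:

```python
from sklearn.model_selection import train_test_split

def run_pipelines(data):
    """Split the prepared data, train and evaluate both pipelines, return them with the test set."""
    X = data.drop("target", axis=1)
    y = data["target"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)

    models = {}
    for name, grid in [("LogisticRegression", logreg_grid), ("LinearSVC", svc_grid)]:
        print(name)
        train_model(grid, X_train, y_train)
        evaluate_model(grid, X_test, y_test)
        models[name] = grid
    return models, X_test, y_test

models, X_test, y_test = run_pipelines(df_dummies)
```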

5.Results

In this step, we run the pipelines and display the precision and recall scores for each model, return the trained models together with the testing dataset, and finally plot the confusion matrix for each model.

5.1 Precision and Recall score

figure14: Evaluate the two models using Precision and Recall score

We notice that the LinearSVC model has the highest precision and recall scores.

5.2. Confusion matrix

5.2.1. Confusion matrix of the LogisticRegression model

figure15: confusion matrix of LogisticRegression model

5.2.2. Confusion matrix of the LinearSVC model

figure16: confusion matrix of LinearSVC model
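The row-normalised confusion matrices can be plotted with a small helper like the following sketch (plot_confusion_matrix is a hypothetical name):

```python
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

def plot_confusion_matrix(model, X_test, y_test, title):
    """Plot a confusion matrix normalised over the true classes."""
    cm = confusion_matrix(y_test, model.predict(X_test)).astype(float)
    cm = cm / cm.sum(axis=1, keepdims=True)   # each row sums to 1
    sns.heatmap(cm, annot=True, fmt=".2f", cmap="Blues",
                xticklabels=["no disease", "disease"],
                yticklabels=["no disease", "disease"])
    plt.xlabel("predicted label")
    plt.ylabel("true label")
    plt.title(title)
    plt.show()

plot_confusion_matrix(models["LogisticRegression"], X_test, y_test, "LogisticRegression")
plot_confusion_matrix(models["LinearSVC"], X_test, y_test, "LinearSVC")
```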

We notice the LinearSVC model has a true positive rate of about 88%, compared to 84% for Logistic Regression, and LinearSVC has a false negative rate of about 12%, compared to 16% for Logistic Regression.

5.3 Select the best model and parameters

I decided to select the LinearSVC model because it achieved the best precision and recall scores. For this problem (heart disease) we care especially about patients who do have the disease but are classified as healthy: the false negatives. The confusion matrix of LinearSVC shows a false negative rate of 12%, compared to 16% for LogisticRegression, so we can use LinearSVC as the classifier for this problem.

Now, we can display the best model with its best parameters.

figure17: Display the best model
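With GridSearchCV, the fitted object exposes the best estimator and parameters directly; a minimal sketch:

```python
# Inspect the winning pipeline and its tuned hyperparameters
best_model = models["LinearSVC"]
print(best_model.best_estimator_)
print(best_model.best_params_)
```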

Conclusion

In this post, we built a heart disease classifier. We went through several steps of the data science process: we used the heart disease dataset from Kaggle, which includes the 14 columns we need for classification; we explored the dataset; we cleaned and preprocessed the data to prepare it for training; and then we trained two ML models, evaluated them with appropriate metrics, and selected the best model based on those metrics.

Working with a small dataset is very challenging because it affects the performance of the model, so I tried to select only the relevant features to avoid overfitting and get the best performance.

Finally, I have some suggestions to improve the performance of this model, such as detecting outliers if any exist and using techniques to handle the imbalanced dataset, such as oversampling.
