Heart Disease Classifier
Heart disease is one of the most common health conditions, and it can seriously affect a person’s life. In this post, we want to predict whether a patient has heart disease or not.
Before we start the steps of this project, we need to decide what kind of problem we have. It is a binary classification problem, where we need to classify patients as having heart disease or not, and it is a critical case. Based on these characteristics, I chose two machine learning models for classification (Logistic Regression and LinearSVC), and I decided to use three tools to evaluate their performance: precision, recall, and the confusion matrix.
We have five steps to build this project as follows:
1. Data gathering and loading
2. Data Exploration
3. Data Cleaning and Preprocessing
4. Build the ML models
5. Results
1. Data gathering and loading
I used the heart disease dataset from Kaggle; you can visit it by using this link.
This dataset has 14 columns, such as age, sex, chest pain type, and a target column that indicates whether the patient has heart disease.
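A minimal sketch of the loading step, assuming the Kaggle file has been saved locally as a CSV. The file name `heart.csv` and the exact column names (taken from the common Kaggle version of this dataset) are assumptions, and the two rows are made-up values that only keep the example runnable without the real file.

```python
import io

import pandas as pd

# Stand-in for the downloaded file. With the real data this would be
# pd.read_csv("heart.csv"), where the file name is an assumption.
# The two rows below are made-up values for illustration only.
sample_csv = io.StringIO(
    "age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target\n"
    "60,1,2,130,240,0,1,150,0,1.5,1,0,2,1\n"
    "55,0,0,140,260,0,0,120,1,2.0,1,1,3,0\n"
)
df = pd.read_csv(sample_csv)

print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # the 14 column names
```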
2. Data Exploration
In this step, I explored the dataset to learn more about it.
I started with the target variable, which is the one we need in order to classify the patients.
As shown above, the data is slightly imbalanced: about 54% of the inputs (patients) have heart disease and about 45% do not.
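One way to get this kind of class breakdown is `value_counts(normalize=True)` in pandas. The sketch below runs on made-up labels, so its percentages are not the ones from the real dataset.

```python
import pandas as pd

# Made-up labels for illustration: 1 = heart disease, 0 = no heart disease.
df = pd.DataFrame({"target": [1, 1, 1, 0, 0, 1, 0, 1, 0, 1]})

# Share of each class as a percentage of all rows.
class_share = df["target"].value_counts(normalize=True) * 100
print(class_share.round(1))
```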
3. Data Cleaning and Preprocessing
In this step, I check whether the data has missing or duplicated values, look at the correlations between the variables, and finally convert the categorical variables to dummy variables.
3.1. Check if we have missing values
As shown above, we don’t have any missing values, so we can move on to the next step.
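A check like this can be done with `isnull().sum()`. The toy frame below contains one deliberately missing value so the output is not all zeros; the real dataset, as noted, has none.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value, for illustration only.
df = pd.DataFrame({"age": [63, 45, np.nan], "target": [1, 0, 1]})

# Number of missing values per column; all zeros means nothing to fix.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```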
3.2. Check if we have duplicated values
There is one duplicated row, and I decided to delete it.
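A short sketch of counting and dropping duplicated rows with pandas, using a made-up frame in which the last row repeats the first:

```python
import pandas as pd

# Toy frame where the last row repeats the first one (made-up values).
df = pd.DataFrame({
    "age": [50, 61, 50],
    "chol": [200, 230, 200],
    "target": [1, 0, 1],
})

# duplicated() marks rows that are identical to an earlier row.
n_duplicates = df.duplicated().sum()

# Keep the first occurrence of each row and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(df))
```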
3.3. Visualize the correlations between the variables
The column we care about most is the target column. Based on the heatmap, the target has its strongest positive correlations with slope, thalach, and cp, and its strongest negative correlations with ca, exang, thal, and oldpeak, so I will select these variables to build our model.
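The numbers behind such a heatmap come from the correlation matrix. The sketch below computes the correlation of each feature with the target on synthetic data; the data is generated so that `thalach` rises and `oldpeak` falls with the target, which means the signs are by construction and say nothing about the real dataset. The plot itself is left out.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: "thalach" is built to rise with the target and
# "oldpeak" to fall with it, so the signs below are by construction.
rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=200)
df = pd.DataFrame({
    "thalach": 140 + 20 * target + rng.normal(0, 10, size=200),
    "oldpeak": 2.0 - 1.0 * target + rng.normal(0, 0.5, size=200),
    "chol": rng.normal(240, 40, size=200),  # unrelated to the target
    "target": target,
})

# Pearson correlation of every feature with the target column.
corr_with_target = df.corr()["target"].drop("target")
print(corr_with_target.sort_values())
```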
3.4. Convert the categorical variables to dummy variables
We have three categorical variables that need to be converted before we build the model. We also notice that the ranges of the values vary widely across columns, so we need to scale them.
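A minimal sketch of the conversion with `pd.get_dummies`. The post does not name the three categorical columns, so treating `cp`, `thal`, and `slope` as categorical is an assumption made for illustration, and the rows are made up.

```python
import pandas as pd

# Made-up rows; which columns are categorical is an assumption here.
df = pd.DataFrame({
    "cp": [0, 2, 1, 3],
    "thal": [2, 3, 2, 1],
    "slope": [1, 2, 0, 1],
    "thalach": [150, 172, 120, 163],
    "target": [1, 1, 0, 0],
})
categorical_cols = ["cp", "thal", "slope"]

# One indicator column per category level; other columns pass through.
df_encoded = pd.get_dummies(df, columns=categorical_cols)
print(df_encoded.columns.tolist())
```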
4. Build the ML models
In this step, I decided to run two models (LogisticRegression and LinearSVC) and tune their hyperparameters with a grid search.
Each pipeline scales the features with StandardScaler, and for each model I chose two parameters to tune with the grid search.
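A sketch of what such pipelines could look like in scikit-learn. The helper name `build_search`, the two tuned parameters (`C` and `class_weight`), their values, the number of folds, and the `recall` scoring are all assumptions, since the post does not state them.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


def build_search(estimator):
    """Wrap an estimator in a scale-then-classify pipeline with a grid search."""
    pipeline = Pipeline([
        ("scaler", StandardScaler()),  # zero mean, unit variance per feature
        ("clf", estimator),
    ])
    # Hypothetical grid: the post does not say which two parameters were tuned.
    param_grid = {
        "clf__C": [0.01, 0.1, 1, 10],
        "clf__class_weight": [None, "balanced"],
    }
    # cv=5 and scoring="recall" are assumptions, not the author's settings.
    return GridSearchCV(pipeline, param_grid, cv=5, scoring="recall")


logreg_search = build_search(LogisticRegression(max_iter=1000))
svc_search = build_search(LinearSVC(max_iter=10000))
```

Putting the scaler inside the pipeline means it is refit on the training folds only during cross-validation, so no information from the validation fold leaks into the scaling.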
After we build the two pipelines, we are ready to train the models.
Next, we evaluate the models using precision and recall.
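As a reminder of what the two scores measure, here is a small example on made-up labels using scikit-learn's `precision_score` and `recall_score`. It contains 4 true positives, 1 false positive, and 1 false negative.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 1 = heart disease, 0 = no heart disease.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

# precision = TP / (TP + FP): share of predicted patients who are truly ill.
precision = precision_score(y_true, y_pred)
# recall = TP / (TP + FN): share of truly ill patients who were found.
recall = recall_score(y_true, y_pred)
print(precision, recall)  # 0.8 0.8
```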
Finally, we run the pipelines and return the trained models.
In this step, I split the new dataset into training and testing sets, then train the models, evaluate them, and return the trained models.
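The split, train, and evaluate sequence can be sketched as follows. Synthetic data from `make_classification` stands in for the preprocessed features, and the test size, random seeds, and parameter grid are assumptions rather than the values used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed features and target.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hold out a test set; test_size and random_state are assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
# Hypothetical grid; the post does not list the tuned values.
search = GridSearchCV(pipeline, param_grid={"clf__C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# Evaluate only on the held-out test set.
y_pred = search.predict(X_test)
print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
```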
5. Results
In this step, we run the pipelines and display the precision and recall scores for each model, then return the trained models along with the testing dataset. Finally, we plot the confusion matrix for each model.
5.1. Precision and Recall score
The LinearSVC model has the highest precision and recall scores.
5.2. Confusion matrix
5.2.1. Confusion matrix of the LogisticRegression model
5.2.2. Confusion matrix of the LinearSVC model
The LinearSVC model has a true positive rate of about 88%, compared with 84% for Logistic Regression, and a false negative rate of about 12%, compared with 16% for Logistic Regression.
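Percentages like these correspond to a row-normalized confusion matrix. A small sketch on made-up labels, using `normalize="true"` so that each row sums to 1 (the values are not the ones from the models above):

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 5 patients with heart disease (1) and 5 without (0).
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]

# normalize="true" divides each row by the number of samples in that true
# class, so the second row holds the false negative and true positive rates.
cm = confusion_matrix(y_true, y_pred, normalize="true")
tn_rate, fp_rate, fn_rate, tp_rate = cm.ravel()
print(tp_rate, fn_rate)  # 0.8 0.2
```

Because each row is normalized separately, the true positive rate and the false negative rate always add up to 100%, which matches the 88% and 12% pair reported for LinearSVC.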
5.3. Select the best model and parameters
I decided to select the LinearSVC model because it achieved the best precision and recall scores. For this problem (heart disease) in particular, the error we care most about is a patient who has the disease but is classified as not having it, which is a false negative. The confusion matrix of LinearSVC shows a false negative rate of 12%, compared with 16% for LogisticRegression, so we can use LinearSVC as the classifier for this problem.
Now, we can display the best model with its best parameters.
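With a fitted `GridSearchCV`, the winning settings are available as `best_params_` and the refitted pipeline as `best_estimator_`. The sketch below uses synthetic data and a hypothetical one-parameter grid, so the printed values are not the ones selected in this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data, for illustration only.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LinearSVC(max_iter=10000)),
])
# Hypothetical grid; the real search tuned two parameters.
search = GridSearchCV(pipeline, param_grid={"clf__C": [0.01, 0.1, 1]}, cv=5)
search.fit(X, y)

print(search.best_params_)     # best parameter combination found
print(search.best_estimator_)  # pipeline refitted with those parameters
```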
Conclusion
In this post, we built a heart disease classifier by going through several steps of the data science process. We used the heart disease dataset from Kaggle, which includes the 14 columns we need for classification. We explored the dataset, cleaned and preprocessed it to prepare it for training, then trained two ML models, evaluated them with appropriate metrics, and selected the best model based on those metrics.
Working with a small dataset is very challenging because it affects the performance of the model, so I tried to select only the relevant features to avoid overfitting and get the best performance.
Finally, I have some suggestions to improve the performance of this model, such as detecting outliers, if there are any, and using techniques for handling the imbalanced dataset, such as oversampling.