Heart Disease Prediction with AutoML (PyCaret)

Jerry John · Published in Analytics Vidhya · 8 min read · Jun 14, 2020

You can find the full code and the data set here.

1. Introduction

This is a bit different from the usual Kaggle notebooks you will see, where most models are built the traditional way. The main problem with that approach is that a large amount of time goes into data pre-processing, feature selection, model selection, hyperparameter tuning and so on. Nowadays there are many AutoML libraries that can simply be pip installed and used effectively. Much of this time-consuming work can be done with a couple of lines, and in most cases the resulting accuracy is at least as good as a model built the traditional way.

About the dataset we are using.

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The “goal” field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4.

In short, the dataset gives a number of variables along with a target condition indicating the presence or absence of heart disease.

This is for those who need a highly accurate model with minimal headache.

This is not for those who want to study each machine learning algorithm in depth.

2. Installing necessary packages

Before we move to the coding part, first install and import the necessary packages.

!pip install pycaret

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt #Data visualization
import seaborn as sns #Data visualization

pycaret : The AutoML library we are using here

numpy : Library used for linear algebra

pandas : Library used for data processing, CSV file I/O (e.g. pd.read_csv)

matplotlib : Library used for data visualization.

seaborn : Library used for data visualization.

3. Getting the data

We can load the dataset using the pandas library.

# Loading the data into the "dataset" variable
dataset = pd.read_csv("../input/heart-disease-uci/heart.csv")
# Showing first 5 rows.
dataset.head()
First 5 rows of the dataset
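
Before moving on, it can help to do a quick structural check of the data. A minimal sketch using standard pandas calls:

# Quick check: number of rows/columns and the dtype of each column
print(dataset.shape)
dataset.info()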

4. Attribute Information

It’s a clean, easy to understand set of data. However, the meaning of some of the column headers is not obvious. Here is what they mean:

age: The person’s age in years

sex: The person’s sex (1 = male, 0 = female)

cp: The chest pain experienced (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic)

trestbps: The person’s resting blood pressure (mm Hg on admission to the hospital)

chol: The person’s cholesterol measurement in mg/dl

fbs: The person’s fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)

restecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes’ criteria)

thalach: The person’s maximum heart rate achieved

exang: Exercise induced angina (1 = yes; 0 = no)

oldpeak: ST depression induced by exercise relative to rest (‘ST’ relates to positions on the ECG plot. See more here)

slope: the slope of the peak exercise ST segment (Value 1: up sloping, Value 2: flat, Value 3: down sloping)

ca: The number of major vessels (0–3)

thal: A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)

target: Heart disease (0 = no, 1 = yes)

Checking whether the data is balanced

As we can see below (roughly 45% vs 54%), the data is reasonably balanced.

This is one of the most important steps, and one that many beginners forget. If the dataset is not balanced, we cannot use the accuracy score as the final measure of the quality of our ML model.

countNoDisease = len(dataset[dataset.target == 0])
countHaveDisease = len(dataset[dataset.target == 1])
print("Percentage of Patients Haven't Heart Disease: {:.2f}%".format((countNoDisease / (len(dataset.target))*100)))
print("Percentage of Patients Have Heart Disease: {:.2f}%".format((countHaveDisease / (len(dataset.target))*100)))
Percentage of Patients Haven't Heart Disease: 45.54%
Percentage of Patients Have Heart Disease: 54.46%
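
If it is useful, the same percentages can be obtained in a single line with pandas’ value_counts (a small sketch, equivalent to the calculation above):

# Class balance in one line: share of each target value as a percentage
print(dataset['target'].value_counts(normalize=True) * 100)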

5. Visualizing the dataset (so that we can get a better idea of the data)

sns.countplot(x="target", data=dataset, palette="bwr")
plt.show()

Comparison between those with and without heart disease.

sns.countplot(x='sex', data=dataset, palette="mako_r")
plt.xlabel("Sex (0 = female, 1= male)")
plt.show()

This graph shows the number of males and females in our dataset.

countFemale = len(dataset[dataset.sex == 0])
countMale = len(dataset[dataset.sex == 1])
print("Percentage of Female Patients: {:.2f}%".format((countFemale / (len(dataset.sex))*100)))
print("Percentage of Male Patients: {:.2f}%".format((countMale / (len(dataset.sex))*100)))
Percentage of Female Patients: 31.68%
Percentage of Male Patients: 68.32%

Just checking the percentage of male and female patients

pd.crosstab(dataset.age,dataset.target).plot(kind="bar",figsize=(30,15))
plt.title('Heart Disease Frequency for Ages')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.savefig('heartDiseaseAndAges.png')
plt.show()

This graph shows the frequency of heart disease across ages.

pd.crosstab(dataset.sex,dataset.target).plot(kind="bar",figsize=(15,6),color=['#1CA53B','#AA1111' ])
plt.title('Heart Disease Frequency for Sex')
plt.xlabel('Sex (0 = Female, 1 = Male)')
plt.xticks(rotation=0)
plt.legend(["Haven't Disease", "Have Disease"])
plt.ylabel('Frequency')
plt.show()

Comparison of males and females, with and without heart disease.

plt.scatter(x=dataset.age[dataset.target==1], y=dataset.thalach[(dataset.target==1)], c="red")
plt.scatter(x=dataset.age[dataset.target==0], y=dataset.thalach[(dataset.target==0)])
plt.legend(["Disease", "Not Disease"])
plt.xlabel("Age")
plt.ylabel("Maximum Heart Rate")
plt.show()

A scatter plot of maximum heart rate against age, for people with and without heart disease.

pd.crosstab(dataset.slope,dataset.target).plot(kind="bar",figsize=(15,6),color=['#DAF7A6','#FF5733' ])
plt.title('Heart Disease Frequency for Slope')
plt.xlabel('The Slope of The Peak Exercise ST Segment ')
plt.xticks(rotation = 0)
plt.ylabel('Frequency')
plt.show()

Comparison with The Slope of The Peak Exercise ST Segment.

pd.crosstab(dataset.fbs,dataset.target).plot(kind="bar",figsize=(15,6),color=['#FFC300','#581845' ])
plt.title('Heart Disease Frequency According To FBS')
plt.xlabel('FBS - (Fasting Blood Sugar > 120 mg/dl) (1 = true; 0 = false)')
plt.xticks(rotation = 0)
plt.legend(["Haven't Disease", "Have Disease"])
plt.ylabel('Frequency of Disease or Not')
plt.show()

Comparison with FBS (Fasting Blood Sugar > 120 mg/dl) (1 = true; 0 = false).

pd.crosstab(dataset.cp,dataset.target).plot(kind="bar",figsize=(15,6),color=['#11A5AA','#AA1190' ])
plt.title('Heart Disease Frequency According To Chest Pain Type')
plt.xlabel('Chest Pain Type')
plt.xticks(rotation = 0)
plt.ylabel('Frequency of Disease or Not')
plt.show()

Comparison with Chest Pain Type

6. Preparing the data for model selection

In this step we split the dataset into two parts. The first part contains 95% of the data and is used for training and testing. The remaining 5% is held back and used to try out the final model (this data is referred to as unseen data).

data = dataset.sample(frac=0.95, random_state=786).reset_index(drop=True)
data_unseen = dataset.drop(data.index).reset_index(drop=True)

print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions ' + str(data_unseen.shape))
Data for Modeling: (288, 14)
Unseen Data For Predictions (15, 14)
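
As a side note, if you want both parts to keep exactly the same 45/55 target balance, a stratified split is an option. Here is a sketch using scikit-learn (which pycaret already depends on); data_alt and unseen_alt are just illustrative names:

# Alternative sketch: stratified 95/5 split that preserves the target balance
from sklearn.model_selection import train_test_split

data_alt, unseen_alt = train_test_split(
    dataset, test_size=0.05, stratify=dataset['target'], random_state=786)
print(data_alt.shape, unseen_alt.shape)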

7. Importing pycaret classification method

The AutoML library we are using here is pycaret. It is very easy to work with: we pass our dataset as data and the name of our target column as target. We can also set some other options, as shown below.

# Importing pycaret classification method
from pycaret.classification import *

# This is the first step of model selection
# Here data is our dataset, target is the label column (dependent variable), and session_id is a random seed for reproducibility.
exp = setup(data = data, target = 'target', session_id=1,
normalize = True,
transformation = True,
ignore_low_variance = True,
remove_multicollinearity = True, multicollinearity_threshold = 0.95 )

After this we will get a list of our columns and their inferred types; just confirm they are correct, then hit enter.
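
If any of the inferred types look wrong, setup also lets you declare them explicitly instead of relying on inference. A sketch; the column lists below are my own assumption about which columns are categorical and which are numeric:

# Sketch: explicitly declaring categorical and numeric columns in setup
exp = setup(data = data, target = 'target', session_id = 1,
            categorical_features = ['sex', 'cp', 'fbs', 'restecg',
                                    'exang', 'slope', 'ca', 'thal'],
            numeric_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak'],
            normalize = True,
            transformation = True,
            ignore_low_variance = True,
            remove_multicollinearity = True, multicollinearity_threshold = 0.95)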

8. Comparing the models

After confirming the column types, we can run our dataset through a range of ML algorithms and compare their performance, as shown below.

# This command compares different models on our dataset.
# The accuracy, F1, etc. of each model are listed in a table.
# Choose the model you want.
compare_models()

In the resulting table we can see many models with good accuracy.

Next we decide which algorithm to use and copy its code.

The codes for the different models are given below.

Logistic Regression ‘lr’

K Nearest Neighbour ‘knn’

Naive Bayes ‘nb’

Decision Tree ‘dt’

SVM (Linear) ‘svm’

SVM (RBF) ‘rbfsvm’

Gaussian Process ‘gpc’

Multi Layer Perceptron ‘mlp’

Ridge Classifier ‘ridge’

Random Forest ‘rf’

Quadratic Disc. Analysis ‘qda’

AdaBoost ‘ada’

Gradient Boosting Classifier ‘gbc’

Linear Disc. Analysis ‘lda’

Extra Trees Classifier ‘et’

Extreme Gradient Boosting ‘xgboost’

Light Gradient Boosting ‘lightgbm’

Cat Boost Classifier ‘catboost’

9. Creating the model

Now we can create the model.

# With this command we are creating a Linear Disc. Analysis model

# fold is the number of cross-validation folds you want
lda_model = create_model('lda', fold = 10)

This table shows the accuracy and other metrics for all 10 folds.

Next we tune the hyperparameters.

Tuning the hyperparameters can noticeably improve the accuracy and other metrics.

For unbalanced datasets we mainly look at the F1 score; since our dataset is balanced we can use accuracy.

10. Tuning the hyper parameters

tuned_lda = tune_model('lda', optimize='F1')
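
tune_model runs a random search over a predefined hyperparameter grid. If the default search feels too short, the number of iterations and folds can be raised (a sketch; the values below are arbitrary choices, not recommendations):

# Sketch: a longer random search over the LDA hyperparameter grid
tuned_lda = tune_model('lda', optimize = 'F1', fold = 10, n_iter = 50)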

11. Plotting the ROC Curves

The further the curve rises towards the top-left corner (away from the diagonal), the better the performance.

plot_model(tuned_lda, plot = 'auc')

12. Confusion Matrix

The more predictions fall on the diagonal (true positives and true negatives), the better the model’s performance.

plot_model(tuned_lda, plot = 'confusion_matrix')

13. Checking the accuracy on the hold-out test set

We get an accuracy of 0.8506 and an F1 score of 0.8632.

predict_model(tuned_lda);

14. Checking with the unseen data

Earlier we set aside part of the dataset as unseen data for checking the final model. Below we do that check. The result is a data frame with Label and Score as the last two columns, where Label is the predicted class and Score is the model’s estimated probability that the person has heart disease.

new_prediction = predict_model(tuned_lda, data=data_unseen)
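
Since data_unseen still contains the true target column, we can also score these 15 predictions directly against it, using the Label column mentioned above (a sketch with scikit-learn’s accuracy_score; the cast to int is just a guard in case Label comes back as a string):

# Sketch: accuracy on the unseen rows, comparing the predicted Label to target
from sklearn.metrics import accuracy_score
print(accuracy_score(data_unseen['target'],
                     new_prediction['Label'].astype(int)))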

15. Summary

As we saw above, we got a model with around 86% accuracy and no sign of overfitting. AutoML is attractive because it saves a lot of time and gives very good results. Hyperparameter tuning is not easy for less experienced people, but it can make a big difference in model performance.

As no one is perfect, if you find any errors or have suggestions, please feel free to comment below.
