Predicting Heart Disease Using Machine Learning

Pulkit Khandelwal
Published in Analytics Vidhya
7 min read · Aug 19, 2021

This article uses various Python-based ML and Data Science libraries to build a Machine Learning model that predicts whether or not someone has heart disease based on their medical attributes.

We’re going to take the following approach:

1. Problem Definition

2. Retrieving Data

3. Understanding Features

4. Data Preparation and its tools

5. Exploratory Data Analysis

6. Modelling

7. Model Evaluation

1. Problem Definition

Given clinical parameters about a patient, can we predict whether the patient has heart disease or not?

We aim for a model accuracy of more than 85%. If a model clears that threshold, we will select it.

2. Retrieving Data

The original data is the Cleveland dataset from the UCI Machine Learning Repository. A version of it is also available on Kaggle.
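As a minimal sketch of this step, the data can be read into a pandas DataFrame. The file name `heart-disease.csv` and the helper `load_heart_data` are assumptions (use whatever name your download has), and the three rows below are made-up values that only exist so the snippet runs without the file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded CSV. These rows are invented
# for illustration and are not taken from the real dataset.
SAMPLE_CSV = """age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
52,1,0,130,250,0,1,160,0,1.0,1,0,2,1
61,0,2,140,300,1,0,145,1,2.0,2,1,1,0
45,1,1,120,210,0,1,170,0,0.0,0,0,2,1
"""


def load_heart_data(source):
    """Read the heart disease table from a file path or file-like object."""
    return pd.read_csv(source)


# With the real file: df = load_heart_data("heart-disease.csv")  # assumed name
df = load_heart_data(io.StringIO(SAMPLE_CSV))

print(df.shape)         # (rows, columns)
print(df.isna().sum())  # number of missing values per column
```

Checking the shape and the missing-value counts right after loading is a cheap way to confirm the file was parsed as expected.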

3. Understanding Features

1. age: displays the age of the individual.

2. sex: displays the gender of the individual using the following format :

• 1 = male

• 0 = female

3. cp (Chest-Pain Type): displays the type of chest-pain experienced by the individual using the following format :

• 0 = typical angina

• 1 = atypical angina

• 2 = non-anginal pain

• 3 = asymptomatic

4. trestbps (Resting Blood Pressure): displays the resting blood pressure of the individual in mmHg

5. chol (Serum Cholesterol): displays the serum cholesterol in mg/dl

6. fbs (Fasting Blood Sugar): compares an individual's fasting blood sugar value with 120mg/dl.

• If fasting blood sugar > 120mg/dl then : 1 (true) else : 0 (false)

7. restecg (Resting ECG): displays resting electrocardiographic results :

• 0 = normal

• 1 = having ST-T wave abnormality

• 2 = left ventricular hypertrophy

8. thalach (Max Heart Rate Achieved): displays the maximum heart rate achieved by the individual.

9. exang (Exercise induced angina):

• 1 = yes

• 0 = no

10. oldpeak (ST depression induced by exercise relative to rest): displays the value as an integer or float.

11. slope (Slope of the peak exercise ST segment) :

• 0 = upsloping

• 1 = flat

• 2 = downsloping

12. ca (Number of major vessels (0–3) colored by fluoroscopy): displays the value as an integer or float.

13. thal: displays the thalassemia status (thalassemia is an inherited blood disorder that causes the body to have less hemoglobin than normal) :

• 0 = normal

• 1 = fixed defect

• 2 = reversible defect

14. target (Diagnosis of heart disease): displays whether the individual has heart disease or not :

• 0 = absent

• 1 = present
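The integer codes in the list above are compact but hard to read in tables and plots. Here is a small sketch of decoding two of them with pandas and of applying the fbs rule. The dictionaries copy the codes from the list, while the helper names `decode_features` and `encode_fbs` and the sample rows are illustrative assumptions.

```python
import pandas as pd

# Code-to-label mappings copied from the feature list.
SEX = {0: "female", 1: "male"}
CHEST_PAIN = {
    0: "typical angina",
    1: "atypical angina",
    2: "non-anginal pain",
    3: "asymptomatic",
}


def encode_fbs(blood_sugar_mg_dl):
    """Return 1 if fasting blood sugar is above 120 mg/dl, else 0."""
    return int(blood_sugar_mg_dl > 120)


def decode_features(df):
    """Return a copy of df with the sex and cp codes replaced by labels."""
    out = df.copy()
    out["sex"] = out["sex"].map(SEX)
    out["cp"] = out["cp"].map(CHEST_PAIN)
    return out


# Made-up rows, for illustration only.
sample = pd.DataFrame({"sex": [1, 0], "cp": [0, 2]})
readable = decode_features(sample)

print(readable)
print(encode_fbs(135), encode_fbs(110))  # 1 0
```

The model itself should still be trained on the numeric codes; the decoded copy is only meant for inspection and plotting.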

4. Data Preparation and its tools

Pandas & Numpy for Data Analysis and Manipulation

Matplotlib and Seaborn for Data Visualisation

Scikit-Learn for the Modelling and Evaluation

5. Exploratory Data Analysis
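A couple of typical first questions in this step are how balanced the two target classes are and how the target relates to a single feature. The sketch below shows one way to ask them with pandas. The six rows are made up for illustration; the real analysis would run on the full DataFrame.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "sex":    [1, 1, 0, 0, 1, 0],
    "target": [1, 0, 1, 1, 0, 1],
})

# How many rows fall into each class (1 = present, 0 = absent)?
class_counts = df["target"].value_counts()
print(class_counts)

# How does the target break down by sex (rows: target, columns: sex)?
by_sex = pd.crosstab(df["target"], df["sex"])
print(by_sex)
```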

6. Modelling

We will experiment with 3 different models, collect their results, and compare them.

Now that our data is split into training and test sets, it is time to build a Machine Learning model.

We will train it (find the patterns) on the training set and test it (use the patterns) on the test set.
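The split itself can be sketched with scikit-learn's `train_test_split`. The synthetic data from `make_classification`, the 80/20 ratio, and the random seed are assumptions chosen for illustration, not necessarily the settings used in the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart disease table: 13 features, binary target.
X, y = make_classification(n_samples=300, n_features=13, random_state=42)

# Hold out 20% of the rows for testing (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

Keeping the test rows untouched until the end is what makes the test score an honest estimate of how the model handles unseen patients.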

We’re going to try 3 different Machine Learning models:

1. Logistic Regression

2. K-Nearest Neighbours Classifier

3. Random Forest Classifier
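One way to compare the three models is to fit each on the training set and score it on the test set. This is a sketch under assumptions: the data is synthetic, the hyperparameters are scikit-learn defaults apart from `max_iter`, and the helper `fit_and_score` is a hypothetical name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real data.
X, y = make_classification(n_samples=300, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=42),
}


def fit_and_score(models, X_train, X_test, y_train, y_test):
    """Fit every model on the training set and return its test accuracy."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)  # mean accuracy
    return scores


model_scores = fit_and_score(models, X_train, X_test, y_train, y_test)
print(model_scores)
```

The scores printed here describe the synthetic data only and say nothing about how the models rank on the real heart disease data.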

Now we have a baseline model, and we know a model’s first predictions aren’t always final.

What should we base our next steps on?

Let’s look at the following:

  • HyperParameter tuning
  • Feature Importance
  • Confusion Matrix
  • Cross-Validation
  • Precision
  • Recall
  • F1-Score
  • Classification Report
  • ROC Curve
  • Area under the curve (AUC)
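Hyperparameter tuning, the first item in the list, can be sketched with scikit-learn's `GridSearchCV`. The choice of Logistic Regression, the grid of `C` values, the 5 folds, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real data.
X, y = make_classification(n_samples=300, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate regularisation strengths (assumed grid).
param_grid = {"C": np.logspace(-3, 3, 7)}

# Try every candidate with 5-fold cross-validation on the training set.
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
test_accuracy = search.score(X_test, y_test)  # uses the refitted best model
print(test_accuracy)
```

The search only ever sees the training set, so the final test score is still measured on rows the tuned model has not seen.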

We got an accuracy of approximately 89%, which is above our 85% target. Therefore, we will select the model.

Evaluating our tuned Machine Learning classifier beyond accuracy

  • ROC curve and AUC score
  • Confusion matrix
  • Classification report
  • precision
  • recall
  • f1-score

  • Cross-validation

To make comparisons and evaluate our trained model, we first need to make predictions.
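Making predictions can look like the sketch below. It produces both hard labels, which the confusion matrix and classification report need, and class-1 probabilities, which the ROC curve needs. The model choice and the synthetic data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data.
X, y = make_classification(n_samples=300, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_preds = model.predict(X_test)              # hard 0/1 labels
y_probs = model.predict_proba(X_test)[:, 1]  # probability of class 1

print(y_preds[:5])
print(y_probs[:5])
```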

1. ROC curve and AUC score
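The ROC curve plots the true positive rate against the false positive rate at different decision thresholds, and the AUC summarises the curve in a single number. The sketch below computes both by hand on four made-up predictions to show what the numbers mean. The helper names are hypothetical; in practice scikit-learn's `roc_curve` and `roc_auc_score` do this work.

```python
def roc_point(y_true, y_score, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    tp = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s >= threshold)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return fp / neg, tp / pos


def auc_by_pairs(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

print(roc_point(y_true, y_score, 0.5))  # (0.0, 0.5)
print(auc_by_pairs(y_true, y_score))    # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.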

2. Confusion Matrix
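A confusion matrix counts where the predictions agree and disagree with the true labels. The hand-rolled sketch below uses made-up labels and follows the same layout as scikit-learn's `confusion_matrix` for binary labels: rows are true classes and columns are predicted classes. The helper name is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary 0/1 labels."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1  # disease present and predicted
        elif truth == 0 and pred == 0:
            tn += 1  # disease absent and not predicted
        elif truth == 0 and pred == 1:
            fp += 1  # false alarm
        else:
            fn += 1  # missed case
    return [[tn, fp], [fn, tp]]


# Made-up labels for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

matrix = confusion_counts(y_true, y_pred)
print(matrix)  # [[3, 1], [1, 3]]
```

For a medical problem like this one, the false negatives (bottom-left cell) deserve particular attention because they are patients with heart disease that the model missed.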

3. Classification Report
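The classification report is built from precision, recall, and F1-score, all of which follow from the confusion matrix counts. The sketch below computes them from made-up counts; the helper name is hypothetical, and scikit-learn's `classification_report` prints the same metrics per class.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion matrix counts."""
    precision = tp / (tp + fp)  # of the predicted positives, how many are right
    recall = tp / (tp + fn)     # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Made-up counts: 8 true positives, 2 false positives, 4 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)

print(round(precision, 3))  # 0.8
print(round(recall, 3))     # 0.667
print(round(f1, 3))         # 0.727
```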

4. Cross-Validation

  • Accuracy
  • Precision
  • F1-Score
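Cross-validated versions of these metrics can be sketched with scikit-learn's `cross_val_score`, which refits the model on several train/validation splits and returns one score per split. The 5 folds, the Logistic Regression model, and the synthetic data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data.
X, y = make_classification(n_samples=300, n_features=13, random_state=42)

clf = LogisticRegression(max_iter=1000)

# Average each metric over 5 folds.
cv_metrics = {
    metric: cross_val_score(clf, X, y, cv=5, scoring=metric).mean()
    for metric in ("accuracy", "precision", "f1")
}

for metric, value in cv_metrics.items():
    print(f"{metric}: {value:.3f}")
```

Averaging over folds gives a steadier estimate than a single train/test split, which can look better or worse depending on which rows happen to land in the test set.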

Comparison of Cross-Validation Classification Metrics

Feature Importance

Feature Importance is another way of asking, “which features contributed most to the outcomes of the model and how did they contribute?”
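For a linear model such as Logistic Regression, one common way to answer this question is to look at the fitted coefficients: the sign shows the direction of the contribution and the magnitude its strength. The sketch below assumes that approach. The column names come from the feature list in this article, but the data is synthetic, so the printed ranking says nothing about the real features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

feature_names = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal",
]

# Synthetic stand-in with one column per feature name.
X, y = make_classification(
    n_samples=300, n_features=len(feature_names), random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Pair each feature with its coefficient for class 1.
importance = dict(zip(feature_names, model.coef_[0]))

# Rank by absolute size: larger magnitude means a stronger influence.
ranked = sorted(importance.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, coef in ranked:
    print(f"{name:>10}: {coef:+.3f}")
```

Coefficients are only comparable across features when the features are on similar scales, so scaling the inputs first is worth considering before reading too much into the ranking.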

Conclusion

Heart disease is one of the major concerns for society today. It is difficult to manually determine the odds of getting heart disease based on risk factors. However, using Machine Learning, we can quickly predict whether a person is likely to have heart disease. Fast and accurate classification of heart disease can help doctors provide proper treatment to patients and save lives.

Happy Learning :)

GitHub link: https://github.com/pulkitkhandelwal29/Heart-Disease-Classification
