Are you a heart patient? This is for you! Heart Disease Prediction using Data Science!

Viv's, Exploring 360° ;)
13 min read · May 28, 2023


Hi! In this post, I would like to share my experience with a data analysis that can be used to predict HEART DISEASE from a common set of core risk factors.

This analysis is one of my learning projects, since I want to enhance my analytical skill set. I would like to express my deep gratitude to Professor Satish K. Choudhary for his patient guidance, enthusiastic encouragement and useful critiques.

According to the World Health Organization, more than 12 million deaths occur worldwide every year due to heart disease. Heart disease is one of the biggest causes of morbidity and mortality among the population of the world. Prediction of cardiovascular disease is regarded as one of the most important subjects in the field of data analysis.

The burden of cardiovascular disease has been increasing rapidly all over the world over the past few years. Much research has been conducted in an attempt to pinpoint the most influential factors of heart disease as well as to accurately predict the overall risk. Heart disease is even described as a silent killer, leading to death without obvious symptoms. The early diagnosis of heart disease plays a vital role in making decisions on lifestyle changes in high-risk patients and, in turn, reduces complications.

Even though heart disease can occur in different forms, there is a common set of core risk factors that influence whether someone will ultimately be at risk for heart disease or not. By collecting data from various sources, classifying it under suitable headings and finally analysing it to extract the desired information, we can draw conclusions. This technique can be adapted very well to the prediction of heart disease. As the well-known quote says, “Prevention is better than cure”: early prediction and its control can help prevent and decrease the death rates due to heart disease.

Heart Disease Dataset:

Dataset source: https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset

Dataset columns:

  • age: The person’s age in years
  • sex: The person’s sex (1 = male, 0 = female)
  • cp: chest pain type
    — Value 0: asymptomatic
    — Value 1: atypical angina
    — Value 2: non-anginal pain
    — Value 3: typical angina
  • trestbps: The person’s resting blood pressure (mm Hg on admission to the hospital)
  • chol: The person’s cholesterol measurement in mg/dl
  • fbs: The person’s fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)
  • restecg: resting electrocardiographic results
    — Value 0: showing probable or definite left ventricular hypertrophy by Estes’ criteria
    — Value 1: normal
    — Value 2: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
  • thalach: The person’s maximum heart rate achieved
  • exang: Exercise induced angina (1 = yes; 0 = no)
  • oldpeak: ST depression induced by exercise relative to rest (‘ST’ relates to positions on the ECG plot. See more here)
  • slope: the slope of the peak exercise ST segment — 0: downsloping; 1: flat; 2: upsloping
  • ca: The number of major vessels (0–3)
  • thal: A blood disorder called thalassemia
    — Value 0: NULL (dropped from the dataset previously)
    — Value 1: fixed defect (no blood flow in some part of the heart)
    — Value 2: normal blood flow
    — Value 3: reversible defect (a blood flow is observed but it is not normal)
  • target: Heart disease (1 = no, 0= yes)

Importing Necessary Libraries

Data Handling and Plotting Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import cufflinks as cf
%matplotlib inline

Metrics for Classification technique

from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

Scaler and Model Selection

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV, train_test_split

Model building

from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

Data Loading

Here we will be using the pandas read_csv function to read the dataset. Specify the location of the dataset and import it.
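As a minimal sketch of this step (the file name and path here are assumptions; point it at wherever the Kaggle CSV was saved), the DataFrame is loaded as data, which is the name used in the plots below:

data = pd.read_csv('heart.csv')  # hypothetical path to the downloaded Kaggle file
data.head()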

Age(“age”) Analysis

Here we will be checking the 10 most frequent ages and their counts.

plt.figure(figsize=(25,12))
sns.set_context('notebook',font_scale = 1.5)
sns.barplot(x=data.age.value_counts()[:10].index,y=data.age.value_counts()[:10].values)
plt.tight_layout()

Output:

Inference: Here we can see that age 58 has the highest frequency.

Let’s check the range of age in the dataset.

minAge=min(data.age)
maxAge=max(data.age)
meanAge=data.age.mean()
print('Min Age :',minAge)
print('Max Age :',maxAge)
print('Mean Age :',meanAge)

Output:

Min Age : 29 Max Age : 77 Mean Age : 54.366336633663366

We should divide the Age feature into three parts — “Young”, “Middle” and “Elder”

Young = data[(data.age>=29)&(data.age<40)]
Middle = data[(data.age>=40)&(data.age<55)]
Elder = data[(data.age>55)]
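The plotting code for the age-group comparison is not shown in the original post; a simple sketch that just compares the sizes of the three groups defined above could look like this:

plt.figure(figsize=(10,6))
sns.set_context('notebook',font_scale = 1.2)
# bar heights are simply the number of rows in each age band
sns.barplot(x=['young ages','middle ages','elderly ages'], y=[len(Young),len(Middle),len(Elder)])
plt.tight_layout()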

Output:

Inference: Here we can see that elderly people are the most affected by heart disease and young people are the least affected.

To prove the above inference, we will plot a pie chart.

colors = ['blue','green','yellow']
explode = [0,0,0.1]
plt.figure(figsize=(10,10))
sns.set_context('notebook',font_scale = 1.2)
plt.pie([len(Young),len(Middle),len(Elder)],labels=['young ages','middle ages','elderly ages'],explode=explode,colors=colors, autopct='%1.1f%%')
plt.tight_layout()

Output:

Sex(“sex”) Feature Analysis

plt.figure(figsize=(18,9))
sns.set_context('notebook',font_scale = 1.5)
sns.countplot(x='sex', data=data)
plt.tight_layout()

Output:

Inference: Here it is clearly visible that the ratio of male to female is approximately 2:1.

Now let’s plot the relation between sex and slope.

plt.figure(figsize=(18,9))
sns.set_context('notebook',font_scale = 1.5)
sns.countplot(x='sex', hue='slope', data=data)
plt.tight_layout()

Output:

Inference: Here it is clearly visible that the slope value is higher in the case of males (sex = 1).

Chest Pain Type(“cp”) Analysis

plt.figure(figsize=(18,9))
sns.set_context('notebook',font_scale = 1.5)
sns.countplot(x='cp', data=data)
plt.tight_layout()

Output:

Inference: As seen, there are 4 categories of chest pain, ranging from least to most distressing:

  1. least distressing
  2. slightly distressing
  3. moderately distressing
  4. most severe

Analyzing cp vs target column
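The plot for this comparison is not reproduced here; a sketch in the same style as the other plots, splitting the cp counts by target, could be:

plt.figure(figsize=(18,9))
sns.set_context('notebook',font_scale = 1.5)
sns.countplot(x='cp', hue='target', data=data)  # chest pain type split by heart-disease label
plt.tight_layout()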

Inference: From the above graph we can make some inferences,

  • People having the least chest pain are not likely to have heart disease.
  • People having severe chest pain are likely to have heart disease.

Elderly people are more likely to have chest pain.

Thal Analysis

plt.figure(figsize=(18,9))
sns.set_context('notebook',font_scale = 1.5)
sns.countplot(x='thal', data=data)
plt.tight_layout()

Output:

Target

plt.figure(figsize=(18,9))
sns.set_context('notebook',font_scale = 1.5)
sns.countplot(x='target', data=data)
plt.tight_layout()

Output:

Inference: The ratio between classes 1 and 0 is much less than 1.5, which indicates that the target feature is not imbalanced. So, for a balanced dataset like this, we can use accuracy_score as the evaluation metric for our model.
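A quick way to check this ratio directly (a small sketch, not part of the original code) is to inspect the class counts:

counts = data['target'].value_counts()
print(counts)
print('Majority/minority ratio:', counts.max() / counts.min())  # well below 1.5 for this dataset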

The working of the system starts with the collection of data and the selection of important attributes. Then the required data is preprocessed into the required format.

The data is then divided into two parts: training and testing data. The algorithms are applied and the model is trained using the training data. The accuracy of the system is obtained by testing the system using the testing data. This system is implemented using the following modules.

  1. Collection of Dataset
  2. Selection of attributes
  3. Data Pre-Processing
  4. Balancing of Data
  5. Disease Prediction

Collection of dataset -

Initially, we collect a dataset for our heart disease prediction system. After the collection of the dataset, we split it into training data and testing data. The training dataset is used for prediction-model learning and the testing data is used for evaluating the prediction model. For this project, 70% of the data is used for training and 30% for testing. The dataset used for this project is Heart Disease UCI. The dataset consists of 76 attributes, out of which 14 attributes are used for the system.
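As a sketch of the 70/30 split with the train_test_split helper imported earlier (the random_state here is an assumption, not a value from the original post):

X = data.drop('target', axis=1)   # the 13 predictor attributes
y = data['target']                # the label to predict
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)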

Selection of attributes-

Attribute or Feature selection includes the selection of appropriate attributes for the prediction system. This is used to increase the efficiency of the system.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('/content/drive/My Drive/dataset/heart.csv')
df.head()

Various attributes of the patient like gender, chest pain type, fasting blood sugar, serum cholesterol, exang, etc. are selected for the prediction. The correlation matrix is used for attribute selection for this model.

Correlations

Correlation Matrix: lets you see the correlations between all variables.

Within seconds, you can see whether something is positively or negatively correlated with our predictor (target).
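The heatmap itself is not reproduced here; a minimal sketch of how the correlation matrix can be plotted with seaborn:

plt.figure(figsize=(14,10))
sns.heatmap(data.corr(), annot=True, fmt='.2f', cmap='coolwarm')  # pairwise correlations of all columns
plt.tight_layout()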

We can see there is a positive correlation between chest pain (cp) and target (our predictor). This makes sense since a greater amount of chest pain results in a greater chance of having heart disease. cp (chest pain) is an ordinal feature with 4 values: Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic.

In addition, we see a negative correlation between exercise-induced angina (exang) and our predictor. This makes sense because when you exercise, your heart requires more blood, but narrowed arteries slow down blood flow.

Preprocessing Data

After getting the dataset, the next step is preprocessing. Preprocessing aims to turn raw data into clean data that is ready for use. Data preprocessing can consist of many things: changing the data type of a column, filling or otherwise handling empty columns, deleting duplicate records, and so on.

• age: The person's age in years
• sex: The person's sex (1 = male, 0 = female)
• cp: The chest pain experienced (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic)
• trestbps: The person's resting blood pressure (mm Hg on admission to the hospital)
• chol: The person's cholesterol measurement in mg/dl
• fbs: The person's fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)
• restecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
• thalach: The person's maximum heart rate achieved
• exang: Exercise induced angina (1 = yes; 0 = no)
• oldpeak: ST depression induced by exercise relative to rest ('ST' relates to positions on the ECG plot. See more here)
• slope: the slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)
• ca: The number of major vessels (0-3)
• thal: A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversable defect)
• target: Heart disease (0 = no, 1 = yes)

Preprocessing is important for transitioning raw data into a more desirable format. Undergoing the preprocessing process helps with completeness and believability. For instance, you can see whether certain values were recorded or not, and how trustworthy the data is. It also helps with finding how consistent the values are. We need preprocessing because most real-world data is dirty: data can be noisy, i.e. it can contain outliers or just errors in general, and data can also be incomplete, i.e. there can be missing values.
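A few quick pandas checks along these lines (a sketch, not the full preprocessing used here):

print(df.dtypes)              # data type of each column
print(df.isnull().sum())      # missing values per column
print(df.duplicated().sum())  # number of duplicate rows
df = df.drop_duplicates()     # remove exact duplicates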

What Is Data Imbalance?

Data imbalance usually reflects an unequal distribution of classes within a dataset. For example, in a credit card fraud detection dataset, most of the credit card transactions are not fraud and only a very few are fraudulent. This leaves us with something like a 50:1 ratio between the non-fraud and fraud classes.

Balancing of Data (Resampling Technique)-

Imbalanced datasets can be balanced in two ways. They are Under Sampling and Over Sampling.

(a) Under Sampling: In under sampling, the dataset is balanced by reducing the size of the abundant class. This approach is considered when the amount of data is adequate.

(b) Over Sampling: In over sampling, the dataset is balanced by increasing the size of the scarce class. This approach is considered when the amount of data is inadequate.

To code this in Python, I use a library called imbalanced-learn (imblearn).
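A minimal sketch with imblearn's RandomUnderSampler, assuming X and y hold the features and target:

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print(y_res.value_counts())   # both classes now have the same count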

After undersampling the dataset, I plot it again and it shows an equal number of classes:

Learning paradigm-

● Supervised learning is the type of machine learning in which machines are trained using well-labelled training data, and on the basis of that data, machines predict the output. Labelled data means the input data is already tagged with the correct output. In supervised learning, the training data provided to the machine works as a supervisor that teaches the machine to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.

Supervised learning is a process of providing input data as well as correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function that maps the input variables (x) to the output variable (y).

● Unsupervised learning cannot be directly applied to a regression or classification problem because, unlike supervised learning, we have the input data but no corresponding output data.

The goal of unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.

  • Unsupervised learning is helpful for finding useful insights from the data.
  • Unsupervised learning is very similar to how a human learns to think from their own experiences, which makes it closer to real AI.
  • Unsupervised learning works on unlabelled and uncategorised data, which makes it all the more important.
  • In the real world, we do not always have input data with corresponding output, so to solve such cases we need unsupervised learning.

Prediction of Disease -

Various machine learning algorithms like SVM, Naive Bayes, Decision Tree, Random Forest, Logistic Regression, Ada-boost and XG-boost are used for classification. Comparative analysis is performed among the algorithms, and the algorithm that gives the highest accuracy is used for heart disease prediction.

LOGISTIC REGRESSION ALGORITHM

We are using logistic regression, one of the most popular algorithms, which comes under the supervised learning technique.

It is used for predicting the categorical dependent variable using a given set of independent variables. Logistic regression predicts the output of a categorical dependent variable; therefore the outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.

Logistic regression is very similar to linear regression except in how they are used. Linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems.

In logistic regression, instead of fitting a regression line, we fit an “S”-shaped logistic function, which predicts two maximum values (0 or 1). The curve from the logistic function indicates the likelihood of something, such as whether cells are cancerous or not, or whether a mouse is obese or not based on its weight. Logistic regression is a significant machine learning algorithm because it has the ability to provide probabilities and to classify new data using continuous and discrete datasets.
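A sketch of training such a model on scaled features; LogisticRegression is not among the imports above, so it is added here, and the hyper-parameters are assumptions rather than the ones used in the original experiment:

from sklearn.linear_model import LogisticRegression

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit the scaler on training data only
X_test_scaled = scaler.transform(X_test)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_scaled, y_train)
y_pred_lr = log_reg.predict(X_test_scaled)
print('Logistic Regression accuracy:', accuracy_score(y_test, y_pred_lr))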

XGBOOST ALGORITHM

XG-boost is a software library that was designed primarily to improve speed and model performance. In this algorithm, decision trees are created in sequential form.

Weights play an important role in XG-boost. Weights are assigned to all the observations, which are then fed into a decision tree that predicts results. The weights of observations predicted wrongly by the tree are increased, and these observations are then fed to the second decision tree.

These individual classifiers/predictors are then combined to give a stronger and more precise model. It can work on regression, classification, ranking and user-defined prediction problems.

Regularization: XG-boost has built-in L1 (Lasso regression) and L2 (Ridge regression) regularization, which prevents the model from overfitting.

That is why XG-boost is also called a regularized form of GBM (Gradient Boosting Machine). When using the scikit-learn API, we pass two hyper-parameters (alpha and lambda) to XG-boost related to regularization.

alpha is used for L1 regularization and lambda is used for L2 regularization.

Parallel Processing: XG-boost utilizes the power of parallel processing and that is why it is much faster than GBM. It uses multiple CPU cores to execute the model. When using the scikit-learn API, the nthread hyper-parameter is used for parallel processing; nthread represents the number of CPU cores to be used. If you want to use all the available cores, don't specify a value for nthread and the algorithm will detect it automatically.
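A sketch of fitting the XGBClassifier imported earlier; the hyper-parameter values below are placeholders, not the tuned values behind the result reported later:

xgb = XGBClassifier(n_estimators=100, learning_rate=0.1,
                    reg_alpha=0, reg_lambda=1,   # L1 and L2 regularization terms
                    n_jobs=-1)                   # use all CPU cores (plays the role of nthread)
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)
print('XGBoost accuracy:', accuracy_score(y_test, y_pred_xgb))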

PERFORMANCE ANALYSIS

In this project, machine learning algorithms like Logistic Regression and XG-boost are used to predict heart disease.

The accuracy of each algorithm is measured, and whichever algorithm gives the best accuracy is considered for heart disease prediction. For evaluating the experiment, various evaluation metrics like accuracy, confusion matrix, precision, recall and F1-score are considered.

Accuracy: Accuracy is the ratio of the number of correct predictions to the total number of inputs in the dataset. It is expressed as: Accuracy = (TP + TN) / (TP + FP + FN + TN)

Confusion Matrix: It gives us a matrix as output and summarises the overall performance of the system.
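With the metrics imported at the top of the post, the evaluation of any fitted model can be sketched as follows (shown here for the XG-boost predictions from the sketch above):

print(confusion_matrix(y_test, y_pred_xgb))        # TP, TN, FP, FN counts
print(classification_report(y_test, y_pred_xgb))   # precision, recall, F1-score
print('Accuracy:', accuracy_score(y_test, y_pred_xgb))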

RESULT

After training and testing with the machine learning approach, we find that the accuracy of XG-boost is better compared to the other algorithms. Accuracy is calculated with the support of the confusion matrix of each algorithm, where the counts of TP, TN, FP and FN are given; using the accuracy equation, the value is calculated, and it is concluded that extreme gradient boosting is best with 81% accuracy, whereas Logistic Regression's accuracy is 79.1%.

I will be very happy to discuss and accept any suggestions about the analysis, since I'm still learning and still have a long way to go. Please reach me through https://www.linkedin.com/in/thakurvivashwat/. Thank you so much!

References —

https://www.analyticsvidhya.com/blog/2022/02/heart-disease-prediction-using-machine-learning/

https://medium.com/towards-data-science/heart-disease-uci-diagnosis-prediction-b1943ee835a7

Stay Awake!
V.
