Using Machine Learning Algorithms on the Heart Failure Clinical Records Dataset

Ammar J Alashhab
10 min read · Feb 2, 2022


What is Heart Failure?

Heart failure develops when the muscles in the heart wall weaken and the heart enlarges, reducing its ability to pump blood. The heart's ventricles may also grow stiff and stop filling properly between beats. Over time, the heart can no longer supply enough blood to meet the body's needs, and the individual begins to have breathing difficulties.

This study looked at the survival of heart failure patients admitted to the Institute of Cardiology and Allied Hospital in Faisalabad, Pakistan, between April and December 2015.

The aims of this project fall into two sections:

1) Theoretical Section:

  • What is the CRoss Industry Standard Process for Data Mining
  • Understand the concept of predictive analytics

2) Application Section:

  • Data Understanding
  • Data Preprocessing
  • Modeling (two predictive analytics algorithms)
  • Evaluation

What is the CRoss Industry Standard Process?

CRISP-DM (the CRoss Industry Standard Process for Data Mining) is a six-phase process model that captures the data science life cycle. Published in 1999, it was intended to standardize data mining procedures across sectors, and it functions as a set of guidelines to help you plan, organize, and implement a data science (or machine learning) project. The six phases are Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.

What is predictive analytics?

Predictive analytics is a subset of advanced analytics that uses historical data together with statistical modeling, data mining techniques, and machine learning to make predictions about future events. Businesses use predictive analytics to uncover patterns in their data and identify risks and opportunities. Several kinds of models are used in predictive analytics, including the following:

1- Clustering model: This approach groups data based on its shared properties. It works by classifying objects or people based on similar characteristics or behaviors and then strategizing on a bigger scale for each group.

2- Time series model: This model is used to analyze a time series of data points. For instance, the number of stroke patients admitted to the hospital in the preceding four months is used to forecast the number of patients expected to be admitted next week, next month, or for the remainder of the year.

3- Classification model: classification is the process of determining the class of given data points. Classes are sometimes referred to as targets, labels, or categories. Predictive modeling for classification means estimating a mapping function (f) from input variables (X) to discrete output variables (y).

The analytical model runs one or more algorithms on the data set that will be used to make the forecast, depending on the model. Because it involves training the models, this is a time-consuming and iterative procedure; several models may be tried on the same data set until one that suits the business objectives is found.
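Here is a minimal sketch of that iterate-and-compare idea, using scikit-learn's toy data generator rather than the heart failure dataset analyzed later in this post (the data and the 5-fold setup are illustrative assumptions, not part of this project):

# Try several candidate models on the same dataset and compare them
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
for model in (KNeighborsClassifier(n_neighbors=5), GaussianNB()):
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validated accuracy
    print(type(model).__name__, round(scores.mean(), 3))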

In the application section, the Python programming language is used in the Google Colab environment. As for libraries, Pandas and NumPy are used to read and manipulate the data, Seaborn and Matplotlib for the EDA plots, and scikit-learn (sklearn) for the algorithms.

Let’s start doing it step by step!

1- Data Understanding

As previously stated, this project utilized data from heart failure patients hospitalized to the Institute of Cardiology and Allied Hospital in Faisalabad, Pakistan, between April and December (2015). The data is provided in a CSV file that was obtained from the UCI Machine Learning Repository.

First of all we will import the needed libraries and read the dataset CSV file.

# Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Read the dataset and take a first look
data_0 = pd.read_csv("heart_failure_clinical_records_dataset.csv")
data_0.columns
data_0.shape
data_0.head()

The dataset contains 299 rows and 13 columns. Each instance represents a patient, and the attributes represent clinical features.

Thirteen (13) clinical features

The heart failure clinical records dataset includes both numeric and categorical variables.
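As a quick check (not part of the original write-up), the numeric and binary columns can be separated by looking at the dtypes and the number of unique values per column:

# Column dtypes and non-null counts
data_0.info()
# Binary columns (anaemia, diabetes, high_blood_pressure, sex, smoking, DEATH_EVENT)
# have only two unique values in the UCI CSV
data_0.nunique().sort_values()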

-Let's get overall statistics about the dataset

data_0.describe().transpose()
Overall Statistics

-Visualizing the data helps give a better understanding of the dataset (a seaborn sketch that approximates these plots appears after the observations below)

Dashboard 1
Dashboard 2

By reviewing the previous charts, we can observe that:

  • The age distribution appears roughly normal with a right skew; the youngest age is 40 years and the most frequent ages are between 55 and 60.
  • Proportion of anemia: patients without anemia outnumber those with anemia, at 57%.
  • The creatinine_phosphokinase distribution is concentrated between 0 and 1,000, with some outliers.
  • Proportion of diabetes: non-diabetics outnumber diabetics, at 58%.
  • The ejection_fraction distribution appears roughly normal, with the most frequent values between 30 and 40.
  • High_blood_pressure chart: patients with high blood pressure (around 110) are fewer than those without (around 190).
  • The platelets distribution is concentrated between 200,000 and 400,000, with some outliers.
  • The serum_creatinine distribution is concentrated between 0 and 2, with some outliers.
  • The serum_sodium distribution appears roughly normal with a left skew; the most frequent values are between 135 and 140.
  • Proportion of sex: women outnumber men, at about 65%.
  • Proportion of smoking: non-smokers outnumber smokers, at 68%.
  • Proportion of death event: patients who survived outnumber those who died, at 68%.
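For readers who want to reproduce plots like these inside the notebook, here is a rough seaborn sketch (an approximation of the dashboards above, not the code that produced them; the column names are those used in the UCI CSV):

# Histograms for the numeric features and count plots for the binary ones
num_cols = ['age', 'creatinine_phosphokinase', 'ejection_fraction',
            'platelets', 'serum_creatinine', 'serum_sodium']
bin_cols = ['anaemia', 'diabetes', 'high_blood_pressure', 'sex', 'smoking', 'DEATH_EVENT']
fig, axes = plt.subplots(4, 3, figsize=(18, 16))
for ax, col in zip(axes.flat, num_cols + bin_cols):
    if col in num_cols:
        sns.histplot(data_0[col], ax=ax)          # distribution of a numeric feature
    else:
        sns.countplot(x=col, data=data_0, ax=ax)  # counts of a binary feature
    ax.set_title(col)
plt.tight_layout()
plt.show()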

2- Data Preprocessing

Prior to running any data mining algorithms, pre-processing is needed.

-Detecting Null Values

We can see that there are 15 null values in the "ejection_fraction" column.

data_0.isnull().sum()
Null Values

Here we will fill the null values with the mean value of that column (“ejection_fraction”)

# Treat blank strings as missing values
data_0.replace(" ", np.nan, inplace=True)
# Mean of the column (cast to float in case blanks forced an object dtype)
avg_ejection_fraction = data_0['ejection_fraction'].astype('float').mean(axis=0)
print("Average of ejection_fraction:", avg_ejection_fraction)
# Impute the missing values with the column mean
data_0["ejection_fraction"].replace(np.nan, avg_ejection_fraction, inplace=True)
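An equivalent and slightly more idiomatic pandas way to do the same imputation (just an alternative, not what the notebook above uses) is fillna:

# Same imputation with fillna
data_0["ejection_fraction"] = data_0["ejection_fraction"].fillna(avg_ejection_fraction)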

Now we can see that there are no null values anymore.

Null Values After Filling

Next, we build a heatmap to present the correlation between all the dataset attributes.

# Correlation analysis (heat map)
corrMatt = data_0.corr()
# Mask the upper triangle so each correlation is shown only once
mask = np.zeros_like(corrMatt, dtype=bool)
mask[np.triu_indices_from(mask, k=1)] = True
fig, ax = plt.subplots()
fig.set_size_inches(20, 10)
sns.heatmap(corrMatt, mask=mask, vmax=.8, square=True, annot=True)
Heatmap

The heat map above shows that the strongest correlation with DEATH_EVENT is the follow-up time, at -0.53 (a negative correlation). Serum creatinine, ejection fraction, and age follow, with correlations of about 0.29, 0.26, and 0.25 in absolute value.
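The same ranking can be read directly from the correlation matrix without the plot; a small convenience snippet (not part of the original code) is:

# Correlation of every attribute with the target, ordered by absolute strength
corr_with_target = data_0.corr()['DEATH_EVENT'].drop('DEATH_EVENT')
print(corr_with_target.reindex(corr_with_target.abs().sort_values(ascending=False).index))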

Next we separate the data into training and test sets, an important preprocessing step in this project.

Firstly, we split the inputs from the output, where the output is the "DEATH_EVENT" column.

# Input features: everything except the target
data_1 = data_0.drop('DEATH_EVENT', axis=1)
data_1 = np.array(data_1, dtype=int)  # note: casting to int truncates fractional values (e.g. serum_creatinine)
# Output (target)
target_1 = data_0['DEATH_EVENT']
target_1 = np.array(target_1, dtype=int)

Secondly, we split the data into training and test sets: 80% for training and 20% for testing.

#Calling some extra libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize
# splitting the data for training and testing
x_train, x_test, y_train, y_test = train_test_split(data_1, target_1, test_size=0.2, random_state=4)
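As a side note, train_test_split also accepts a stratify argument; a stratified split (not used in this project) would keep the roughly 68% survived / 32% died balance identical in both subsets:

# Alternative split that preserves the class balance of DEATH_EVENT (illustrative only)
x_train_s, x_test_s, y_train_s, y_test_s = train_test_split(
    data_1, target_1, test_size=0.2, random_state=4, stratify=target_1)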

As a final preprocessing step, we scale the dataset. Here sklearn's normalize is applied, which rescales each sample (row) to unit L2 norm; since all the feature values are non-negative, the resulting values fall between zero and one (0–1), which is intended to improve accuracy.

x_train = normalize(x_train)
x_test = normalize(x_test)
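It is worth noting that normalize works row-wise rather than scaling each column to a fixed range. If column-wise 0–1 scaling is what is intended, MinMaxScaler is the usual tool; a sketch of that alternative (applied to the raw split, and not what this project uses) would be:

# Alternative: column-wise scaling to [0, 1], fitted on the training set only
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
x_train_mm = scaler.fit_transform(x_train)   # learn per-column min/max from training data
x_test_mm = scaler.transform(x_test)         # apply the same transform to the test data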

3- Modeling

Because the target variable is categorical, classification is the right type of model for our goal. In this part, two classification algorithms will be applied:

  • Applying K nearest neighbor (KNN)

First of all we tried K = 4, and the test accuracy was 0.75.

# I will start the algorithm with k=4 for now:
k = 4
#Train Model and Predict
neigh = KNeighborsClassifier(n_neighbors = k).fit(x_train,y_train)
neigh
# use the model to make predictions on the test set
yhat = neigh.predict(x_test)
yhat[0:5]
# Accuracy evaluation
from sklearn import metrics
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(x_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))

Then we used the following code to find the K value that gives the best accuracy.

# Checking the best value of K
Ks = 10
mean_acc = np.zeros((Ks-1))
std_acc = np.zeros((Ks-1))
for n in range(1, Ks):
    # Train the model and predict for each K
    neigh = KNeighborsClassifier(n_neighbors=n).fit(x_train, y_train)
    yhat = neigh.predict(x_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)
    std_acc[n-1] = np.std(yhat == y_test) / np.sqrt(yhat.shape[0])
mean_acc

# Plot the model accuracy for different numbers of neighbors
plt.plot(range(1, Ks), mean_acc, 'g')
plt.fill_between(range(1, Ks), mean_acc - 1 * std_acc, mean_acc + 1 * std_acc, alpha=0.10)
plt.fill_between(range(1, Ks), mean_acc - 3 * std_acc, mean_acc + 3 * std_acc, alpha=0.10, color="green")
plt.legend(('Accuracy ', '+/- 1xstd', '+/- 3xstd'))
plt.ylabel('Accuracy ')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()

From the graph above we can clearly see that K = 5 gives the best accuracy, at 78.33%.

Let's check that:

#Checking the accuracy of k=5
k = 5
neigh6 = KNeighborsClassifier(n_neighbors = k).fit(x_train,y_train)
yhat6 = neigh6.predict(x_test)
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh6.predict(x_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat6))
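One caveat: choosing K by test-set accuracy lets the test set influence model selection. A common alternative (not part of the original analysis) is to tune K with cross-validation on the training data only, for example:

# Cross-validated search for K on the training set (a sketch, not the procedure used above)
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': list(range(1, 10))}, cv=5)
grid.fit(x_train, y_train)
print(grid.best_params_, round(grid.best_score_, 3))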
  • Applying Naive Bayes Algorithm

We apply the following code for Naive Bayes classification

from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

Then we run the following code to get the accuracy:

from sklearn.metrics import confusion_matrix,accuracy_score
ac = accuracy_score(y_test,y_pred)
print("Test set Accuracy: " ,ac)

So the test accuracy of the Naive Bayes algorithm is 71.66%.

4- Evaluation

In order to evaluate the algorithms’ performance, we will create a confusion matrix and classification report for each method that has been applied to the dataset.

a) KNN Results

First of all we will apply the confusion matrix code for KNN

cnf_matrix = confusion_matrix(y_test, yhat6, labels=[1,0])
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['DEATH_EVENT=1','DEATH_EVENT=0'],normalize= False, title='Confusion matrix')
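Note that plot_confusion_matrix here is not a scikit-learn function with this signature; it is a custom plotting helper that is not shown in the article. A minimal sketch of such a helper, adapted from the well-known scikit-learn documentation example (an assumption, not the author's exact code), would look like this:

# Assumed helper used by the confusion matrix snippets in this section
import itertools
def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]  # row-normalize if requested
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment='center',
                 color='white' if cm[i, j] > thresh else 'black')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')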

According to the confusion matrix above, the total number of predictions is 60, which matches the size of the test set.

Of these 60 test patients, 18 actually died (label 1) and 42 survived (label 0):

  • 8 of the 18 death cases were predicted correctly, while 10 were misclassified.
  • 39 of the 42 survival cases were predicted correctly, while 3 were misclassified.

Secondly, we build the classification report with the following code:

from sklearn.metrics import classification_report
print(classification_report(y_test, yhat6))
  • Precision is a measure of the accuracy provided that a class label has been predicted. It is defined by precision = TP / (TP + FP)
  • Recall is the true positive rate. It is defined as: Recall = TP / (TP + FN).
  • The F1 score is the harmonic average of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0.

We can say the average accuracy for this classifier is the weighted average of the F1-scores for both labels, which is 0.77 in our case.
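As a sanity check, the 0.77 figure can be reproduced by hand from the confusion-matrix counts above (illustrative arithmetic, not output from the notebook):

# Deaths (label 1): TP = 8, FN = 10, FP = 3; survivals (label 0): TP = 39, FN = 3, FP = 10
p1, r1 = 8 / (8 + 3), 8 / (8 + 10)            # precision and recall for label 1
p0, r0 = 39 / (39 + 10), 39 / (39 + 3)        # precision and recall for label 0
f1_1 = 2 * p1 * r1 / (p1 + r1)                # about 0.55
f1_0 = 2 * p0 * r0 / (p0 + r0)                # about 0.86
weighted_f1 = (18 * f1_1 + 42 * f1_0) / 60    # about 0.77, matching the report
print(round(f1_1, 2), round(f1_0, 2), round(weighted_f1, 2))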

b) Naive Bayes Results

Again we will apply the confusion matrix code for Naive Bayes

cnf_matrix = confusion_matrix(y_test, y_pred, labels=[1,0])
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['DEATH_EVENT=1','DEATH_EVENT=0'],normalize= False, title='Confusion matrix')

According to the confusion matrix above, the total number of predictions is again 60, matching the test set.

As before, 18 of the test patients actually died (label 1) and 42 survived (label 0):

  • 4 of the 18 death cases were predicted correctly, while 14 were misclassified.
  • 39 of the 42 survival cases were predicted correctly, while 3 were misclassified.

Then the classification report for Naive Bayes:

print (classification_report(y_test, y_pred))

We can say the average accuracy for this classifier is the weighted average of the F1-scores for both labels, which is 0.67 in our case.

In conclusion...

In this study, we worked with the Heart Failure Clinical Records dataset and followed the CRISP-DM process to complete the work. We used two classification algorithms to predict the occurrence of death. The average accuracy (weighted F1) of the KNN algorithm is 77 percent, whereas that of the Naive Bayes algorithm is 67 percent. We can therefore conclude that, on this data, the KNN algorithm produces better results and more accurate predictions.
