Developing And Deploying A Heart Disease Classification App Using PyWebIO

Kamen Damov
7 min read · Jul 17, 2022


Introduction

Heart diseases are the leading cause of death in the world. In 2019, they represented 32% of global deaths (source: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)). In this article, I will go over the development of a tool that uses machine learning to predict whether a patient is at higher risk of heart disease.

Data

Here’s the Kaggle page for the data: https://www.kaggle.com/datasets/cherngs/heart-disease-cleveland-uci.

First let’s see the head of the data-frame:

All the values are stored as numbers, but some columns are actually categorical. The publisher of this dataset has documented the categorical columns and the meaning of each value. For instance, sex = 1 is male and sex = 0 is female; for condition, 0 = lower chance of heart disease and 1 = higher chance of heart disease, and so on.

Data exploration

Let’s see if there is any missing data using the missingno library:

import missingno as msno
msno.matrix(df)

No missing data. Let’s now see how the data for continuous columns is distributed:

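import matplotlib.pyplot as plt
import seaborn as sns

to_viz = ['trestbps', 'chol', 'thalach', 'oldpeak']
for v in to_viz:
    #histplot with a KDE overlay replaces the deprecated distplot
    sns.histplot(df[v], kde=True)
    plt.show()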

The plots suggest that the data is roughly normally distributed for every column except oldpeak.
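A histogram is a visual check; a quick numeric complement is the skewness of each column. The sketch below uses two made-up samples rather than the real data, just to show how the number behaves. Skewness close to 0 only indicates a symmetric shape, so it is a hint and not a proof of normality.

```python
import pandas as pd

# Made-up samples: one symmetric, one with a long right tail
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7])
right_tailed = pd.Series([0, 0, 0, 0.5, 1, 2, 6])

# Skewness near 0 suggests a symmetric shape;
# a large positive value means a long tail to the right
print("symmetric:", symmetric.skew())
print("right-tailed:", right_tailed.skew())
```

On the real data-frame, the same idea would be df[to_viz].skew(), assuming df and to_viz are defined as above.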

Let’s see if we have a balanced sample for the y categories:

df.groupby('condition').count()

Fairly balanced! Next, let’s split the data-set into categorical and numerical columns, so we can produce dummy columns through one-hot encoding.

Feature Engineering

One-hot encoding replaces a categorical column with one new column per unique value in that column. For each data point (or row), the column matching its category gets a 1 and the columns for the other categories get a 0:
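As a minimal sketch of the idea, here is pd.get_dummies applied to a toy column. The four values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy column with three chest-pain codes (illustrative values only)
toy = pd.DataFrame({"cp": [0, 2, 1, 2]})

# One new 0/1 column per unique value, named <prefix>_<value>
dummies = pd.get_dummies(toy["cp"], prefix="cp", dtype=int)
print(dummies)
```

Each row of the result contains exactly one 1, in the column that matches its original code.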

Let’s divide categorical and numerical columns. As the categorical values just have an integer that represents a class, we will “hard code” the categories based on the information we have:

#Create dummies for categorical columns
catColumns=['sex','cp','fbs','restecg','exang','slope','ca','thal']
new_to_produce = []
for col in catColumns:
    new_to_produce.append(pd.get_dummies(df[col], drop_first=False, prefix=col, dtype=int))
#create a full dataframe of dummies
dataLog = pd.concat(new_to_produce, axis = 1)

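We now need to add the continuous attributes, along with the target column, to our data-frame:

to_viz = ['trestbps', 'chol', 'thalach', 'oldpeak']
#also keep the target column, which is split off again before training
data = pd.concat([dataLog, df[to_viz], df['condition']], axis=1)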

Great! Our data is now ready for classification.

Machine Learning

Let’s perform our train/test splits:

from sklearn.model_selection import train_test_split

X = data.drop('condition', axis =1)
X.to_csv('dummy')
y = data['condition']
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,
                                                    stratify=y)

As you can see, I saved the X data-frame to a CSV file. This will be useful when developing the app, because we will only need to match the input data to the columns of our X data-frame.

Let’s test out two classification models and see which one performs best. We will first fit each model with its default parameters, then tune its parameters with a grid search, and finally re-fit the tuned models. Here’s the code:

#Create a pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

classifiers = {
    'rfc' : RandomForestClassifier(),
    'log' : LogisticRegression()
}
for clas in classifiers:
    print('Testing un-tuned : ', clas)
    classifiers[clas].fit(X_train, y_train)
    class_pred = classifiers[clas].predict(X_test)
    print('Confusion Matrix')
    print(confusion_matrix(y_test,class_pred))
    print('Classification Report')
    print(classification_report(y_test,class_pred))
    print('Accuracy: ', accuracy_score(y_test,class_pred)*100)
    print('Tuning the parameters')
    if clas == 'rfc':
        params = {
            "n_estimators" : [2, 5, 10,100, 200],
            #"auto" is gone from recent scikit-learn; for classifiers it meant "sqrt"
            "max_features" : ["sqrt", "log2"],
            "max_depth" : [2, 4, 8, 12, 15],
            "min_samples_split" : [2,4,8],
            "bootstrap": [True, False],
        }
        grid_search = GridSearchCV(estimator = classifiers[clas],
                                   param_grid = params,
                                   cv = 2,
                                   verbose=2)
        grid_search.fit(X_train, y_train)
        print("Best params RFC")
        print(grid_search.best_params_)
        bestestim = grid_search.best_estimator_
        print('Test accuracy: %.3f' % bestestim.score(X_test, y_test))
    else:
        #the default lbfgs solver only supports 'l2' or no penalty
        #(None, spelled 'none' in older scikit-learn releases); C must be > 0
        parameters = [{'penalty':['l2', None]},
                      {'C': [0.1, 0.01, 0.001, 1.0, 3.0, 6.0, 10.0, 50.0, 100.0]}]
        #tune the logistic regression from the dictionary (logmod was never defined)
        grid_search = GridSearchCV(estimator = classifiers[clas],
                                   param_grid = parameters,
                                   scoring = 'accuracy',
                                   cv = 10,
                                   verbose=0)
        grid_search.fit(X_train, y_train)
        print('Best parameters logistic regression')
        print(grid_search.best_params_)
        bestestim = grid_search.best_estimator_
        print('Test accuracy: %.3f' % bestestim.score(X_test, y_test))

Output for the un-tuned random forest classifier:

Best parameters after grid search:

Output for the un-tuned logistic regression:

Best parameters to keep:

Now let’s test these models with the best parameters:

#Re-fitting with tuned parameters
classifiers = [
    RandomForestClassifier(bootstrap = False, max_depth = 8, max_features = 'sqrt', min_samples_split = 8, n_estimators = 100),
    LogisticRegression(C=6.0)
]
for clas in classifiers:
    print(clas)
    clas.fit(X_train, y_train)
    pred = clas.predict(X_test)
    print('Confusion Matrix')
    print(confusion_matrix(y_test,pred))
    print('Classification Report')
    print(classification_report(y_test,pred))
    print('Accuracy: ', accuracy_score(y_test,pred)*100)

For Random Forest Classifier:

For logistic regression:

Logistic regression scored better, so here is a quick word on how it works.

Logistic regression, which is built on the sigmoid function, is often used for binary classification problems. Let’s explain this model using the graph below.

Much like a classical linear regression, logistic regression fits a curve to the data. That said, a point on the curve isn’t the output value itself, but rather the probability of a data point belonging to one of the two categories. Let’s say I studied 5 hours for a test: looking at the graph, the probability of passing the test should be around 0.75 or 0.8. As 0.75 > 0.5 (0.5 being the classification threshold), I will be classified in the test passed category. Hurray! In the same manner, if I studied 2 hours, the probability of passing the test will be around 0.15. Given that 0.15 < 0.5, I will be classified in the failed category. Bummer.
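To make the studying example concrete, here is a small sketch of the sigmoid and the 0.5 threshold. The weight and intercept are hypothetical: they were picked only so that the probabilities land near the values read off the graph, and they are not fitted to any data.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen to roughly match the graph (assumptions)
WEIGHT = 1.0        # change in log-odds per hour studied
INTERCEPT = -3.75   # log-odds of passing with zero hours studied
THRESHOLD = 0.5     # classification threshold

def predict_pass(hours: float):
    """Return the probability of passing and the resulting class."""
    prob = sigmoid(WEIGHT * hours + INTERCEPT)
    return prob, prob > THRESHOLD

for hours in (2, 5):
    prob, passed = predict_pass(hours)
    print(f"{hours} hours -> p(pass) = {prob:.2f}, passed = {passed}")
```

The model first computes a linear score, exactly like a linear regression, and the sigmoid then squeezes that score into a probability that is compared with the threshold.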

Let’s pickle the logistic regression we trained, and tuned, and start developing our app interface.

#Pickle the tuned logistic regression (second entry of the classifiers list above)
import pickle
with open('logmod.pkl', 'wb') as f:
    pickle.dump(classifiers[1], f)

App Development

We will use PyWebIO, a Python library for building web applications, to develop our heart disease app. Let’s first import the needed libraries:

# WebApp starts here
import numpy as np
import pandas as pd
from pywebio.input import *
from pywebio.output import *
from pywebio.session import *
from pywebio.platform import *
from pywebio.platform.flask import webio_view
from pywebio import STATIC_PATH, start_server
from flask import Flask, send_from_directory
import argparse

Then, let’s import our model and initialize our app:

#load the model
import pickle
model = pickle.load(open('logmod.pkl', 'rb'))
app = Flask(__name__)

As mentioned above, let’s load the data-frame that includes the dummy columns. We will append the user’s input to it, so that the input ends up with exactly the columns the logistic regression model was fitted on.

#To get the dummy columns
dummy = pd.read_csv('dummy')
dummy.drop('Unnamed: 0', axis = 1, inplace = True)
dummyCols = dummy.columns

Now, let’s create our data fields where the user can input data:

Given that our data-set contains only floats and integers, even for the categorical columns, we need a more user-friendly interface: dictionaries that map readable labels to the numeric codes the model expects. Here’s a sample of the created dictionaries:
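A rough sketch of what such dictionaries could look like is shown here. The names SEX_CODES, FBS_CODES and encode_choice are hypothetical and not taken from the app. Only the sex coding is stated earlier in this article; the fasting blood sugar coding is an assumption that should be checked against the dataset page.

```python
# Hypothetical label-to-code dictionaries (names are illustrative)
SEX_CODES = {"Female": 0, "Male": 1}   # coding stated in the dataset description
FBS_CODES = {"No": 0, "Yes": 1}        # assumed: fasting blood sugar > 120 mg/dl

def encode_choice(label: str, codes: dict) -> int:
    """Translate the label a user picked into the integer the model was trained on."""
    try:
        return codes[label]
    except KeyError:
        raise ValueError(f"Unknown option: {label!r}") from None

print(encode_choice("Male", SEX_CODES))
```

The keys of each dictionary could then serve as the options of a PyWebIO select or radio field, while the values are what goes into the data-frame.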

Now let’s input the data from the user in an empty data-set:

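catColumns=['sex','cp','fbs','restecg','exang','slope','ca','thal']
to_viz = ['trestbps', 'chol', 'thalach', 'oldpeak']
#build the full column list without modifying catColumns,
#which must stay categorical-only for the dummy step
allColumns = catColumns + to_viz
input_df = pd.DataFrame(columns = allColumns)
#input_data holds the user's answers, in the same order as allColumns
input_df.loc[0] = input_data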

Now, let’s create the dummy columns, to have an identical input as the training process.

#Creating dummy col
new_input_to_produce = []
for col in catColumns:
    new_input_to_produce.append(pd.get_dummies(input_df[col], drop_first=False, prefix=col, dtype=int))
new_data1 = pd.concat(new_input_to_produce, axis = 1)
#Continuous columns to add
new_input_to_produce2 = []
for v in to_viz:
    new_input_to_produce2.append(input_df[v])
new_data2 = pd.concat(new_input_to_produce2, axis = 1)
#Creating the dataframe
final_data = pd.concat([new_data1, new_data2], axis = 1)
final_data = pd.concat([dummy, final_data])
final_data = final_data.fillna(0)
#keep only the training columns, in the training order
final_data = final_data[dummyCols]
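The reason for this step is that a single row only produces dummy columns for the values it actually contains, so the missing ones have to be filled with 0. The sketch below shows the same alignment on a toy example. Note that it uses DataFrame.reindex instead of the concat-and-fillna approach above, and that the column list and the user's values are made up.

```python
import pandas as pd

# Hypothetical subset of the training columns (the real list comes from the saved X)
train_cols = ["sex_0", "sex_1", "cp_0", "cp_1", "cp_2", "cp_3", "chol"]

# One user's answers: sex code 1, chest-pain code 2, cholesterol 240 (made-up values)
user = pd.DataFrame({"sex": [1], "cp": [2], "chol": [240]})

# A single row only yields the dummy columns for the values it contains
encoded = pd.get_dummies(user, columns=["sex", "cp"], dtype=int)

# reindex adds the missing training columns as 0 and enforces the training order
aligned = encoded.reindex(columns=train_cols, fill_value=0)
print(aligned)
```

Either way, the goal is the same: the row passed to the model must have the same columns, in the same order, as the data it was trained on.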

Super! Now we just need to predict on this new input, and present the output to the user.

if model.predict(final_data.tail(1).to_numpy()) == [1]:
    popup("You have a higher risk of heart disease (accuracy: 91%)")
else:
    popup("You have a lower risk of heart disease (accuracy: 91%)")

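Final step: the deployment code differs depending on whether you want to deploy the app on a server or run it locally. In both cases, heart is the function that PyWebIO serves; it is assumed here to wrap the input and prediction code above.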

To run the app locally:

#Run app locally
app.add_url_rule('/WebApp','webio_view',webio_view(heart),
                 methods=['GET','POST','OPTIONS'])
if __name__ == '__main__':
    #then open http://localhost/WebApp in a browser
    app.run('localhost',port=80)

To deploy on a server:

#Deploy app in Heroku
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("-p","--port", type=int, default=8080)
    args = parser.parse_args()
    start_server(heart, port = args.port)

OK! Let’s see what the app looks like.

Nice! Here’s the link to test it out!

Thank you for reading! I hope you enjoyed it!


Kamen Damov

Mathematics and Computer Science student at University of Montreal.