Heart Disease — Logistic Regression and Other Classification Algorithms (Machine Learning)
INTRODUCTION
We have a dataset that labels whether patients have heart disease or not according to the features it contains. We will use this data to build a model that predicts whether a patient has the disease. We will start with the logistic regression (classification) algorithm.
# Let's import libraries we will use
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
Read Data
# We are reading our data
df = pd.read_csv("../input/heart.csv")

# First 5 rows of our data
df.head()
The data contains the following features:
* age — age in years
* sex — (1 = male; 0 = female)
* cp — chest pain type
* trestbps — resting blood pressure (in mm Hg on admission to the hospital)
* chol — serum cholesterol in mg/dl
* fbs — (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
* restecg — resting electrocardiographic results
* thalach — maximum heart rate achieved
* exang — exercise induced angina (1 = yes; 0 = no)
* oldpeak — ST depression induced by exercise relative to rest
* slope — the slope of the peak exercise ST segment
* ca — number of major vessels (0–3) colored by fluoroscopy
* thal — 3 = normal; 6 = fixed defect; 7 = reversible defect
* target — have disease or not (1=yes, 0=no)
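Before exploring the data, it helps to confirm that these columns loaded with the expected types and that nothing is missing. A minimal sketch, assuming the heart.csv file loaded above:

# Quick structural check of the dataset
df.info()                   # column names, dtypes and non-null counts
print(df.isnull().sum())    # missing values per column
print(df.shape)             # number of rows and columns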
DATA EXPLORATION
df.target.value_counts()
sns.countplot(x="target", data=df, palette="bwr")
plt.show()
countNoDisease = len(df[df.target == 0])
countHaveDisease = len(df[df.target == 1])
print("Percentage of Patients Haven't Heart Disease: {:.2f}%".format((countNoDisease / (len(df.target))*100)))
print("Percentage of Patients Have Heart Disease: {:.2f}%".format((countHaveDisease / (len(df.target))*100)))
sns.countplot(x='sex', data=df, palette="mako_r")
plt.xlabel("Sex (0 = female, 1= male)")
plt.show()
countFemale = len(df[df.sex == 0])
countMale = len(df[df.sex == 1])
print("Percentage of Female Patients: {:.2f}%".format((countFemale / (len(df.sex))*100)))
print("Percentage of Male Patients: {:.2f}%".format((countMale / (len(df.sex))*100)))
df.groupby('target').mean()
pd.crosstab(df.age,df.target).plot(kind="bar",figsize=(20,6))
plt.title('Heart Disease Frequency for Ages')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.savefig('heartDiseaseAndAges.png')
plt.show()
pd.crosstab(df.sex,df.target).plot(kind="bar",figsize=(15,6),color=['#1CA53B','#AA1111' ])
plt.title('Heart Disease Frequency for Sex')
plt.xlabel('Sex (0 = Female, 1 = Male)')
plt.xticks(rotation=0)
plt.legend(["Haven't Disease", "Have Disease"])
plt.ylabel('Frequency')
plt.show()
plt.scatter(x=df.age[df.target==1], y=df.thalach[(df.target==1)], c="red")
plt.scatter(x=df.age[df.target==0], y=df.thalach[(df.target==0)])
plt.legend(["Disease", "Not Disease"])
plt.xlabel("Age")
plt.ylabel("Maximum Heart Rate")
plt.show()
pd.crosstab(df.slope,df.target).plot(kind="bar",figsize=(15,6),color=['#DAF7A6','#FF5733' ])
plt.title('Heart Disease Frequency for Slope')
plt.xlabel('The Slope of The Peak Exercise ST Segment ')
plt.xticks(rotation = 0)
plt.ylabel('Frequency')
plt.show()
pd.crosstab(df.fbs,df.target).plot(kind="bar",figsize=(15,6),color=['#FFC300','#581845' ])
plt.title('Heart Disease Frequency According To FBS')
plt.xlabel('FBS - (Fasting Blood Sugar > 120 mg/dl) (1 = true; 0 = false)')
plt.xticks(rotation = 0)
plt.legend(["Haven't Disease", "Have Disease"])
plt.ylabel('Frequency of Disease or Not')
plt.show()
pd.crosstab(df.cp,df.target).plot(kind="bar",figsize=(15,6),color=['#11A5AA','#AA1190' ])
plt.title('Heart Disease Frequency According To Chest Pain Type')
plt.xlabel('Chest Pain Type')
plt.xticks(rotation = 0)
plt.ylabel('Frequency of Disease or Not')
plt.show()
Creating Dummy Variables
Since ‘cp’, ‘thal’ and ‘slope’ are categorical variables, we’ll turn them into dummy variables.
a = pd.get_dummies(df['cp'], prefix = "cp")
b = pd.get_dummies(df['thal'], prefix = "thal")
c = pd.get_dummies(df['slope'], prefix = "slope")

frames = [df, a, b, c]
df = pd.concat(frames, axis = 1)
df = df.drop(columns = ['cp', 'thal', 'slope'])
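After the concatenation, the frame has one indicator column per category level. A quick check of what was created (a sketch run on the df built above):

# Inspect the new dummy columns created for cp, thal and slope
print([c for c in df.columns if c.startswith(("cp_", "thal_", "slope_"))])
df.head()

One design note: pd.get_dummies also accepts drop_first=True, which drops one level per variable and avoids perfectly collinear indicator columns; here we keep all levels, as in the code above.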
CREATING MODEL FOR LOGISTIC REGRESSION
We can use the sklearn library, or we can write the functions ourselves. Let's do both: first we will write our own functions, and after that we'll use sklearn to calculate the score.
y = df.target.values
x_data = df.drop(['target'], axis = 1)
Normalize Data
# Normalize
x = (x_data - np.min(x_data)) / (np.max(x_data) - np.min(x_data)).values
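The same min-max scaling can also be done with sklearn. A minimal sketch of that alternative, assuming the same x_data frame as above:

from sklearn.preprocessing import MinMaxScaler

# Scale every feature to the [0, 1] range, keeping the DataFrame structure
scaler = MinMaxScaler()
x_alt = pd.DataFrame(scaler.fit_transform(x_data), columns = x_data.columns)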
We will split our data. 80% of our data will be train data and 20% of it will be test data.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)

# transpose matrices so that each column is one sample
x_train = x_train.T
y_train = y_train.T
x_test = x_test.T
y_test = y_test.T
Let’s initialize every weight to 0.01 and the bias to 0.0.
# initialize
def initialize(dimension):
    weight = np.full((dimension,1), 0.01)
    bias = 0.0
    return weight, bias
Sigmoid Function
def sigmoid(z):
    y_head = 1/(1 + np.exp(-z))
    return y_head
Forward and Backward Propagation

Cost function (m = number of training samples):

cost = -(1/m) * Σ [ y^(i) * log(y_head^(i)) + (1 - y^(i)) * log(1 - y_head^(i)) ]

Gradients:

derivative_weight = (1/m) * x_train (y_head - y_train)^T
derivative_bias = (1/m) * Σ (y_head^(i) - y^(i))

In these formulas:
* y_head = h(x^(i)), the sigmoid output
* y^(i) = y_train
* x^(i) = x_train
def forwardBackward(weight, bias, x_train, y_train):
    # Forward
    y_head = sigmoid(np.dot(weight.T, x_train) + bias)
    loss = -(y_train*np.log(y_head) + (1-y_train)*np.log(1-y_head))
    cost = np.sum(loss) / x_train.shape[1]

    # Backward
    derivative_weight = np.dot(x_train, ((y_head-y_train).T)) / x_train.shape[1]
    derivative_bias = np.sum(y_head-y_train) / x_train.shape[1]
    gradients = {"Derivative Weight": derivative_weight, "Derivative Bias": derivative_bias}

    return cost, gradients

def update(weight, bias, x_train, y_train, learningRate, iteration):
    costList = []
    index = []

    # for each iteration, update weight and bias values
    for i in range(iteration):
        cost, gradients = forwardBackward(weight, bias, x_train, y_train)
        weight = weight - learningRate * gradients["Derivative Weight"]
        bias = bias - learningRate * gradients["Derivative Bias"]
        costList.append(cost)
        index.append(i)

    parameters = {"weight": weight, "bias": bias}

    print("iteration:", iteration)
    print("cost:", cost)

    plt.plot(index, costList)
    plt.xlabel("Number of Iteration")
    plt.ylabel("Cost")
    plt.show()

    return parameters, gradients

def predict(weight, bias, x_test):
    z = np.dot(weight.T, x_test) + bias
    y_head = sigmoid(z)

    y_prediction = np.zeros((1, x_test.shape[1]))
    # if y_head is greater than 0.5, predict 1 (disease); otherwise predict 0
    for i in range(y_head.shape[1]):
        if y_head[0,i] <= 0.5:
            y_prediction[0,i] = 0
        else:
            y_prediction[0,i] = 1
    return y_prediction

def logistic_regression(x_train, y_train, x_test, y_test, learningRate, iteration):
    dimension = x_train.shape[0]
    weight, bias = initialize(dimension)

    parameters, gradients = update(weight, bias, x_train, y_train, learningRate, iteration)

    y_prediction = predict(parameters["weight"], parameters["bias"], x_test)

    print("Manual Test Accuracy: {:.2f}%".format(100 - np.mean(np.abs(y_prediction - y_test))*100))

logistic_regression(x_train, y_train, x_test, y_test, 1, 100)
Manual Test Accuracy: 86.89%
Let’s find out sklearn’s score.
SKLEARN LOGISTIC REGRESSION
lr = LogisticRegression()
lr.fit(x_train.T,y_train.T)
print("Test Accuracy {:.2f}%".format(lr.score(x_test.T,y_test.T)*100))
K-Nearest Neighbour (KNN) Classification
Let’s see what the score will be if we use the KNN algorithm.
# KNN Model
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 2) # n_neighbors means k
knn.fit(x_train.T, y_train.T)
prediction = knn.predict(x_test.T)

print("{} NN Score: {:.2f}%".format(2, knn.score(x_test.T, y_test.T)*100))
# try to find the best k value
scoreList = []
for i in range(1,20):
    knn2 = KNeighborsClassifier(n_neighbors = i)  # n_neighbors means k
    knn2.fit(x_train.T, y_train.T)
    scoreList.append(knn2.score(x_test.T, y_test.T))
plt.plot(range(1,20), scoreList)
plt.xticks(np.arange(1,20,1))
plt.xlabel("K value")
plt.ylabel("Score")
plt.show()

print("Maximum KNN Score is {:.2f}%".format((max(scoreList))*100))
Support Vector Machine (SVM) Algorithm
Now we’ll use SVM algorithm.
from sklearn.svm import SVC

svm = SVC(random_state = 1)
svm.fit(x_train.T, y_train.T)

print("Test Accuracy of SVM Algorithm: {:.2f}%".format(svm.score(x_test.T, y_test.T)*100))
Naive Bayes Algorithm
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(x_train.T, y_train.T)
print("Accuracy of Naive Bayes: {:.2f}%".format(nb.score(x_test.T,y_test.T)*100))
Decision Tree Algorithm
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(x_train.T, y_train.T)
print("Decision Tree Test Accuracy {:.2f}%".format(dtc.score(x_test.T, y_test.T)*100))
Random Forest Classification
# Random Forest Classification
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 1000, random_state = 1)
rf.fit(x_train.T, y_train.T)
print("Random Forest Algorithm Accuracy Score : {:.2f}%".format(rf.score(x_test.T,y_test.T)*100))
Comparing Models
methods = ["Logistic Regression", "KNN", "SVM", "Naive Bayes", "Decision Tree", "Random Forest"]
accuracy = [85.25, 90.16, 81.97, 85.25, 75.41, 85.25]
colors = ["purple", "green", "orange", "magenta","#CFC60E","#0FBBAE"]sns.set_style("whitegrid")
plt.figure(figsize=(16,5))
plt.yticks(np.arange(0,100,10))
plt.ylabel("Accuracy %")
plt.xlabel("Algorithms")
sns.barplot(x=methods, y=accuracy, palette=colors)
plt.show()
Our models work fine, but the best of them are KNN and Random Forest with 88.52% accuracy. Let’s look at their confusion matrices.
Confusion Matrix
# Predicted values
y_head_lr = lr.predict(x_test.T)
knn3 = KNeighborsClassifier(n_neighbors = 3)
knn3.fit(x_train.T, y_train.T)
y_head_knn = knn3.predict(x_test.T)
y_head_svm = svm.predict(x_test.T)
y_head_nb = nb.predict(x_test.T)
y_head_dtc = dtc.predict(x_test.T)
y_head_rf = rf.predict(x_test.T)

from sklearn.metrics import confusion_matrix

cm_lr = confusion_matrix(y_test, y_head_lr)
cm_knn = confusion_matrix(y_test, y_head_knn)
cm_svm = confusion_matrix(y_test, y_head_svm)
cm_nb = confusion_matrix(y_test, y_head_nb)
cm_dtc = confusion_matrix(y_test, y_head_dtc)
cm_rf = confusion_matrix(y_test, y_head_rf)

plt.figure(figsize=(24,12))
plt.suptitle("Confusion Matrices", fontsize=24)
plt.subplots_adjust(wspace = 0.4, hspace = 0.4)

plt.subplot(2,3,1)
plt.title("Logistic Regression Confusion Matrix")
sns.heatmap(cm_lr, annot=True, cmap="Blues", fmt="d", cbar=False)

plt.subplot(2,3,2)
plt.title("K Nearest Neighbors Confusion Matrix")
sns.heatmap(cm_knn, annot=True, cmap="Blues", fmt="d", cbar=False)

plt.subplot(2,3,3)
plt.title("Support Vector Machine Confusion Matrix")
sns.heatmap(cm_svm, annot=True, cmap="Blues", fmt="d", cbar=False)

plt.subplot(2,3,4)
plt.title("Naive Bayes Confusion Matrix")
sns.heatmap(cm_nb, annot=True, cmap="Blues", fmt="d", cbar=False)

plt.subplot(2,3,5)
plt.title("Decision Tree Classifier Confusion Matrix")
sns.heatmap(cm_dtc, annot=True, cmap="Blues", fmt="d", cbar=False)

plt.subplot(2,3,6)
plt.title("Random Forest Confusion Matrix")
sns.heatmap(cm_rf, annot=True, cmap="Blues", fmt="d", cbar=False)

plt.show()
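Beyond the raw counts in the confusion matrices, precision, recall and F1-score per class can be read off directly. A minimal sketch using sklearn's classification_report on the logistic regression predictions above:

from sklearn.metrics import classification_report

# Precision, recall and F1-score for the "no disease" (0) and "disease" (1) classes
print(classification_report(y_test, y_head_lr, target_names = ["no disease", "disease"]))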
Thanks for your time!
LinkedIn: https://www.linkedin.com/in/canerdabakoglu/
GitHub: https://github.com/cdabakoglu