How to Predict Coronary Heart Disease Risk using Logistic Regression?

Nuzulul Khairu Nissa
Analytics Vidhya
Published in
11 min readApr 17, 2021

Heart disease is the leading cause of death worldwide, accounting for one third of deaths in 2019. Heart disease cases nearly doubled over the period, from 271 million in 1990 to 523 million in 2019, and the number of heart disease deaths rose from 12.1 million to 18.6 million.

Coronary Heart Disease (CHD) involves the reduction of blood flow to heart muscle due to build-up of plaque in the arteries of the heart.

One of the most challenging task is to identify the causes of this disease and prevent it to the extent possible. Medical diagnostic reasoning is becoming popular application of Machine Learning in which expert systems and model-based schemes provide mechanisms used for the development of hypotheses, which will be then tested using the modelling and simulation techniques or statistical analysis.

The prediction of heart disease is considered one of the most important topics in health domain. With the machine learning algorithms and having large amounts of data, it is possible to extrapolate information that can help doctors make more accurate predictions.

Prediction of CHD is a much complex challenge considering the level of expertise and knowledge required for accurate result. According to a survey by WHO, medical professionals can correctly predict heart disease with only 67% accuracy.

In this article, a number of independent variables such as sex, age, cigsPerDay, totChol, sysBP and glucose will be used along with a dependent variable (TenYearCHD class) during the training phase to build a classification model. The classification goal is to predict whether the patient has 10-year risk of future Coronary Heart Disease (CHD) or not.

Okay, let’s get started!

Step 1. Importing Required Libraries

import pandas as pd
import numpy as np
import statsmodels.api as sm
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn.metrics import confusion_matrix
import matplotlib.mlab as mlab
%matplotlib inline

Step 2. Data Preparation

The dataset comes from the Kaggle website (Framingham Heart study dataset). The dataset provides the patients’ information. It includes over 4,000 records and 15 attributes as mention below.

df = pd.read_csv('framingham.csv')
df.drop(['education'],axis=1,inplace=True)
df.rename(columns={'male':'sex_male'},inplace=True)
df.head()

Input variables:

  1. sex : male or female (Nominal)
  2. age : age of the patient (Continuous)
  3. currentSmoker : whether or not the patient is a current smoker (Nominal)
  4. cigsPerDay : the number of cigarettes that the person smoked on average in one day (Continuous)
  5. BPMeds : whether or not the patient was on blood pressure medication (Nominal)
  6. prevalentStroke : whether or not the patient had previously had a stroke (Nominal)
  7. prevalentHyp : whether or not the patient was Hypertension (Nominal)
  8. diabetes : whether or not the patient had diabetes (Nominal)
  9. totChol : total cholesterol level (Continuous)
  10. sysBP : systolic blood pressure (Continuous)
  11. diaBP : diastolic blood pressure (Continuous)
  12. BMI : Body Mass Index (Continuous)
  13. heartRate : heart rate (Continuous)
  14. glucose : glucose level (Continuous)
  15. TenYearCHD : 10 year risk of Coronary Heart Disease (CHD) (binary: 1 (Yes), 0 (No))

Step 3. Dealing with the Missing Values

df.isnull().sum()
count = 0
for i in df.isnull().sum(axis=1):
if i > 0:
count = count+1
print('Total number of rows with missing values is', count)

From the output above we got the total number of rows with missing value is 489. In this case, since it is only 12 % of the entire dataset, the rows with missing values are excluded.

We get the data from the excluding process like the output below:

df.dropna(axis=0, inplace=True)
df.info()

Step 4. Exploratory Visual Analysis

The purpose of this step is to see the distribution of each data.

def draw_histograms(dataframe, features, rows, cols):
fig=plt.figure(figsize=(20,20))
for i, feature in enumerate(features):
ax=fig.add_subplot(rows,cols,i+1)
dataframe[feature].hist(bins=20,ax=ax,facecolor='red')
ax.set_title(feature+"Distribution", color='blue')
fig.tight_layout()
plt.show()
draw_histograms(df, df.columns, 6, 3)

From the histogram above, we can understand better the distribution of the data.

The next step is to see how many patients are at risk of having Coronary Heart Disease or not.

sn.countplot(x='TenYearCHD', data=df)

From the figure above, we can conclude if there are 3179 patients with no Coronary Heart Disease and 572 patients with the risk of Coronary Heart Disease.

Step 5. Let’s Modelling!

Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable. It’s an extension of the linear regression model for classification problems. Unlike linear regression which outputs continuous number values, logistic regression transforms its output using the logistic sigmoid function to return a probability value which can then be mapped to two or more discrete classes.

The first step that we are going to do, in terms of doing the modeling process is to add a constant value to the cleaned dataset.

from statsmodels.tools import add_constant as add_constant
df_constant = add_constant(df)
df_constant.head()

And for the next step, we’re trying to see the parameter and the result of our logistic regression model:

st.chisqprob = lambda chisq, df: st.chi2.sf(chisq, df)
cols = df_constant.columns[:-1]
model = sm.Logit(df.TenYearCHD, df_constant[cols])
result = model.fit()
result.summary()

The result above, show some of the attributes with a P-value higher than the preferred alpha (5%), which suggests that the attributes have a low statistically significant relationship with the probability of Coronary Heart Disease (CHD).

To solve that, we use a Backward elemination method approach for removing those attributes with the highest P-value one at a time followed by running the regression repeatedly until all attributes have P-values less than 0.05.

Feature Selection : Backward Elemination (P-value Approach)

The dependent variable and a list of column names, runs the regression repeatedly eleminating feature with P-value above alpha (5%) one at a time and returns the regression summary with all p-values below alpha (5%).

def back_feature_elem (data_frame, dep_var, col_list):
while len(col_list)>0 :
model = sm.Logit(dep_var,data_frame[col_list])
result = model.fit(disp=0)
largest_pvalue = round(result.pvalues,3).nlargest(1)
if largest_pvalue[0]<(0.05):
return result
break
else:
col_list = col_list.drop(largest_pvalue.index)
result = back_feature_elem(df_constant, df.TenYearCHD, cols)
result.summary()

At the end, we get the attributes which p-values is less then alpha (5%) :

  • sex
  • age
  • cigsPerDay
  • totChol
  • sysBP
  • glucose

and we get this Logistic Regression equation:

Step 6. Interpreting: Odds Ratio, Confidence Intervals and P-Values

The next step, we’re going to interpret the Odds Ratio, Confidence Intervals and P-values, using these code:

params = np.exp(result.params)
conf = np.exp(result.conf_int())
conf['OR'] = params
pvalue = round(result.pvalues,3)
conf['pvalue'] = pvalue
conf.columns = ['CI 95%(2.5%)','CI 95%(97.5%)', 'Odds Ratio', 'pvalue']
print((conf))

The interpretation of the output above are:

  • The odds of getting diagnosed with Coronary Heart Disease (CHD) for males (sex=1) over that of females (sex=0) is exp(0.5815) = 1.788687. In terms of percent change, we can say that the odds for males getting diagnosed are 78.8% higher than the odds for females.
  • The age coefficient says, if we want to look at the likelihood of being diagnosed with CHD, by looking at the parameter of one year increase in age, it is about exp(0.0655) = 1.067644 or its around 7%.
  • Every extra cigarette one smokes, there is a 2% increase (exp(0.0197)= 1.019895) in the odds of getting diagnosed of CDH.
  • There is a 1.7% increase (exp(0.0174)=1.017552) in odds for every unit increase in systolic Blood Pressure.
  • There is 0.7% increase (exp(0.0076)=1.007628) in the odds of getting diagnosed of CDH for every unit increase in glucose level.

Step 7. Splitting Data : Train and Test

import sklearn
new_features = df[['age','sex','cigsPerDay','totChol','sysBP','glucose','TenYearCHD']]
x = new_features.iloc[:,:-1]
y = new_features.iloc[:,-1]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=.20, random_state=5)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)

Step 8. Model Evaluation

Model evaluation aims to estimate the generalization accuracy of a model on future (unseen/out-of-sample) data.

1. Accuracy

Accuracy desccribes how accurately a model can classify correctly. It’s the number of correct predictions made as a ratio of all predictions made. We use sklearn module to compute the accuracy of a classification task.

sklearn.metrics.accuracy_score(y_test,y_pred)

From that output, we can interpret if the accuracy of the model that we get is about 87.5% on the validation set.

2. Confusion Matrix

The confusion matrix detects the count of TP (True Positive), TN (True Negative), FP (False Positive), FN (False Negative) in the predictions of a classifier.

From Confusion matrix we can derive the Accuracy which is given by the sum of the corrected predictions divided by the total number of predictions:

  • Accuracy = TP+TN/TP+FP+FN+TN

Also another standard performance measures:

  • Sensitivity or Recall = TP / TP + FN or True-Positive-Rate (TPR)
  • Specificity = TN / TN + FP or True-Negative-Rate (TNR)
  • Precision = TP / TP + FP or Positive Predicted Value
  • False-Positive-Rate (FPR) = FP / FP + TN
  • False-Negative-Rate (FNR) = FN / FN + TP
  • F1 Score = 2*(Recall * Precision) / (Recall + Precision)

For good classifiers, TPR and TNR both should be nearer to 100%. Similar is the case with precision and accuracy parameters. On the contrary, FPR and FNR both should be as close to 0% as possible.

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
conf_matrix = pd.DataFrame(data=cm, columns=['Predicted:0','Predicted:1'],index=['Actual:0','Actual:1'])
plt.figure(figsize = (8,5))
sn.heatmap(conf_matrix, annot=True, fmt='d', cmap='YlGnBu')

The confusion matrix above shows 652+5 = 657 correct predictions and 87+5= 92 incorrect ones.

  • True Positives (TP) : 5
  • True Negatives (TN) : 652
  • False Positives (FP) : 7 (Type I error)
  • False Negatives (FN) : 87 (Type II error)
TN = cm[0,0]
TP = cm[1,1]
FN = cm[1,0]
FP = cm[0,1]
sensitivity = TP/float(TP+FN)
specificity = TN/float(TN+FP)
print('The acuuracy of the model = TP+TN/(TP+TN+FP+FN) = ',(TP+TN)/float(TP+TN+FP+FN),'\n','The Missclassification = 1-Accuracy = ',1-((TP+TN)/float(TP+TN+FP+FN)),'\n','Sensitivity or True Positive Rate = TP/(TP+FN) = ',TP/float(TP+FN),'\n','Specificity or True Negative Rate = TN/(TN+FP) = ',TN/float(TN+FP),'\n','Positive Predictive value = TP/(TP+FP) = ',TP/float(TP+FP),'\n','Negative Predictive Value = TN/(TN+FN) = ',TN/float(TN+FN),'\n','Positive Likelihood Ratio = Sensitivity/(1-Specificity) = ',sensitivity/(1-specificity),'\n','Negative Likelihood Ratio = (1-Sensitivity)/Specificity = ',(1-sensitivity)/specificity)
  1. The accuracy of the model = 87.48 %
  2. The Missclassification = 12.52 %
  3. Sensitivity or True-Positive-Rate (TPR) = 54.35 %
  4. Specificity or True-Negative-Rate (TNR) = 98.94 %
  5. Precision or Positive Predicted Value = 41.67 %
  6. Negative Predicted Value = 88.23 %
  7. Positive Likelihood Ratio = 5.12
  8. Negative Likelihood Ratio = 0.96

From the above statistics it is clear that the model is highly specific than sensitive. The negative values are predicted more accurately than the positives.

Step 9. Predicted Probabilities

0 (Coronary Heart Disease: No) and 1 (Coronary Heart Disease: Yes) for the test data with a default classification threshold of 0.5

y_pred_prob = logreg.predict_proba(x_test)[:,:]
y_pred_prob_df = pd.DataFrame(data=y_pred_prob, columns=['Prob of no hearts disease(0)','Prob of Heart Disease (1)'])
y_pred_prob_df.head()

Lower Threshold

From the confusion matrix, we can conclude that if the number of False Negatives (FN) (Type II error) is very large, it can be classified as moderately dangerous because, it means ignoring the probability of disease when there actualy is one OR True.

Hence inorder to increase the sensitivity, threshold can be lowered.

from sklearn.preprocessing import binarize
for i in range(1,5):
cm2=0
y_pred_prob_yes=logreg.predict_proba(x_test)
y_pred2=binarize(y_pred_prob_yes,i/10)[:,1]
cm2=confusion_matrix(y_test,y_pred2)
print ('With',i/10,'threshold the Confusion Matrix is ','\n',cm2,'\n',
'with',cm2[0,0]+cm2[1,1],'correct predictions and',cm2[1,0],'Type II errors( False Negatives)','\n\n',
'Sensitivity: ',cm2[1,1]/(float(cm2[1,1]+cm2[1,0])),'Specificity: ',cm2[0,0]/(float(cm2[0,0]+cm2[0,1])),'\n\n\n')

3. ROC Curve

Area under ROC curve is a performance metric for measuring the ability of a binary classifier to discriminate between positive and negative classes. ROC is a probability curve and AUC represents a measure of separability. It tells how much model is capable of distinguishing between classes.

from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob_yes[:,1])
plt.plot(fpr,tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for Heart disease classifier')
plt.xlabel('False positive rate (1-Specificity)')
plt.ylabel('True positive rate (Sensitivity)')
plt.grid(True)

Area Under the Curve (AUC)

The area under the ROC curve quantifies model classification accuracy; the higher the area, the greater the disparity between true and false positives, and the stronger the model in classifying members of the training dataset. An area of 0.5 corresponds to a model that performs no better than random classification and a good classifier stays as far away from that as possible. An area of 1 is ideal. The closer the AUC to 1 the better.

sklearn.metrics.roc_auc_score(y_test,y_pred_prob_yes[:,1])

Conclusion:

  • All attributes selected after the elimination process show P-values lower than 5% and thereby suggesting the attributes that have been selected have significant role in the Coronary Heart Disease (CHD) prediction.
  • Men seem to be more susceptible to heart disease than women. Increase in Age, number of cigarettes smoked per day and systolic Blood Pressure also show increasing odds of having heart disease.
  • The model predicted with 87.5% accuracy. The model is more specific than sensitive.
  • The Area under the ROC curve is 73.5 which is somewhat satisfactory.
  • Overall model could be improved with more data.

References:

--

--