Using current health information such as cholesterol, blood pressure, diabetes, age, and gender to predict ten-year heart disease risk
Introduction
Heart disease is the leading cause of death in the United States, claiming roughly one life every 33 seconds. Identifying the health factors that increase the risk of developing heart disease is therefore crucial to preventing it in the first place. In this analysis, we'll consider several health indicators: age, gender, cholesterol level, blood pressure, and diabetes status.
Logistic regression is a method for estimating the probability of an event occurring (heart disease, in this case) from a set of independent variables. Model selection will then be done to evaluate whether we actually need all of these predictors, or whether a simpler model provides a better fit. The selection will be based on the Akaike Information Criterion (AIC), where a lower score indicates a better model.
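For reference, the logistic model and the AIC are defined as:

$$P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}}, \qquad \mathrm{AIC} = 2k - 2\ln(\hat{L})$$

where k is the number of estimated parameters and L̂ is the maximized likelihood of the fitted model.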
Data Overview
The data is obtained from Kaggle. A total of 6 predictors will be used: age, gender, cholesterol level, diabetes, systolic blood pressure, and diastolic blood pressure. The response variable is the ten-year heart disease risk (TenYearCHD). Gender, diabetes, and TenYearCHD are binary (1 = male, 0 = female; 1 = yes, 0 = no), while the remaining variables are numerical. Let's visualize how each of these variables relates to heart disease risk.
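As a rough sketch of how these plots might be produced (the file name framingham.csv and the column names totChol and diaBP are assumptions based on the usual Kaggle dataset; adjust to the actual file):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assumed file name for the Kaggle dataset
data = pd.read_csv('framingham.csv')

# Compare each numeric predictor between the two outcome groups
numeric_vars = ['age', 'totChol', 'sysBP', 'diaBP']
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, var in zip(axes, numeric_vars):
    sns.boxplot(x='TenYearCHD', y=var, data=data, ax=ax)
plt.tight_layout()
plt.show()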
Looking through the graphs, all 6 variables appear to make a difference in heart disease risk, though some more than others. The group predicted to develop heart disease (TenYearCHD = 1) tends to be older and to have higher cholesterol levels and blood pressure. Next, a chi-square test of independence can be performed for the categorical predictors (diabetes and gender) to evaluate whether each is independent of the response variable.
import pandas as pd
from scipy import stats

# Contingency table of diabetes vs ten-year CHD
crosstab_CHD_vs_diabetes = pd.crosstab(data['diabetes'],
                                       data['TenYearCHD'],
                                       margins=False)
print(crosstab_CHD_vs_diabetes)

# Chi-square test of independence for diabetes
result_diabetes = stats.chi2_contingency(crosstab_CHD_vs_diabetes)
stat = result_diabetes[0]   # chi-square test statistic
pval = result_diabetes[1]   # p-value
print(f"Chi-square stat : {stat:.4f}")
print(f"p-value : {pval:.4f}")

# Contingency table of gender vs ten-year CHD
crosstab_CHD_vs_gender = pd.crosstab(data['gender'],
                                     data['TenYearCHD'],
                                     margins=False)
print(crosstab_CHD_vs_gender)

# Chi-square test of independence for gender
result_gender = stats.chi2_contingency(crosstab_CHD_vs_gender)
stat2 = result_gender[0]
pval2 = result_gender[1]
print(f"Chi-square stat : {stat2:.4f}")
print(f"p-value : {pval2:.4f}")
The null hypothesis for the independence test is that the two variables are independent; the alternative hypothesis is that they are not. Since the p-values for both categorical variables are less than 0.05, the null hypothesis can be rejected in each case. This means heart disease risk is associated with both gender and diabetes.
Model Fitting and Evaluation
K-fold cross validation will be used to evaluate the models. This method divides the dataset into k groups, with one group used as test data and the rest as training data. The procedure is repeated until every group has been used once as test data.
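Although the exact evaluation code isn't shown here, a minimal sketch of 5-fold cross validation (assuming X and y are NumPy arrays holding the predictors and the response) might look like this:

import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
accuracies = []
for train_idx, test_idx in kf.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Fit on the training folds, evaluate on the held-out fold
    model = sm.Logit(y_train, sm.add_constant(X_train)).fit(disp=0)
    preds = model.predict(sm.add_constant(X_test)) >= 0.5
    accuracies.append(np.mean(preds == y_test))
print(f"Mean CV accuracy: {np.mean(accuracies):.4f}")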
Afterwards, we'll use forward stepwise selection to choose the best model based on the lowest AIC score. This method starts by calculating the AIC for an intercept-only model and then adds predictors one at a time, keeping each addition only if it lowers the AIC. A sketch of the procedure is shown below, followed by the results.
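A minimal sketch of such a forward search, assuming X is a NumPy array whose columns are the six predictors and y is the binary response (this is one common way to implement it, not necessarily the exact code used):

import numpy as np
import statsmodels.api as sm

def forward_stepwise_aic(X, y):
    remaining = list(range(X.shape[1]))
    selected = []
    # Start from the AIC of an intercept-only model
    best_aic = sm.Logit(y, np.ones((len(y), 1))).fit(disp=0).aic
    improved = True
    while improved and remaining:
        improved = False
        # AIC of each candidate model that adds one more predictor
        aics = {j: sm.Logit(y, sm.add_constant(X[:, selected + [j]])).fit(disp=0).aic
                for j in remaining}
        best_j = min(aics, key=aics.get)
        if aics[best_j] < best_aic:
            best_aic = aics[best_j]
            selected.append(best_j)
            remaining.remove(best_j)
            improved = True
    return selected, best_aic

selected, best_aic = forward_stepwise_aic(X, y)
print(f"Selected predictor indices: {selected}, AIC: {best_aic:.2f}")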
The lowest AIC is found with only 3 predictors, those at index 0, 2, and 4: in our case, age, sysBP, and gender.
import numpy as np
import statsmodels.api as sm

# Refit the logistic model using only the three selected predictors
X_best = X[:, [0, 2, 4]]          # columns: age, sysBP, gender
X_best = sm.add_constant(X_best)  # add the intercept term
best_model = sm.Logit(y, X_best)
best_model_result = best_model.fit()
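To inspect the fitted coefficients and their standard errors, the statsmodels summary can be printed:

print(best_model_result.summary())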
To interpret the model, odds ratios can be calculated by exponentiating the coefficients (which are on the log-odds scale):
# coef for age as predictor
b1 = best_model_result.params[1]
# coef for sysBP as predictor
b2 = best_model_result.params[2]
# coef for gender as predictor
b3 = best_model_result.params[3]
odds_ratio_age = np.exp(b1)
print(f"OR for additional 1 year in age = {odds_ratio_age}")
odds_ratio_sysBP = np.exp(b2*10)
print(f"OR for additional 10 unit in sysBP = {odds_ratio_sysBP}")
odds_ratio_gender = np.exp(b3)
print(f"OR for male = {odds_ratio_gender}")
Based on the odds ratios, the odds of heart disease increase by 6.45% for each additional year of age, by 20.55% for every 10-unit increase in systolic blood pressure, and are 99.16% higher for males compared to females.
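In general, the odds ratio for a c-unit increase in a predictor with coefficient β is e^{cβ}; the age result above, for example, corresponds to

$$\mathrm{OR}_{\text{age}} = e^{\beta_{\text{age}}} \approx 1.0645,$$

i.e. 6.45% higher odds per additional year of age.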
Now that the best model has been selected, its predictive performance needs to be evaluated based on the confusion matrix.
# Confusion matrix at the default 0.5 threshold
model_matrix = best_model_result.pred_table(threshold=0.5)
TP = model_matrix[1, 1]  # true positives
TN = model_matrix[0, 0]  # true negatives
FP = model_matrix[0, 1]  # false positives
FN = model_matrix[1, 0]  # false negatives
print(f'There are {TP:.0f} true positives, {TN:.0f} true negatives, {FP:.0f} false positives, and {FN:.0f} false negatives')
accuracy = (TP+TN)/(TP+FP+TN+FN)
sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
print(f'The model can correctly predict {accuracy*100:.2f}% of the outcomes')
print(f'The model can correctly predict {sensitivity*100:.2f}% of heart disease actually occurring')
print(f'The model can correctly predict {specificity*100:.2f}% of heart disease not occurring')
The model has great accuracy and specificity but terrible sensitivity. The sensitivity is extremely low because the number of false negatives is far higher than the number of true positives. In other words, the model fails to identify the majority of people who will actually develop heart disease. Consequently, the model is unreliable as a screening tool, since it cannot warn those people to take the necessary precautions.
A possible solution is to lower the classification threshold. Previously, the standard threshold of 0.5 was used, meaning only those with a predicted probability of heart disease above 0.5 were classified as high risk. A lower threshold classifies people with lower predicted probabilities as high risk too. For example, if we decrease the threshold to 0.2:
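Recomputing the confusion matrix and the same metrics at the lower threshold:

# Rebuild the confusion matrix with the lower threshold
model_matrix_02 = best_model_result.pred_table(threshold=0.2)
TP = model_matrix_02[1, 1]
TN = model_matrix_02[0, 0]
FP = model_matrix_02[0, 1]
FN = model_matrix_02[1, 0]

accuracy = (TP + TN) / (TP + FP + TN + FN)
sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
print(f'Accuracy: {accuracy*100:.2f}%, sensitivity: {sensitivity*100:.2f}%, specificity: {specificity*100:.2f}%')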
The sensitivity improves drastically from 4.49% to 53.86%, but the accuracy and specificity drop as a result. Still, this seems like the better option given the large gain in sensitivity and the still-acceptable specificity. Alternatively, we could improve the model by collecting more samples with TenYearCHD = 1; the problem is likely caused by the fact that the current sample contains a much higher proportion of TenYearCHD = 0.
Conclusion and Recommendations
Among the 6 predictors considered, the best model contains only 3: age, systolic blood pressure, and gender. The model shows that males tend to have a higher heart disease risk, and that the risk increases with age and with higher systolic blood pressure. People matching these conditions should be advised to monitor their heart health more diligently, for example by exercising routinely or watching their eating habits.
The model doesn't have particularly high sensitivity, even after lowering the threshold from 0.5 to 0.2. This is probably because the sample data consists mainly of negative outcomes, so the model didn't have enough information to identify the positive outcome. Collecting more samples with positive outcomes might help balance the data and improve the model's sensitivity. If that is not an option due to time, monetary cost, or other constraints, over-sampling or under-sampling methods can be used instead.
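As a minimal sketch, random over-sampling of the minority class can be done with pandas alone (assuming the data is in a DataFrame named data; dedicated tools such as imbalanced-learn's RandomOverSampler or SMOTE are alternatives):

import pandas as pd

# Split the rows by outcome class
majority = data[data['TenYearCHD'] == 0]
minority = data[data['TenYearCHD'] == 1]

# Randomly duplicate minority rows until the classes are balanced, then shuffle
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)
print(balanced['TenYearCHD'].value_counts())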
Reference:
Code used for the analysis can be obtained from GitHub.