Predictive Analytics on Cardiovascular Diseases based upon Diet Intake

A data science project to detect the risk of diseases in patients

Shreya Raghuvansh
Clique Community
8 min read · Sep 5, 2021

--

Recently I got this amazing opportunity to collaborate with members of the ‘Clique’ community and develop a data science project on how nutrition and food consumption can help us predict the chances of various diseases in patients. This post primarily focuses on detecting cardiovascular disease (CVD).

The basic purpose of the project is to predict the risk of developing CVD and provide early warnings on the basis of a person’s diet, so that they can make the necessary changes to their food intake and reduce the risk. It can also help them understand which nutrients their diet lacks and thus help them balance it.

This article outlines how I went about the project and the final results obtained.

Table of Contents

  1. Dataset Selection
  2. Data Pre-processing and Exploration
  3. Feature Selection and Data Modeling
  4. Model Results and Interpretation
  5. Deployment using Streamlit library
  6. Conclusion and Future Scope

1. Dataset Selection

In this project, I chose to work on the National Health and Nutrition Examination Survey (NHANES) (2013–14) data, consisting of questions about demographics, socioeconomics, dietary intake, and health. I gathered the data from the demographic, diet, laboratory, and questionnaire datasets. The target variable for the project was total cholesterol (‘LBXTC’) from laboratory.csv, and my approach was to treat it as a binary classification problem.

If a person had total cholesterol ≤ 200 mg/dL, they were classified as ‘Good’, and if total cholesterol > 200 mg/dL, as ‘Risk’.
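For reference, this binarization is a one-liner with NumPy. Here ‘lab’ stands for the loaded laboratory data and ‘class’ for the new target column; both names are illustrative rather than the exact ones from my notebook.

import numpy as np

# Total cholesterol <= 200 mg/dL -> 'Good', otherwise 'Risk'
lab["class"] = np.where(lab["LBXTC"] <= 200, "Good", "Risk")
print(lab["class"].value_counts())   # class counts shown in Figure 1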

Figure 1: Total count of values for each class of ‘Good’ and ‘Risk’

2. Data Pre-processing and Exploration

For this part, I started by installing and importing the libraries and packages required to build the model. I used the Sklearn, Shap, and Scipy libraries for model building and data interpretation. Quite a few modules from the sklearn library are used for data preprocessing and modeling, which you can refer to here.

The next step was to load the data into the notebook using pd.read_csv(). After loading the data, I computed the missing-value percentage for each variable in the demographic.csv and dietary.csv files, sorted in ascending order. After that, I set the target column to “LBXTC” (total cholesterol) from lab.csv and picked some important categorical variables such as “smoking” and “diabetes” from the questionnaire.csv file.
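A minimal sketch of this step, assuming the files were exported from NHANES under these names (the exact file names and paths may differ):

import pandas as pd

# Load the four NHANES (2013-14) files
demographic = pd.read_csv("demographic.csv")
dietary = pd.read_csv("dietary.csv")
lab = pd.read_csv("lab.csv")
questionnaire = pd.read_csv("questionnaire.csv")

# Missing-value percentage per column, in ascending order
missing_pct = dietary.isnull().mean().mul(100).sort_values()
print(missing_pct)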

I dropped all the columns from dietary.csv with more than 70% missing values, since they could cause undesired results. All four datasets share one common identifier, “SEQN”, on the basis of which the complete data were merged. The remaining missing values were then imputed using simple linear interpolation.

# Fill the remaining gaps in each feature column by linear interpolation
for c in features:
    df[c] = df[c].interpolate(limit_direction='both')

After this, I segregated the categorical and numerical columns of the dataset. The categorical columns in NHANES are encoded as numbers; in this dataset four of them were stored as integer values. I then changed the datatype of the categorical columns ‘Gender’, ‘Dietary recall status’, ‘Angina’, ‘Smoked’, and ‘Diabetes’ to category.

After data preprocessing, the next step was to split the data into train and test sets in a 70:30 ratio, with “class” as the target column. Using a sklearn Pipeline, I applied a SimpleImputer and StandardScaler to the numerical columns and one-hot encoding to the categorical columns, so that each category is binarized and included as a feature to train the model. Lastly, I combined these transformations with a ColumnTransformer, encoded the target with LabelEncoder, fitted the data using fit_transform, and saved the feature names in a list.
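The sketch below shows roughly how this preprocessing can be wired together; the imputation strategy, random seed, and variable names are my assumptions rather than the exact choices from the notebook.

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder

# 70:30 split with 'class' as the target
X = df.drop(columns=["class"])
y = LabelEncoder().fit_transform(df["class"])           # 'Good'/'Risk' -> 0/1
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

numeric_cols = X.select_dtypes(include="number").columns
cat_cols = X.select_dtypes(include="category").columns

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    # sparse=False keeps a dense array (sparse_output=False on sklearn >= 1.2)
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse=False), cat_cols),
])

X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)

# Save the transformed feature names in a list
# (get_feature_names on older sklearn versions)
cat_names = preprocess.named_transformers_["cat"].get_feature_names_out(cat_cols)
feature_names = list(numeric_cols) + list(cat_names)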

After splitting, various utility functions such as feature_importance_plot, confusion_plot, roc_plot, precision_recall_plot, etc. were defined for the evaluation of the different classifiers.
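These helpers are not reproduced here in full, but a rough sketch of two of them, built on sklearn.metrics and matplotlib, could look like this:

import matplotlib.pyplot as plt
from sklearn.metrics import (confusion_matrix, ConfusionMatrixDisplay,
                             roc_curve, auc)

def confusion_plot(model, X, y):
    # Confusion matrix of a fitted classifier on the given data
    cm = confusion_matrix(y, model.predict(X))
    ConfusionMatrixDisplay(cm, display_labels=["Good", "Risk"]).plot()
    plt.show()

def roc_plot(model, X, y):
    # ROC curve and AUC of a fitted classifier on the given data
    fpr, tpr, _ = roc_curve(y, model.predict_proba(X)[:, 1])
    plt.plot(fpr, tpr, label="AUC = %.2f" % auc(fpr, tpr))
    plt.plot([0, 1], [0, 1], linestyle="--")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()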

3. Feature Selection and Data Modeling

Recursive Feature Elimination (RFE) [1] and Forward Selection are the two methods that I used on the combined diet and demographic variables for feature selection.

The RFE algorithm searches for a subset of features by starting with all the features in the training dataset and successively removing features until only the desired number remain. It then assigns a rank to each variable.

Forward selection, meanwhile, is an iterative method in which we start with no features in the model. In each iteration, we add the feature that best improves the model, until adding a new variable no longer improves its performance.

After RFE and forward selection, the features from RFE with rank > 0.5 and the features chosen by the forward selector were intersected into a single feature list.
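Sketched with sklearn, the two selectors and the intersection could look like the following. I am using SequentialFeatureSelector for the forward step and a logistic-regression base estimator here; the base estimator and the number of features to keep are illustrative choices, not necessarily the ones used in the project.

from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

base = LogisticRegression(max_iter=1000)

# RFE: start from all features and successively eliminate the weakest ones
rfe = RFE(base, n_features_to_select=20).fit(X_train_t, y_train)
rfe_features = [f for f, keep in zip(feature_names, rfe.support_) if keep]

# Forward selection: add one feature per iteration while the CV score improves
sfs = SequentialFeatureSelector(base, n_features_to_select=20,
                                direction="forward").fit(X_train_t, y_train)
fwd_features = [f for f, keep in zip(feature_names, sfs.get_support()) if keep]

# Intersect the two results into a single feature list
selected = sorted(set(rfe_features) & set(fwd_features))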

After getting the combined feature list, the correlation heatmap was plotted. The highly correlated predictors were removed after analyzing the heatmap.
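A sketch of the heatmap and the pruning step, assuming the selected features from the previous step and a 0.9 correlation threshold (the threshold is my illustration, not necessarily the one actually used):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

X_sel = pd.DataFrame(X_train_t, columns=feature_names)[selected]
corr = X_sel.corr()
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()

# Drop one feature from each pair with |correlation| above the threshold
upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
final_features = [f for f in selected if f not in to_drop]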

Figure 2: Feature Dataset

Coming to the model selection part, I tried various models on the dataset, all of which are listed below:

Logistic Regression

I used Logistic Regression as a baseline model for training. The F1-score for the “Risk” class came out to just 0.16 on the test data.

The F1-score combines the precision and recall of the model; it is defined as the harmonic mean of the two.
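A minimal sketch of the baseline and its F1-score, assuming the transformed train/test splits from the preprocessing step:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

logreg = LogisticRegression(max_iter=1000).fit(X_train_t, y_train)
y_pred = logreg.predict(X_test_t)

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = 2 * precision * recall / (precision + recall)   # same value as f1_score(y_test, y_pred)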

Figure 3: Logistic Regression metrics

XGBoost

After logistic regression, XGBoost [2] was implemented using XGBClassifier with the default parameters. The F1-score improved to 0.28 here.

Figure 4: XGBoost metrics using default parameters
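With the sklearn-style wrapper, the default XGBoost run takes only a few lines (variable names follow the earlier sketches):

from xgboost import XGBClassifier
from sklearn.metrics import classification_report

xgb = XGBClassifier()                      # default parameters
xgb.fit(X_train_t, y_train)
print(classification_report(y_test, xgb.predict(X_test_t)))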

I performed hyperparameter tuning on XGBoost to get a more optimal output. For this, Optuna was used, which implements Sequential Model-Based Optimization. An objective function was defined and, using the ‘trial’ module, the following hyperparameters were tuned: learning_rate, num_boost_round, max_depth, min_child_weight, n_estimators, scale_pos_weight, and colsample_bytree. Using study() and its pruning functionality, the model was optimized. The F1-score rose to 0.54, as one can see below.

Figure 5: XGBoost metrics after hyperparameter tuning
Figure 6: F1-score for XGBoost models (trials = value of n_estimators)
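The Optuna study can be set up roughly as below. The search ranges are my own illustration, num_boost_round is expressed through n_estimators in the sklearn wrapper, and the pruning callback is omitted for brevity.

import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def objective(trial):
    # Sample a candidate set of hyperparameters (ranges are illustrative)
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "scale_pos_weight": trial.suggest_float("scale_pos_weight", 1.0, 5.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
    }
    model = XGBClassifier(**params)
    # Cross-validated F1 on the training data is the value Optuna maximizes
    return cross_val_score(model, X_train_t, y_train, scoring="f1", cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)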

Random Forest

Lastly, I used Random Forest for model training; with the default parameters, the F1-score was 0.52, and this proved to be the best model among those tried.

Figure 7: Random Forest metrics

4. Model Results and Interpretation

I plotted the learning curve using the Yellowbrick suite of visualization tools. The scores were between 0.5 and 0.6. Here, score(X_train, y_train) measures the accuracy of the model on the training data. Observing the graph, the model proved to be a good fit, with a small generalization gap.

Figure 8: Learning curve of the model
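With Yellowbrick, the learning curve takes only a few lines; the scoring metric and number of CV folds below are my assumptions.

from sklearn.ensemble import RandomForestClassifier
from yellowbrick.model_selection import LearningCurve

viz = LearningCurve(RandomForestClassifier(), scoring="f1", cv=5)
viz.fit(X_train_t, y_train)   # trains on increasing subsets and records train/CV scores
viz.show()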

I used the ROC curve for visualization. ROC curves are used in binary classification to study the output of a classifier. Observing the ROC curve of the random forest, the area under the curve (AUC) is 0.71, and the curve rises steeply along the true positive rate (Y-axis).

Figure 9: ROC Curve for Final Model (Random Forest)
Figure 10: Comparison between Different Models

Using SHAP’s summary plot, the feature importance of the model was plotted. SHAP values [3] are based on Shapley values, a game-theoretic approach to explaining the output of a model.
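A minimal sketch of the summary plot, where rf_model stands for the fitted random forest from the previous section; note that for sklearn classifiers, shap_values may come back as a list with one array per class, depending on the SHAP version.

import shap

# TreeExplainer works for tree ensembles such as the random forest
explainer = shap.TreeExplainer(rf_model)
shap_values = explainer.shap_values(X_test_t)

# Summary of feature importance across the test set (Figure 11)
shap.summary_plot(shap_values, X_test_t, feature_names=feature_names)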

From the SHAP values, we can observe that the demographic variable ‘Gender’ was the most important feature of the model. It was interesting to note that the seafood diet had a major contribution in influencing the prediction of the disease. There was an equal distribution for both classes (‘Good’ and ‘Risk’).

Figure 11: SHAP feature importance plot for random forest

The overall accuracy of the model came out to be 62%.

5. Deployment using Streamlit

To deploy the model as a webpage, the Streamlit library proved to be a great resource. Streamlit is an app framework for deploying machine learning apps built with Python. I created a GitHub repository and saved the model as a pickle file before finally deploying it.
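The app script follows the usual Streamlit pattern: load the pickled model, collect the inputs, and predict. The file name and the two inputs below are only illustrative; the real dashboard collects the full feature set and applies the same preprocessing as the notebook.

import pickle
import numpy as np
import streamlit as st

# Load the trained model saved earlier as a pickle file (path is illustrative)
with open("cvd_model.pkl", "rb") as f:
    model = pickle.load(f)

st.title("Predictive Analytics on CVD based upon Diet Intake")

gender = st.selectbox("Gender", ["Male", "Female"])
seafood = st.number_input("Seafood intake (grams per day)", min_value=0.0)

if st.button("Predict"):
    row = np.array([[1.0 if gender == "Male" else 2.0, seafood]])
    label = model.predict(row)[0]
    st.write("Predicted class:", "Risk" if label == 1 else "Good")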

Figure 12: Dashboard for CVD model (gif)

Visit: Predictive Analytics using Diet Intake Dashboard

6. Conclusion and Future Scope

This project involved analyzing the dietary intake of patients to predict the risk of cardiovascular disease.

Similarly, models can be trained for other diseases as well, which would help patients improve their diets and cut costs. This would make patients more aware of their dietary intake and give them an idea of which nutrients they need to include to stay healthy.

I had lots of fun while working on this project and got to learn a lot from this experience. I plan to work on improving the accuracy of the model and integrating it with a bigger prototype that helps in detecting the risk of multiple diseases.

I hope you enjoyed reading this post. Please feel free to contact me with questions, comments, and topics you’d like me to write about. It would help improve my content a lot!

References:

[1] https://www.scikit-yb.org/en/latest/api/model_selection/rfecv.html

[2] https://machinelearningmastery.com/xgboost-for-imbalanced-classification/

[3] https://shap.readthedocs.io/en/latest/index.html
