How your diet has an impact on developing Anemia: A predictive analysis

Published in

Clique Community

6 min read · Aug 20, 2021

A step-by-step approach to anemia prediction

Image as featured in an Editorial by the National Cancer Institute on Unsplash

Your dietary habits have a huge impact on your quality of life! Across the world, diet has become a major area of concern, as an increasing trend in diseases has been observed. Anemia is a very common condition that affects 30% of the total population! This blog post aims to identify the factors associated with anemia by examining dietary intake.

Data Preprocessing

Data from the National Health and Nutrition Examination Survey (NHANES), a major program of the National Center for Health Statistics (NCHS), covering the period 2013–2018 was considered for analysis. It includes several kinds of data: demographics, dietary, examination, laboratory, and questionnaire datasets.

Data preparation

Missing values were handled by imputation, and features with a high percentage of missing values were excluded from the analysis.
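As a rough sketch of this step, assuming a pandas DataFrame, an illustrative 50% missingness cutoff, and median imputation (the post does not state the actual threshold or imputation strategy, and the `MOSTLY_MISSING` column is hypothetical), the two operations could look like this:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for the merged NHANES data (values are made up).
df = pd.DataFrame({
    "LBXHGB": [13.5, np.nan, 11.8, 14.2],
    "BMXBMI": [22.1, 27.4, np.nan, 30.0],
    "MOSTLY_MISSING": [np.nan, np.nan, np.nan, 1.0],  # hypothetical column
})

MAX_MISSING = 0.5  # assumed cutoff, not taken from the post

# Keep only columns whose share of missing values is at or below the cutoff.
keep = df.columns[df.isna().mean() <= MAX_MISSING]
reduced = df[keep]

# Fill the remaining gaps with each column's median.
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(
    imputer.fit_transform(reduced), columns=keep, index=df.index
)
print(imputed)
```

Median imputation is only one option; the right strategy depends on the feature type and on how the values came to be missing.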

Target variable

Anemia, the target variable, was created from hemoglobin levels, an indicator of iron levels. The hemoglobin count, `LBXHGB` (g/dL), was taken from the laboratory data. Females with a count below 12 and males with a count below 13 were classified as anemic. This resulted in a binary classification problem with two classes: 0 = no anemia, 1 = anemia.
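The rule above can be sketched in a few lines of pandas. The sex column name `RIAGENDR` (1 = male, 2 = female) is an assumption based on the NHANES demographics file, and the rows are made up for illustration:

```python
import pandas as pd

# Toy rows; RIAGENDR (1 = male, 2 = female) is assumed to be the sex column.
df = pd.DataFrame({
    "RIAGENDR": [1, 1, 2, 2],
    "LBXHGB": [12.5, 14.0, 11.5, 12.3],
})

# Sex-specific cutoffs from the text: below 13 for males, below 12 for females.
threshold = df["RIAGENDR"].map({1: 13.0, 2: 12.0})

# 1 = anemia, 0 = no anemia
df["anemia"] = (df["LBXHGB"] < threshold).astype(int)
print(df)
```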

Feature selection

· Important features were selected from demographic, examination, and laboratory data which had around 200 features.

· Recursive feature elimination (RFE) is a feature selection method that fits a model and removes the weakest feature (or features) until the specified number of features is reached.[1]

· The dietary data, however, had many more features, from which the top 40 were selected with the help of recursive feature elimination. Some of the important features included lobster, clam, mussel, and crayfish intake in the last 30 days, along with cancer diagnosis, pregnancy, BMI, and glycohemoglobin from the other datasets. Model building was then needed to finalize the most important features.

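# Split the dataset into train and test sets
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split

seed = 195
test_size = 0.3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)

# Recursive feature elimination with a Random Forest Regressor as the estimator
rfe = RFE(estimator=RandomForestRegressor(), n_features_to_select=30)

# Fit to the training data
_ = rfe.fit(X_train, y_train)

# Keep only the columns that RFE selected
newvar = X.loc[:, rfe.support_]
newvar.describe()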

Model building

I tried out two models: logistic regression and XGBoost.

1. Logistic regression

Since this is a binary classification problem, logistic regression was tried first because it is simple and fast. The results were not satisfactory: it failed to predict instances of anemia because of the class imbalance.

2. XGBoost

XGBoost is short for Extreme Gradient Boosting and is an efficient implementation of the stochastic gradient boosting machine learning algorithm.[2]

It is an ensemble method, and gradient boosting showed good results on this classification problem.

The class imbalance was tackled by adding the hyperparameter scale_pos_weight, which is designed to tune the behavior of the algorithm for imbalanced classification problems. It scales the errors made by the model on the positive class during training and encourages the model to over-correct them.

from xgboost import XGBClassifier

xgb_clf_tuned_1 = XGBClassifier(**params1, random_state=45, n_jobs=-1)
xgb_clf_tuned_1.fit(x_train, y_train)
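For context on where a value for scale_pos_weight might start, a commonly suggested heuristic is the ratio of negative to positive training examples. The sketch below uses made-up labels. It is not how the value in this post was obtained, since that one came out of the tuning described in the next section.

```python
import numpy as np

# Toy labels: 1 = anemia (minority class), 0 = no anemia.
y_train_demo = np.array([0] * 80 + [1] * 20)

n_negative = int((y_train_demo == 0).sum())
n_positive = int((y_train_demo == 1).sum())

# Heuristic starting value: number of negatives divided by number of positives.
scale_pos_weight_start = n_negative / n_positive
print(scale_pos_weight_start)  # 4.0
```

A value above 1 makes errors on the anemia class count more during training, which is why it helps when that class is rare.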

Hyperparameter Tuning of XGBoost using OPTUNA

Optuna is a powerful hyperparameter optimization framework based on Bayesian methods. The XGBoost hyperparameters were tuned in groups, in order of importance, against the chosen primary evaluation metric, the F1 score.

Some of the most important parameters are learning_rate, max_depth, and min_child_weight. Parameters of lower importance include subsample, colsample_bytree, and the regularization terms. A search range was set for each parameter, and after several trials the best-performing parameters were chosen.

I also used stratified k-fold cross-validation, a procedure for estimating model performance on new data by partitioning the data into subsets called folds. When the folds are stratified, each one contains approximately the same percentage of samples of each target class, here anemia and no anemia.

The objective function is defined with the trial parameters; it contains the model logic and returns the score. A trial is a single execution of the objective function. A study is an optimization session based on an objective function, and an Optuna study object is created to manage the optimization. It records every trial with its objective value and parameter values, and the best set is found by comparing the trials.
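To make the stratification concrete, here is a small sketch on made-up labels (20% positives) that checks the share of the positive class in each validation fold. The data and fold count are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 100 samples, 20% of them in the positive (anemia) class.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_rates = []
for train_idx, val_idx in skf.split(X_demo, y_demo):
    # Share of anemia cases in this validation fold.
    fold_rates.append(float(y_demo[val_idx].mean()))

print(fold_rates)  # every fold keeps the 20% positive rate
```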

PARAMETERS USED

params1 = {'learning_rate': 0.02578297296713958,
'num_boost_round': 585,
'objective': 'binary:logistic',
'max_depth': 7,
'min_child_weight': 18.93704847111416,
'n_estimators': 754,
'min_samples_split': 9,
'scale_pos_weight': 3.425510435559797,
'subsample': 0.505407237617943,
'colsample_bytree': 0.5538227649946694}

Results

MODEL RESULTS-XGBOOST

Evaluation metrics-Classification

The F1 score, AUC, and ROC curve were primarily used for model evaluation.

The model shows a satisfactory F1 score for both classes, anemia and no anemia, which indicates that the classifier is reasonably precise and robust. Accuracy could not be used because of the class imbalance.

The model shows an AUC of 0.73, which indicates that the classifier is able to distinguish between the classes reasonably well. The larger the area under the ROC curve, the better the model. AUC remains useful even when there is a high class imbalance.
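To illustrate why accuracy is misleading here, the sketch below compares a baseline that always predicts "no anemia" with a trained classifier on synthetic imbalanced data. The data and the logistic regression stand-in are assumptions for illustration only; the numbers it prints are unrelated to the results reported above.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 10% positives.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Baseline that always predicts the majority class ("no anemia").
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
base_pred = baseline.predict(X_te)
base_acc = accuracy_score(y_te, base_pred)
base_f1 = f1_score(y_te, base_pred, zero_division=0)

# Stand-in classifier evaluated with F1 and AUC.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
clf_f1 = f1_score(y_te, clf.predict(X_te))
clf_auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"baseline accuracy={base_acc:.2f}, baseline F1={base_f1:.2f}")
print(f"classifier F1={clf_f1:.2f}, classifier AUC={clf_auc:.2f}")
```

The baseline scores high on accuracy while never finding a single positive case, which is exactly the failure that F1 and AUC expose.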

Learning Curve

A learning curve shows how the model performs as a function of the number of training samples it is fed.

· Train Learning Curve: It gives an idea of how well the model is learning on the training data.

· Validation Learning Curve: It is calculated from a validation dataset and gives an idea of how well the model is generalizing.

The model performs better with more samples, which suggests it could improve further when trained on more data. The training score is higher than the cross-validation score.
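A learning curve of this kind can be computed with scikit-learn's `learning_curve`. The sketch below is an assumption-laden illustration: it uses synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in for the tuned XGBoost model, with F1 as the score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, learning_curve

# Synthetic imbalanced data standing in for the real features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=0
)

train_sizes, train_scores, val_scores = learning_curve(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    X_demo,
    y_demo,
    train_sizes=np.linspace(0.2, 1.0, 4),  # fractions of the training fold
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="f1",
)

# Average over the CV folds to get one training and one validation point per size.
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{n:4d} samples: train F1={tr:.3f}, validation F1={va:.3f}")
```

Plotting the two averaged series against `train_sizes` gives the train and validation curves described above.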

SHAP Interpretation

SHAP measures the impact of variables, taking into account their interactions with other variables.[3] Shapley values calculate the importance of a feature by comparing what a model predicts with and without that feature.

SHAP values are shown on the x-axis. Variables are ranked in descending order of feature importance. The horizontal location shows whether the effect of that value is associated with a higher or lower prediction. Color shows whether the variable's value is high (red) or low (blue) for that observation.
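The "with and without" idea can be worked through exactly on a tiny example. The sketch below computes Shapley values by brute force for a made-up two-feature model, replacing "absent" features with a baseline value. The model, instance, and baseline are all hypothetical, and the shap library's TreeExplainer uses a much more efficient tree-specific algorithm rather than this enumeration.

```python
from itertools import combinations
from math import factorial

def model(x):
    # Toy model with an interaction term between the two features.
    return 2 * x[0] + 3 * x[1] + x[0] * x[1]

def shapley_values(f, instance, baseline):
    n = len(instance)

    def value(subset):
        # Features in `subset` take the instance's values; the rest stay at baseline.
        x = [instance[i] if i in subset else baseline[i] for i in range(n)]
        return f(x)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(len(others) + 1):
            for subset in combinations(others, size):
                # Weight of this coalition in the Shapley average.
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                # Marginal contribution of feature i when added to the coalition.
                phi += weight * (value(set(subset) | {i}) - value(set(subset)))
        phis.append(phi)
    return phis

phis = shapley_values(model, instance=[1, 2], baseline=[0, 0])
print(phis)  # [3.0, 7.0]
```

The two values add up to 10, the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split evenly between the two features.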

import shap

# SHAP values
shap_values = shap.TreeExplainer(xgb_clf_tuned_1).shap_values(x_train)
# Summary plot
shap.summary_plot(shap_values, x_train)

Inference

· Lower values of glycohemoglobin increase the chances of anemia.

· Gender (1 = male, 2 = female): positive SHAP values for males indicate higher chances of anemia.

· The lower the weight, the higher the chances of anemia.

· For age and BMI, no consistent threshold was observed from which to draw an inference.

CONCLUSION

When diet features are taken into consideration, a seafood-based diet has a high impact on anemia prediction, as many of the top features are seafood items such as scallops, bass, mussels, and oysters. A person's dietary intake has a strong impact on anemia diagnosis, and keeping a regular record of dietary habits can greatly reduce the risk of disease.

REFERENCES

1) https://www.scikit-yb.org/en/latest/api/model_selection/rfecv.html

2) https://machinelearningmastery.com/xgboost-for-imbalanced-classification/

3) https://www.r-bloggers.com/2019/03/a-gentle-introduction-to-shap-values-in-r/

Project Link