For the final project, I chose to further analyze the dataset I used for my EDA. The dataset records the obesity levels of people from Mexico, Peru, and Colombia alongside their eating habits and physical condition. As the project asked us to build a machine learning model, I was interested in building an accurate model that predicts whether a person is obese or not (a two-class problem), as well as finding the features that would be most relevant in training this model.
The dataset I used has 17 attributes for 2111 individuals aged 14 to 61. Many of these attributes have acronyms, so I briefly describe all of them below:
- Gender: 1 = female, 2 = male
- Age: numeric
- Height: numeric, in meters
- Weight: numeric, in kilograms
- family_history (family history of obesity): 1 = yes, 2 = no
- FCHCF (frequent consumption of high caloric food): 1 = yes, 2 = no
- FCV (frequency of consumption of vegetables): 1 = never, 2 = sometimes, 3 = always
- NMM (number of main meals): 1, 2, 3 or 4 meals a day
- CFBM (consumption of food between meals): 1 = no, 2 = sometimes, 3 = frequently, 4 = always
- Smoke: 1 = yes, 2 = no
- CW (consumption of water): 1 = less than a liter, 2 = 1–2 liters, 3 = more than 2 liters
- CCM (calorie consumption monitoring): 1 = yes, 2 = no
- PAF (physical activity frequency per week): 0 = none, 1 = 1 to 2 days, 2 = 2 to 4 days, 3 = 4 to 5 days
- TUT (time using technology devices a day): 0 = 0–2 hours, 1 = 3–5 hours, 2 = more than 5 hours
- CA (consumption of alcohol): 1 = never, 2 = sometimes, 3 = frequently, 4 = always
- Transportation: 1 = automobile, 2 = motorbike, 3 = bike, 4 = public transportation, 5 = walking
- Obesity (target variable): 2 = not obese, 4 = obese
1. Import libraries
First, I imported the libraries I would need to understand my data and train my models. Afterward, I used google.colab to upload the CSV file and then loaded the data into a data frame using pandas. The first five rows are shown through the .head() method.
2. Check shape and values
The .shape attribute returned (2111, 17) as expected, and the heatmap confirmed that there are no missing values.
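As a rough sketch of the loading and checking steps, the snippet below reads a tiny stand-in CSV from memory instead of the real file. The four rows and the variable names (`csv_text`, `df`, `missing_per_column`) are made up for illustration; the real dataset has 2111 rows and 17 columns.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the obesity CSV (illustrative values only).
csv_text = """Gender,Age,Height,Weight,Obesity
1,21,1.62,64.0,2
2,23,1.80,77.0,2
1,27,1.55,98.5,4
2,30,1.75,110.2,4
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows of the data frame (up to five)
print(df.shape)    # (number of rows, number of columns)

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

Counting missing values this way gives the same information as a missing-value heatmap, just as numbers instead of a plot.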
3. Correlation heatmap
Next, I called .corr() to get a correlation matrix, as this would help me determine whether there are any variables I should drop because they’re too correlated with the target variable, obesity. I turned the matrix into a heatmap, as it lets me quickly spot the variables that are highly correlated or uncorrelated.
The heatmap shows me that one variable, the weight variable, is highly correlated with obesity (0.79). This makes sense, as weight is an integral factor in determining a person’s BMI and if someone is obese.
4. Drop variable
Knowing that weight is so highly correlated with obesity, I need to drop the variable to make sure my model can learn from the other variables. When I ran my models with the weight feature, my ensemble models all came out with accuracies of 98.74% to 99.16% without any tuning, and even my logistic regression model had an accuracy of 97.68%. Dropping it will lower their accuracies and allow me to analyze the other features more.
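The correlation check and the drop can be sketched together. The six rows and the 0.75 cutoff below are hypothetical, chosen only so that the weight column stands out the way it does in the real heatmap.

```python
import pandas as pd

# Small made-up sample; Obesity uses the dataset's coding (2 = not obese, 4 = obese).
df = pd.DataFrame({
    "Age": [21, 23, 27, 30, 35, 40],
    "Height": [1.62, 1.80, 1.55, 1.75, 1.68, 1.71],
    "Weight": [60.0, 72.0, 95.0, 108.0, 64.0, 112.0],
    "Obesity": [2, 2, 4, 4, 2, 4],
})

# Correlation of every feature with the target (the target itself is excluded).
target_corr = df.corr()["Obesity"].drop("Obesity")

# Hypothetical cutoff: flag features whose absolute correlation exceeds it.
threshold = 0.75
too_correlated = target_corr[target_corr.abs() > threshold].index.tolist()

# Drop the flagged features so the model has to learn from the rest.
df_reduced = df.drop(columns=too_correlated)
print(too_correlated, list(df_reduced.columns))
```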
5. Dataset overview
To confirm that the variable was dropped and to check the overall makeup of my dataset, I called the four functions below for a quick overview.
6. Target variable distribution
Lastly, I created a countplot to get the distribution of the target variable so I can see if it is balanced or not to inform the training and testing sets.
This dataset is pretty balanced, as there are almost as many obese instances as not obese instances. The target variable here is clearly a case of binary classification, which will inform our future model choices.
For further data exploration, please refer to my EDA article.
1. Normalization
Before the data can be split, it should be normalized because the ranges of the dataset features are not the same. This can be problematic because features with large ranges can outweigh features with small ranges in some models, so every feature is rescaled to a uniform range of 0–1.
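For intuition, rescaling to a 0–1 range is usually done with min-max scaling, where each value becomes (value - min) / (max - min). The helper `min_max_scale` below is a hand-written illustration under that assumption; in scikit-learn the same transformation is available as `MinMaxScaler`.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range 0-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Illustrative ages spanning the dataset's 14 to 61 range.
ages = [14, 21, 37.5, 61]
scaled_ages = min_max_scale(ages)
print(scaled_ages)
```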
2. Splitting test and training data
The dataset is partitioned into training (70%) and testing (30%) sets, and the respective shapes are printed to make sure the data was split correctly before the models are built.
These numbers indicate that the training set has 1477 data points, while the testing set has 634 data points.
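A minimal sketch of the split, assuming scikit-learn's `train_test_split` and placeholder arrays with the same number of rows as the dataset. The 15 feature columns assume the 17 attributes minus the target and the dropped weight column, and `random_state=42` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the dataset's row count; the values themselves are irrelevant here.
X = np.zeros((2111, 15))
y = np.arange(2111) % 2

# 70% training, 30% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)
```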
3. Baseline classification accuracy
It’s helpful to calculate the baseline classification accuracy because it is the simplest possible prediction. This gives us a good starting point to keep in mind when creating more accurate models with the goal of achieving a better score.
We got a low baseline accuracy of 54.57%, so our subsequent models should have a higher accuracy than this.
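One common way to get a baseline, and the one assumed in this sketch, is to always predict the most frequent class. The class counts are not listed in this post, so the 346/288 split below is an illustrative assumption, chosen because 346 out of 634 test points works out to the 54.57% quoted above.

```python
from collections import Counter

# Hypothetical test labels: 346 of one class and 288 of the other (634 in total).
y_test = [0] * 346 + [1] * 288

# The baseline always predicts the majority class.
majority_label, majority_count = Counter(y_test).most_common(1)[0]
baseline_accuracy = majority_count / len(y_test)

print(f"Baseline accuracy: {baseline_accuracy:.2%}")
```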
I chose the following six models because they work well with binary classification problems, such as the obesity one this post is about.
Model #1: Logistic regression & hyperparameter finetuning
The first model I’ll train is the logistic regression model, as it is a classification algorithm that predicts the probability of a categorical variable. I instantiated it using default parameters, besides random_state, which I set so that the results are the same every time. The model is then fit with data and the prediction is made and evaluated. Afterward, the model is cross-validated through a standard 10-fold cross-validation, in which the data is split into 10 subsets and each subset is held aside in turn as a validation dataset to determine fit.
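The fit, evaluate, and cross-validate sequence can be sketched as follows. The data comes from `make_classification` as a synthetic stand-in for the obesity features, and `max_iter=1000` is my own addition to avoid convergence warnings, so the scores will not match the ones reported in this post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic two-class data standing in for the real features.
X, y = make_classification(n_samples=500, n_features=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit on the training set and score on the held-out test set.
log_reg = LogisticRegression(random_state=0, max_iter=1000)
log_reg.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, log_reg.predict(X_test))

# 10-fold cross-validation: each fold is held out once as validation data.
cv_scores = cross_val_score(log_reg, X, y, cv=10, scoring="accuracy")

print(f"Test accuracy: {test_accuracy:.4f}")
print(f"Mean CV accuracy: {cv_scores.mean():.4f}")
```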
Here, we can see that the logistic regression model is accurate 77.91% of the time, which is much lower than the 97.68% the model got when the weight feature was still present. The cross-validation score of 71.01% came out substantially lower than the model accuracy. This indicates that our model may be overfit, meaning it aligns too closely with the present data and random noise instead of the actual relationship between variables. As the difference is almost 7 percentage points, we should finetune the model’s hyperparameters and see what happens when we use GridSearchCV. GridSearchCV is an exhaustive method: it runs through every combination of the hyperparameter values it is given and cross-validates each one, which makes it thorough but slow on large grids.
Now that GridSearchCV has given us the best hyperparameters to use for C, penalty, and solver, we should use them and reevaluate the accuracy.
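A minimal sketch of a grid search over C, penalty, and solver. The candidate values in `param_grid`, the 5-fold setting, and the synthetic data are assumptions for illustration, not the grid I actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data.
X, y = make_classification(n_samples=300, n_features=15, random_state=0)

# Hypothetical grid; both solvers support the l1 and l2 penalties.
param_grid = {
    "C": [0.01, 1, 100],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear", "saga"],
}

# Every combination (3 * 2 * 2 = 12) is cross-validated.
grid_search = GridSearchCV(
    LogisticRegression(random_state=0, max_iter=5000),
    param_grid,
    cv=5,
    scoring="accuracy",
)
grid_search.fit(X, y)

print(grid_search.best_params_)
print(f"Best CV accuracy: {grid_search.best_score_:.4f}")
```

The values in `best_params_` can then be passed back into a fresh `LogisticRegression` to reevaluate the model.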
After using the suggested hyperparameters, our cross-validation score of 72% is closer to the 75.72% output from GridSearchCV, which means that our model was improved and is not as overfit as it used to be. However, I was curious about how the other popular method of tuning hyperparameters, RandomizedSearchCV, would fare, as it takes a different approach by creating a grid of hyperparameter values and choosing random combinations when training.
I inserted RandomizedSearchCV’s key hyperparameters back into the logistic regression to observe its performance.
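For comparison, a randomized search over the same kind of grid might look like the sketch below. The candidate values, `n_iter=5`, and the synthetic data are again illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data.
X, y = make_classification(n_samples=300, n_features=15, random_state=0)

# Hypothetical candidate values (12 combinations in total).
param_distributions = {
    "C": [0.01, 1, 100],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear", "saga"],
}

# Only n_iter randomly chosen combinations are tried instead of all 12.
random_search = RandomizedSearchCV(
    LogisticRegression(random_state=0, max_iter=5000),
    param_distributions,
    n_iter=5,
    cv=5,
    scoring="accuracy",
    random_state=0,
)
random_search.fit(X, y)

print(random_search.best_params_)
print(f"Best CV accuracy: {random_search.best_score_:.4f}")
```

Because only a sample of the combinations is evaluated, the search is faster but can miss the best setting, which is one possible reason for the gap between the two methods.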
RandomizedSearchCV’s hyperparameters also narrowed the gap between the search score and the cross-validation score, compared with the gap between the initial logistic regression model’s accuracy and its cross-validation score, but not by as much as GridSearchCV’s did. Therefore, we’ll take a look at the top features from the logistic regression using GridSearchCV’s hyperparameters.
This graph shows us the relative importance of different features for predicting obesity. We can see that the three most important features for this model are CFBM (consumption of food between meals), age, and family history. At face value, these variables seem to make sense in their larger influence on obesity, but we will continue to explore more models and get a better consensus.
We can also quickly create a classification report on the logistic regression model. Precision is the percentage of positive predictions that were correct, recall is the percentage of actual positive cases that were found, and the F1 score is the harmonic mean of precision and recall.
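To make the three report columns concrete, here is a hand computation on eight made-up labels, treating class 1 as the positive class. scikit-learn's `classification_report` produces these numbers for each class automatically.

```python
# Made-up true labels and predictions (1 = positive class).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

# Count the outcomes that involve the positive class.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # share of positive predictions that were correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
```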
Moving forwards, I will be using a variety of ensemble methods, which are predictive models that combine multiple models to get a final prediction.
Model #2: Bagging
Bagging, also known as bootstrap aggregation, is an averaging ensemble method that is used with decision trees and aggregates the predictions of multiple models. n_estimators, a parameter of BaggingClassifier and other similar models, can make a large difference on the accuracy of the model, and it can be difficult to determine the optimal n_estimator to use. Therefore, we can use a graph of increasing values of n_estimators and their resulting testing accuracy to help us choose a n_estimator with the highest possible accuracy.
This graph tells us that the highest accuracy occurs when the n_estimator is around 79, so that’s what we’ll use for our bagging model.
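The idea behind the n_estimators graph can be sketched as a simple loop: fit one model per candidate value and record the testing accuracy. The candidate list and the synthetic data are assumptions, and the loop collects the scores in a dictionary instead of plotting them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data.
X, y = make_classification(n_samples=400, n_features=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hypothetical candidate values; a real sweep would cover a wider range.
candidate_n = [5, 10, 20, 40]
test_scores = {}
for n in candidate_n:
    # The default base model of BaggingClassifier is a decision tree.
    model = BaggingClassifier(n_estimators=n, random_state=0)
    model.fit(X_train, y_train)
    test_scores[n] = model.score(X_test, y_test)

# Pick the candidate with the highest testing accuracy.
best_n = max(test_scores, key=test_scores.get)
print(test_scores, best_n)
```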
The bagging model gives us an accuracy of 93.06% and a cross-validation score of 92.14%, which is a huge improvement from what we got from the logistic regression model, likely because it is the product of several models that were trained individually and averaged. The difference between the two scores is also substantially smaller, indicating a better-fitted model and negating the need to finetune hyperparameters as we did with the logistic regression model. All subsequent models with a small difference (<1%) between the cross-validation and accuracy scores will not be fine-tuned.
A classification report is printed just to confirm accuracies:
The report tells us that the model scores higher on both class 0 and class 1 than the logistic regression model did in its classification report.
Model #3: Random Forest
Next, I’ll build a Random Forest model, which is another averaging method, and creates an ensemble of de-correlated decision trees.
I created another graph on n_estimators and their testing accuracies to give me an idea of what n_estimator I should use for my Random Forest model.
This graph indicates that a n_estimator of around 132 would help us achieve a high accuracy. However, the magnitude of accuracy still hinges on the other parameters I dictate or don’t dictate.
max_features is the size of the random subsets of features to consider when splitting a node. While setting it to “auto” gives me a high accuracy of 94.95% and a cross-validation score of 93.85%, I chose to set it to 6 for a similar accuracy of 94.79% and a higher cross-validation score of 94.04%. As we can see, the Random Forest model is the best-performing model yet, with high accuracy and cross-validation scores that sit very close to each other, signaling a much lower degree of overfitting.
I’m also curious about the most influential features per the Random forest model, so I created a simple bar graph to visualize this.
This shows me that age, family history, and height were the most useful features in predicting obesity through the Random Forest model, while other features such as smoking and calorie consumption monitoring were merely noise. I can also quickly pull the top three features with their feature importance values through the following code:
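A sketch of pulling the top three features from a fitted Random Forest. The feature names (`feature_0` and so on), the synthetic data, and `n_estimators=50` are placeholders; only `max_features=6` mirrors the setting described above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names.
feature_names = [f"feature_{i}" for i in range(8)]
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=3, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, max_features=6, random_state=0)
forest.fit(X, y)

# Pair each importance value with its feature name and sort from high to low.
importances = pd.Series(forest.feature_importances_, index=feature_names)
top_three = importances.sort_values(ascending=False).head(3)
print(top_three)
```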
Out-of-bag (OOB) samples are samples that are left out of a tree’s bootstrap sample. Because they were not used to train that tree, they can serve as testing samples without leakage. The resulting oob_score estimates how well the model generalizes without needing a separate validation set, so it’s helpful in validating the model. In the following code, I compare the OOB accuracy score with the Random Forest model’s accuracy.
Here, we can see that the OOB score is very close to our previously calculated testing accuracy, confirming that our model is high-performing.
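A minimal sketch of the OOB comparison, assuming scikit-learn's `RandomForestClassifier` with `oob_score=True` and synthetic data in place of the real features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data.
X, y = make_classification(n_samples=500, n_features=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# oob_score=True scores each training sample using only the trees
# whose bootstrap sample did not contain it.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

oob_accuracy = forest.oob_score_
test_accuracy = forest.score(X_test, y_test)
print(f"OOB accuracy:  {oob_accuracy:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
```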
Model #4: AdaBoost
Next, we’ll take a look at a different kind of ensemble method, the boosting method. It differs from averaging methods such as bagging, which work to reduce overall variance, in that boosting models lower overall bias by training weak learners sequentially.
I chose to build an AdaBoost model, which combines several weak learners into one strong learner and changes the weight for every incorrectly classified observation. I started with another n_estimator graph:
I specified more parameters with the AdaBoost model, such as a max_depth of 7 (which is set on the underlying decision trees rather than on AdaBoost itself) and a learning_rate of 0.5, to further increase the accuracy and cross-validation scores. I tested various values of these parameters and chose them based on their accuracy output. The model gives us a high accuracy and cross-validation score, with a negligible difference between the two.
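A sketch of how these settings fit together in scikit-learn, assuming the depth limit is applied to a `DecisionTreeClassifier` that is handed to `AdaBoostClassifier` as its base model. The data is synthetic and `n_estimators=50` is a placeholder, not the value read off my graph.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
X, y = make_classification(n_samples=500, n_features=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# The base tree is passed positionally; max_depth belongs to the tree,
# while learning_rate and n_estimators belong to AdaBoost.
ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=7),
    n_estimators=50,
    learning_rate=0.5,
    random_state=0,
)
ada.fit(X_train, y_train)

ada_accuracy = ada.score(X_test, y_test)
ada_cv_scores = cross_val_score(ada, X, y, cv=10, scoring="accuracy")
print(f"Test accuracy: {ada_accuracy:.4f}")
print(f"Mean CV accuracy: {ada_cv_scores.mean():.4f}")
```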
I’m interested in seeing how the AdaBoost model’s feature variables stack up compared to the Random Forest model, so I graphed them below.
The feature importance outputs are similar for the most important and least important variables, as height and age both show up in the top three features of both models, and smoking and calorie consumption monitoring are still the least important. However, family history was swapped out for physical activity frequency in the AdaBoost model’s top three features.
Model #5: Gradient Boosted Trees
Gradient Boosted Trees is another boosting model. However, while AdaBoost reweights observations according to their prediction accuracy, Gradient Boosted Trees attempts to fit new predictors to the residual errors from the preceding predictors.
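To illustrate what fitting to residual errors means, here is a hand-rolled loop for the simplest case, squared-error regression on a toy curve. It is a teaching sketch rather than what `GradientBoostingClassifier` does internally for classification, and the learning rate, tree depth, and number of rounds are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression problem: learn a sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])

learning_rate = 0.5
prediction = np.full_like(y, y.mean())  # start from a constant prediction
errors = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    # Each new tree is trained on what the ensemble still gets wrong.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    errors.append(np.mean((y - prediction) ** 2))

print(f"Training MSE before boosting: {errors[0]:.4f}")
print(f"Training MSE after boosting:  {errors[-1]:.4f}")
```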
As before, I first visualized the performance of the n_estimators.
The peak here seems to be around 170.
Gradient boosting performs fairly well here too, and the difference between the cross-validation and the accuracy scores is also small.
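For boosted trees, the effect of n_estimators can also be read off a single fitted model, since `staged_predict` yields the predictions after each added tree. The sketch below uses synthetic data and a placeholder of 60 trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data.
X, y = make_classification(n_samples=500, n_features=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

gbt = GradientBoostingClassifier(n_estimators=60, random_state=0)
gbt.fit(X_train, y_train)

# Testing accuracy after 1 tree, 2 trees, ..., 60 trees.
stage_accuracy = [
    accuracy_score(y_test, stage_pred) for stage_pred in gbt.staged_predict(X_test)
]

# Number of trees at which the testing accuracy peaks (1-based).
best_stage = stage_accuracy.index(max(stage_accuracy)) + 1
print(best_stage, max(stage_accuracy))
```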
I was interested in seeing the Gradient Boosted Trees model’s feature importance scores compared to those of the AdaBoost model, so I plotted the feature importance graph and the top three features below.
The Gradient Boosted Trees and AdaBoost models share age as one of the top three features, but the Gradient Boosted Trees model is also influenced by family history and the number of main meals. Smoking and calorie consumption monitoring rank low once again.
Model #6: Voting Classifier
Lastly, we’ll use a Voting Classifier model, which trains via an ensemble of selected models and balances out the weaknesses in individual classifiers. There are two types of voting: hard voting and soft voting. We’ll be using soft voting because it provides more information on probability as opposed to class labels. It involves averaging the probabilities of each class and choosing the class with the highest average as the ultimate prediction. I chose to include the four highest performing models in the ensemble: Random Forest, Bagging, AdaBoost, and Gradient Boosting.
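The difference between the two voting types shows up clearly on a single made-up prediction. The probabilities below are invented for illustration; in scikit-learn, the averaging corresponds to `VotingClassifier(voting="soft")`.

```python
# Made-up class probabilities [P(class 0), P(class 1)] for one person.
probabilities = {
    "random_forest": [0.40, 0.60],
    "bagging": [0.55, 0.45],
    "adaboost": [0.30, 0.70],
    "gradient_boosting": [0.52, 0.48],
}
n_classes = 2

# Soft voting: average each class's probability, then take the highest average.
average_proba = [
    sum(p[c] for p in probabilities.values()) / len(probabilities)
    for c in range(n_classes)
]
soft_vote = max(range(n_classes), key=lambda c: average_proba[c])

# Hard voting: each model casts one vote for its most likely class.
hard_votes = [
    max(range(n_classes), key=lambda c: p[c]) for p in probabilities.values()
]

print(average_proba, soft_vote, hard_votes)
```

Here the hard votes are tied two to two, while soft voting settles on class 1 because the models voting for it are more confident.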
As expected, the Voting Classifier model returned the highest accuracy and cross-validation scores out of all of the models, although it did not outperform the AdaBoost model by much. This may be because soft voting considers actual probability, takes into account each classifier’s uncertainty, and gives more weight to highly confident votes. The Voting Classifier model is the most promising one due to its high accuracy, the small difference between the cross-validation and accuracy scores, its lower variance, and the fact that its final predictions are pulled from the averaged probabilities of the contributing models, which were already all high-performing.
There was a significant improvement from the baseline accuracy of 54.57% to the voting ensemble’s accuracy of 95.27%, which was achieved through the aggregation of different models and tuning various parameters, such as n_estimators, max_depth, and learning_rate. While the accuracy is high, the model is not substantially overfitted, as the cross-validation scores for each of the ensemble methods all differed from the model accuracy by less than 1%.
The feature importance calculations show us that age, family history, and height had the largest influence on the models above. Other factors that had a large influence on one of the above models were the consumption of food between meals, physical activity frequency, and the number of main meals. Factors that were not impactful across the board were smoking, calorie consumption monitoring, gender, and transportation. Given the model’s high accuracy, anyone interested in the factors that influence obesity for their own health reasons or other uses can refer to the aforementioned prominent features.