Cluster Analysis: Obesity

By Dante Caraballo and ChatGPT

INST414: Data Science Techniques
Apr 15, 2024


Question and Stakeholder

Question: How can healthcare providers identify distinct groups within a population based on obesity-related factors to tailor intervention programs more effectively?

Stakeholders: A public health organization aiming to deploy targeted health intervention programs designed to address different factors contributing to obesity. Healthcare providers seek to offer personalized health and nutritional advice to their patients based on their specific risk factors and habits related to obesity.

The Data

The original “Obesity Risk” CSV data was found on Kaggle. It contained the following columns:

Age: Age of the individual.

Height: Height in meters.

Weight: Weight in kilograms.

Gender: Gender of the individual.

Family History with Overweight: Binary indicator of a family history of overweight.

FAVC: Frequent consumption of high-caloric food (Yes/No).

FCVC: Frequency of consumption of vegetables (Low/Medium/High).

NCP: Number of main meals (1–4).

CAEC: Consumption of food between meals (Never, Sometimes, Frequently, Always).

SMOKE: Smoking status (Yes/No).

CH2O: Daily water drinking amount (Low/Medium/High).

SCC: Calories consumption monitoring (Yes/No).

FAF: Physical activity frequency (None, Low, Moderate, High).

TUE: Time using technology devices (Low/Medium/High).

MTRANS: Main transportation mode (Automobile, Bike, Motorbike, Public Transportation, Walking).

Relevance: This data can reveal clusters representing different risk profiles related to obesity, guiding public health organizations in developing targeted interventions. Additionally, the data collectively helps identify lifestyle and health profiles, guiding interventions at both the population and individual levels.

View the data here

Data Cleanup

Before diving into the clustering analysis, it was crucial to prepare the data to ensure that it accurately reflected the population’s health behaviors and outcomes. My data preprocessing involved several key steps. First, I handled missing values, which can skew the analysis if left untreated. For categorical variables, I imputed missing values with the mode, while for continuous variables I used the median, which is less sensitive to outliers than the mean and so better preserves the shape of the distribution.
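The imputation step described above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked repository: the toy DataFrame, its values, and the helper name impute_missing are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Kaggle CSV (all values are made up).
df = pd.DataFrame({
    "Age": [21.0, np.nan, 35.0, 28.0],
    "Weight": [64.0, 80.5, np.nan, 95.0],
    "Gender": ["Female", "Male", None, "Male"],
    "SMOKE": ["no", None, "no", "yes"],
})


def impute_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: median for numeric columns, mode for the rest."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Continuous variable: fill gaps with the column median.
            out[col] = out[col].fillna(out[col].median())
        else:
            # Categorical variable: fill gaps with the most frequent value.
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


clean = impute_missing(df)
print(clean)
```

Working on a copy keeps the raw data available for comparison, which helps when checking how much imputation changed each column.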

Next, I encoded categorical variables to transform them into a format suitable for analysis. Variables like Gender, FAVC (Frequent consumption of high-caloric food), CAEC (Consumption of food between meals), and MTRANS (Main transportation mode) were converted using one-hot encoding. This approach transforms categorical data into binary vectors (0’s or 1’s), ensuring that our model can interpret these features correctly.
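As a rough sketch of the one-hot encoding step, pandas.get_dummies can expand each categorical column into 0/1 indicator columns. The rows and category spellings below are made up for illustration and may not match the labels in the actual CSV.

```python
import pandas as pd

# Toy rows; the category labels are assumptions, not copied from the dataset.
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Male"],
    "FAVC": ["yes", "no", "yes"],
    "CAEC": ["Sometimes", "Frequently", "no"],
    "MTRANS": ["Walking", "Automobile", "Public_Transportation"],
    "Age": [21, 35, 28],
})

categorical_cols = ["Gender", "FAVC", "CAEC", "MTRANS"]

# Each listed column is replaced by one 0/1 column per category value.
encoded = pd.get_dummies(df, columns=categorical_cols, dtype=int)
print(encoded.columns.tolist())
```

Columns not listed in categorical_cols (here Age) pass through unchanged, so the numeric features can still be scaled afterwards.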

Numerical variables, including Age, Height, and Weight, underwent scaling to normalize their ranges. This step is vital because it ensures that no single feature dominates the clustering process due to differences in scale. I used StandardScaler from the sklearn library, which standardizes features by removing the mean and scaling to unit variance.
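To make the scaling step concrete, here is a small sketch using StandardScaler on made-up Age, Height, and Weight values. It also recomputes the result by hand as z = (x - mean) / std to show what the scaler does.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: Age (years), Height (m), Weight (kg). Values are made up.
X = np.array([
    [21.0, 1.62, 64.0],
    [35.0, 1.80, 95.0],
    [28.0, 1.70, 80.5],
    [44.0, 1.75, 110.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Same transformation by hand: subtract the column mean, divide by the
# column standard deviation (population form, ddof=0).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.round(3))
```

Without this step, Weight (tens of kilograms) would contribute far more to the Euclidean distances used by K-Means than Height (fractions of a meter).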

Analysis Process

Data Preprocessing: Categorical variables such as Gender, FAVC, CAEC, and MTRANS were encoded. Numerical variables like Age, Height, and Weight were scaled.

Clustering: I applied K-Means clustering and identified 5 distinct clusters within the population based on lifestyle, dietary habits, and other obesity-related factors. I chose the number of clusters (k) empirically, settling on five because it provided the most insightful grouping.
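The post does not say which criterion was used to pick k, so the following is only one common way such an empirical comparison might look: fit K-Means for a range of k values and compare inertia and silhouette score. The data here is synthetic (five well-separated blobs), not the obesity dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: five well-separated groups in 2D.
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [5, 5]])
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.6,
                  random_state=0)

scores = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(X)
    # Inertia: within-cluster sum of squared distances (lower is tighter).
    # Silhouette: separation quality in [-1, 1] (higher is better).
    scores[k] = (model.inertia_, silhouette_score(X, labels))

for k, (inertia, sil) in scores.items():
    print(f"k={k}: inertia={inertia:.1f}, silhouette={sil:.3f}")

best_k = max(scores, key=lambda k: scores[k][1])
print("k with highest silhouette:", best_k)
```

On real survey data the scores rarely point to a single clear answer, so interpretability of the resulting groups is a reasonable tie-breaker.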

The algorithm iterates through two main steps, assigning data points to the nearest cluster center (centroid) and then updating the centroids based on the mean of the points within each cluster. This process repeats until the algorithm reaches a solution where the centroids no longer significantly move, indicating that the clusters are as compact and distinct as possible.
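The two alternating steps can be written out in a few lines of NumPy. This is a teaching sketch with random initialization, not scikit-learn's implementation (which uses k-means++ seeding and other refinements); the function name and parameters are illustrative.

```python
import numpy as np


def kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    """Minimal K-Means loop for illustration."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:  # Centroids have effectively stopped moving.
            break
    return labels, centroids


# Two tight, well-separated groups of made-up points.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(20, 2)),
])
labels, centroids = kmeans(X, k=2)
print(centroids.round(2))
```

Because the result depends on the starting centroids, production code usually runs several initializations and keeps the best one, which is what the n_init parameter of scikit-learn's KMeans controls.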

Visualizations

I employed PCA for dimensionality reduction to visualize the clusters in a 2D space, making it easier to see how individuals are grouped based on their features. Additionally, I examined Physical Activity Level and Vegetable consumption across all clusters and found that Cluster 3 appeared to be the most physically fit of the five.
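A sketch of the PCA projection described above, under the assumption that the features have already been encoded and scaled. The data is synthetic, and plot_clusters is a hypothetical helper name; matplotlib is imported inside it so the rest of the snippet runs without a plotting backend.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for the preprocessed feature matrix (8 features).
X, _ = make_blobs(n_samples=200, centers=5, n_features=8, random_state=0)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Project onto the two directions of greatest variance.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)
print("Variance explained by the 2 components:",
      pca.explained_variance_ratio_.sum().round(3))


def plot_clusters(coords, labels):
    """Hypothetical helper: scatter the 2D projection, colored by cluster."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    points = ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=15)
    ax.set_xlabel("Principal component 1")
    ax.set_ylabel("Principal component 2")
    ax.legend(*points.legend_elements(), title="Cluster")
    return fig
```

Calling plot_clusters(coords, labels) draws the figure. The explained variance ratio is worth reporting alongside the plot, since a low value means the 2D picture hides much of the structure the clustering actually used.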

Cluster Interpretation

Cluster 0: Young Adults with Moderate Physical Activity — Predominantly composed of individuals engaging in regular physical activity, with moderate consumption of high-caloric food.

Cluster 1: Older Adults with Low Vegetable Intake — Made up of older individuals with low vegetable consumption, highlighting a potential target for dietary interventions.

Cluster 2: Tech-Engaged Sedentary Individuals — Consists of people spending significant time using technology with low physical activity, suggesting the need for promoting more active lifestyles.

Cluster 3: Active Individuals with Healthy Eating Habits — This represents a group with high physical activity and healthy eating habits, serving as a positive model.

Cluster 4: Individuals with High Caloric Food Consumption — Characterized by frequent consumption of high-caloric food, indicating a group that might benefit from nutritional education and support.
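One way to arrive at labels like the ones above is to compare per-cluster averages of a few features. The sketch below uses made-up rows and assumes FAF, FCVC, and TUE are coded numerically (higher means more); it does not reproduce the actual cluster statistics from this analysis.

```python
import pandas as pd

# Made-up rows with an already-assigned cluster label per person.
df = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, 2, 3, 3],
    "FAF":     [1, 2, 0, 1, 0, 0, 3, 3],   # physical activity frequency
    "FCVC":    [2, 2, 1, 1, 2, 1, 3, 3],   # vegetable consumption
    "TUE":     [1, 0, 0, 1, 2, 2, 0, 1],   # time using technology
})

# Mean of each feature within each cluster.
profile = df.groupby("cluster").mean()
print(profile)

most_active = profile["FAF"].idxmax()
most_screen_time = profile["TUE"].idxmax()
print("Most active cluster:", most_active)
print("Most screen time:", most_screen_time)
```

Reading the profile table row by row is what turns anonymous cluster numbers into descriptive names.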

Limitations

Data Representation: The dataset may not fully represent the broader population due to sampling biases. The analysis may be biased towards individuals with access to healthcare or online surveys, potentially underrepresenting certain demographics. Additionally, self-reported data can introduce bias in reporting habits and behaviors.

Behavioral Complexity: Simplifying complex behaviors into a few categories might overlook nuanced factors contributing to obesity. Respondents may also underestimate or overestimate their own habits in several of these categories, which could skew the data and make it less representative of the population's actual habits.

This scenario provides a comprehensive overview of how data analysis and clustering can inform targeted interventions in a healthcare context, specifically addressing obesity.

Conclusion and Recommendations

The analysis identifies distinct risk profiles within the population, offering valuable insights for tailoring intervention programs. For example, Cluster 2 (tech-engaged, sedentary individuals) suggests a focus on reducing sedentary behavior, while Cluster 4 (high caloric food consumption) highlights the need for dietary interventions. These findings can inform public health strategies aimed at combating obesity through targeted, evidence-based interventions.

Furthermore, this analysis underlines the importance of personalized intervention strategies. While Cluster 2 might benefit from programs aimed at reducing screen time and increasing physical activity, Cluster 4’s needs might be more effectively addressed through nutritional education focusing on the long-term impacts of high-caloric food consumption.

Public health organizations can leverage these insights to deploy more targeted and effective interventions. By understanding the specific needs and behaviors of each cluster, they can design interventions that resonate more deeply with individuals, encouraging healthier lifestyle choices and ultimately reducing obesity rates.

View Github Code
