Clustering Fast Food Items by Nutritional Value

Pranav Vijay
INST414: Data Science Techniques
6 min read · May 16, 2024

Question: How can I choose healthier meals if my only options for restaurants are fast food restaurants?

The main stakeholder is a health-conscious fast food consumer from a lower-to-middle income household. Awareness of the importance of eating healthy is growing. Giving customers more information about the food they consume can help them lead a healthier lifestyle and may reduce their healthcare expenses in the future. As the saying goes, prevention is better than cure.

Once this question is answered, the stakeholder can decide what to order at a fast food restaurant by choosing items from the cluster whose nutritional profile best matches their priorities.

The ideal data for answering this question is a dataset of every food item on the menus of all the popular fast food restaurants. The most important data fields would contain calories and other relevant nutritional information. The dataset would also record which food items are ordered the most. This data is relevant to my question because it has nutritional information about each food item, which lets me create clusters of similar food items based on their nutritional value.

I collected a subset of this data from Kaggle, a website that hosts many downloadable datasets. The data covers nutritional information about food items from McDonald’s and Burger King. The first dataset, “Nutrition Facts for McDonald’s Menu,” is by McDonald’s and Abigail Larion. The second dataset, “Burger King Menu Nutrition Data,” is by Matt Op. Neither dataset contains information about which items are ordered the most. The Burger King dataset also has less nutritional information than the McDonald’s dataset, since it lacks columns for the calcium, iron, Vitamin A, and Vitamin C amounts in the food items.

I analyzed the CSV files and created the clusters in a Jupyter notebook with a Python kernel. I downloaded the CSV files and read each one into a Pandas dataframe using the read_csv() function. I imported the KMeans class from the sklearn.cluster module to run K-means clustering and group similar data points in my dataframe. I imported the matplotlib.pyplot module to plot the Elbow Method graph, which helped me choose a value for k, the number of clusters for my dataset.
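As a rough illustration of the loading step, here is a minimal sketch. It reads a tiny in-memory CSV so that it runs on its own; the rows are made up, and the column names are assumptions based on the fields discussed in this post. With a real download, the file path would be passed to read_csv() instead.

```python
import io

import pandas as pd

# Made-up stand-in for a downloaded CSV file. With a real file, pass its
# path to pd.read_csv() instead (for example "menu.csv", a hypothetical name).
fake_csv = io.StringIO(
    "Category,Item,Calories,Total Fat,Cholesterol,Sodium,Carbohydrates\n"
    "Burgers,Burger A,500,25,80,900,45\n"
    "Beverages,Soda A,150,0,0,10,40\n"
)

menu = pd.read_csv(fake_csv)

# Quick checks after loading: size, column names, and count of missing values.
print(menu.shape)
print(menu.columns.tolist())
print(menu.isna().sum().sum())
```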

I started cleaning my data by checking both datasets for missing values, and I found none. For the analysis, I decided not to include foods in the categories of “Coffee & Tea,” “Smoothies & Shakes,” “Beverages,” “Snacks & Sides,” and “Desserts” since these are only side orders, and I wanted to focus on meal items. I also left out the “Breakfast” category since I wanted to focus on meals eaten during lunch and dinner. I created one dataframe for the McDonald’s food items and another for the Burger King food items, and I removed the rows that had any of the above values in the “Category” column.

I also removed the columns that my analysis did not need from both dataframes. In my McDonald’s dataframe, I removed the “Serving Size,” “Calories from Fat,” “Total Fat (% Daily Value),” “Saturated Fat,” “Saturated Fat (% Daily Value),” “Trans Fat,” “Cholesterol (% Daily Value),” “Sodium (% Daily Value),” “Carbohydrates (% Daily Value),” “Dietary Fiber,” “Dietary Fiber (% Daily Value),” “Sugars,” “Protein,” “Vitamin A (% Daily Value),” “Vitamin C (% Daily Value),” “Calcium (% Daily Value),” and “Iron (% Daily Value)” columns. In my Burger King dataframe, I removed the “Fat Calories,” “Saturated Fat (g),” “Trans Fat (g),” “Weight Watchers,” “Protein (g),” “Sugars (g),” and “Dietary Fiber (g)” columns.

I added a column called “Restaurants” so that I could tell which restaurant makes each food item. I renamed the columns in my Burger King dataframe to match the columns in my McDonald’s dataframe so that the two would be easier to combine. With the same columns in both dataframes, I combined them into one new dataframe.
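The cleaning steps above can be sketched roughly as follows. This is not the notebook's actual code: the rows are made up, only a few columns are shown, and the Burger King column name "Sodium (mg)" is an assumption used to demonstrate the renaming step.

```python
import pandas as pd

# Tiny made-up frames standing in for the two datasets.
mcd = pd.DataFrame({
    "Category": ["Beef & Pork", "Beverages", "Breakfast"],
    "Item": ["Burger A", "Soda A", "Muffin A"],
    "Calories": [500, 150, 300],
    "Sodium": [900, 10, 700],
    "Serving Size": ["7 oz", "16 fl oz", "5 oz"],
})
bk = pd.DataFrame({
    "Category": ["Burgers", "Beverages"],
    "Item": ["Burger B", "Soda B"],
    "Calories": [650, 200],
    "Sodium (mg)": [1100, 20],  # assumed column name
    "Weight Watchers": [18, 6],
})

# Confirm there are no missing values before filtering.
assert mcd.isna().sum().sum() == 0 and bk.isna().sum().sum() == 0

# Categories left out of the analysis (side orders and breakfast).
excluded = ["Coffee & Tea", "Smoothies & Shakes", "Beverages",
            "Snacks & Sides", "Desserts", "Breakfast"]

# Drop excluded rows and unneeded columns, then tag each restaurant.
mcd_clean = (mcd[~mcd["Category"].isin(excluded)]
             .drop(columns=["Serving Size"])
             .assign(Restaurants="McDonald's"))
bk_clean = (bk[~bk["Category"].isin(excluded)]
            .drop(columns=["Weight Watchers"])
            .rename(columns={"Sodium (mg)": "Sodium"})
            .assign(Restaurants="Burger King"))

# Stack the two frames now that their columns match.
combined = pd.concat([mcd_clean, bk_clean], ignore_index=True)
print(combined)
```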

My project builds upon clustering methods from Module 4 in the course. I measured similarity between data points with K-means clustering, which uses Euclidean distance as its similarity metric, so items whose nutritional values are close together end up in the same cluster. I selected a value for k, the number of clusters, by using the Elbow Method. In the Elbow Method graph I generated from my combined dataframe, the curve starts to level off at k=4, so I used 4 clusters for my dataset.
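For readers unfamiliar with the Elbow Method, here is a minimal sketch of the idea on synthetic data (four made-up, well-separated groups, not the menu data). It fits K-means for several values of k and records the inertia, which is the sum of squared Euclidean distances from each point to its cluster center. The range of k values and the random seeds are arbitrary choices for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: four made-up groups of 25 points each.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(25, 2)) for c in centers])

# Fit K-means for each candidate k and record the inertia.
ks = list(range(1, 9))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# Plot inertia against k; the "elbow" is where the curve starts to flatten.
fig, ax = plt.subplots()
ax.plot(ks, inertias, marker="o")
ax.set_xlabel("Number of clusters (k)")
ax.set_ylabel("Inertia")
ax.set_title("Elbow Method")
```

On this synthetic data the inertia drops sharply up to k=4 and changes little afterward. In a notebook, plt.show() would display the figure.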

To answer my motivating question, I used K-means clustering. The features of my dataset that I focused on to measure similarity were calories, total fat, cholesterol, sodium, and carbohydrates.
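A minimal sketch of this clustering step, assuming the five feature columns are named as below and using made-up rows in place of the combined dataframe:

```python
import pandas as pd
from sklearn.cluster import KMeans

features = ["Calories", "Total Fat", "Cholesterol", "Sodium", "Carbohydrates"]

# Made-up items: four loose pairs with similar nutritional values.
items = pd.DataFrame({
    "Item": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "Calories": [1000, 1050, 700, 720, 450, 470, 200, 220],
    "Total Fat": [60, 62, 40, 41, 22, 24, 8, 9],
    "Cholesterol": [150, 155, 100, 105, 60, 65, 25, 30],
    "Sodium": [2000, 2050, 1400, 1420, 900, 920, 400, 420],
    "Carbohydrates": [80, 82, 55, 57, 40, 42, 20, 22],
})

# Fit K-means with k=4 and store each item's cluster label.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
items["Cluster"] = kmeans.fit_predict(items[features])
print(items[["Item", "Cluster"]])
```

The cluster numbers that K-means assigns are arbitrary labels, so they carry no ranking on their own. Since Euclidean distance is sensitive to scale, features with larger numeric ranges (such as sodium in milligrams) weigh more in the grouping unless the features are standardized first; this post does not say whether that was done.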

Here are the tables for each of the four clusters with the averages for each feature:

The first cluster represents items that have the highest average calories, sodium, cholesterol, total fat, and carbohydrates. The second cluster represents items that have the third highest averages for these features. The third cluster represents items that have the second highest averages. The fourth cluster represents items that have the lowest average calories, sodium, cholesterol, total fat, and carbohydrates.
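Per-cluster averages like the ones described above can be computed with a groupby. The sketch below uses made-up numbers and only two of the five features, and the "Cluster" column name is an assumption.

```python
import pandas as pd

# Made-up items that already carry a cluster label.
df = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 2, 2, 3, 3],
    "Calories": [1000, 1100, 400, 500, 700, 800, 200, 300],
    "Sodium": [2000, 2200, 900, 1000, 1400, 1500, 400, 500],
})

# Average each feature within a cluster, then rank clusters by calories.
summary = (df.groupby("Cluster")[["Calories", "Sodium"]]
           .mean()
           .sort_values("Calories", ascending=False))
print(summary)

# Clusters with the lowest and highest average calories.
lowest = summary["Calories"].idxmin()
highest = summary["Calories"].idxmax()
print(lowest, highest)
```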

Here is a table for each of the four clusters with at most five items listed for each cluster:
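One way to list at most five items per cluster is groupby with head(5). This is a sketch on made-up items with assumed column names, not the notebook's code.

```python
import pandas as pd

# Made-up items: cluster 0 has six items, cluster 1 has two.
df = pd.DataFrame({
    "Item": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "Cluster": [0, 0, 0, 1, 0, 0, 1, 0],
})

# Keep the first five rows of each cluster, in cluster order.
sample = df.sort_values("Cluster", kind="stable").groupby("Cluster").head(5)
print(sample)
```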

Here is a figure of the Elbow Method graph that I generated to find the k value:

If the stakeholder is looking for items with the lowest carbohydrates, sodium, calories, total fat, and cholesterol, they should order items from the fourth cluster. If the stakeholder wants to avoid items with the highest carbohydrates, sodium, calories, total fat, and cholesterol, they should avoid items from the first cluster.

There are some limitations to my analysis. I did not analyze breakfast items. The stakeholder may eat breakfast at fast food restaurants and might have benefitted from seeing clusters of breakfast meals as well. I also only focused on food from McDonald’s and Burger King. The stakeholder may eat more often at other fast food restaurants, like Chick-fil-A or Wendy’s. Finally, the analysis may be biased toward the nutritional features I chose: calories, total fat, cholesterol, sodium, and carbohydrates. The stakeholder may care about other nutritional features, like dietary fiber or sugars.

Here is a link to my GitHub repository, which contains the Jupyter notebook I used to run K-means clustering on the datasets and group the food items from McDonald’s and Burger King into clusters. The repository also contains the original CSV files of the nutrition facts for the McDonald’s and Burger King food items.

Link: https://github.com/pvijay2024/final

Here is a link to the original Kaggle dataset I used to get nutrition data about food items from McDonald’s:

Link: https://www.kaggle.com/datasets/mcdonalds/nutrition-facts

Here is a link to the original Kaggle dataset I used to get nutrition data about food items from Burger King:

Link: https://www.kaggle.com/datasets/mattop/burger-king-menu-nutrition-data
