Fast food item nutrition clusters

Calvin Chu
INST414: Data Science Techniques
Dec 8, 2023

Insight:

Fast food has become an icon for Americans of all ages across the country. However, nutrition is one of the biggest issues with fast food chain restaurants. Many people who order off the menu may be unaware of the nutritional content of their food and may put themselves at risk of future illness or health problems. So for my insight, I want to calculate the similarity of a variety of fast food items and cluster them based on their nutritional content, mainly calories. This could help Americans find new food items with better nutrition than their original choices and introduce them to items that are both delicious and healthy. It would also help fitness experts and nutritionists examine which fast food items are acceptable for people to eat based on their dietary restrictions or health plans.

Data Collection/Similarity

I collected my data from Kaggle. It has 515 rows and 17 columns, including calories, fats, protein, carbs, sat fat, trans fat, cholesterol, sugar, fiber, sodium, vitamin A, vitamin C, calcium, item name, restaurant, and salad. For similarity, the features I am using are the nutritional values: calories, fats, protein, carbs, sat fat, trans fat, cholesterol, sugar, fiber, sodium, vitamin A, vitamin C, and calcium. For the similarity metric, I decided to use Euclidean distance, which is the distance KMeans uses, to determine similarity and clustering.
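To make the similarity metric concrete, here is a minimal sketch of Euclidean distance between two items. The two nutrition vectors and the helper name `euclidean_distance` are hypothetical and made up for illustration; they are not rows from the Kaggle file.

```python
import numpy as np

# Hypothetical nutrition vectors: [calories, protein (g), sodium (mg)].
# These numbers are invented for illustration, not taken from the dataset.
burger = np.array([540.0, 25.0, 1040.0])
nuggets = np.array([440.0, 24.0, 840.0])

def euclidean_distance(a, b):
    """Square root of the summed squared differences between two vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

distance = euclidean_distance(burger, nuggets)
print(round(distance, 2))
```

Because the distance adds up raw differences, columns with large values, such as calories and sodium, contribute the most to how similar two items look.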

K value selection

For my K value, I wanted the clustering to be short and simple to analyze, since the data is only about 500 rows. Thus, I decided on a K value of 3, which separates the items into three clusters based on their nutritional values. This K value makes it easy to see which items have the highest or lowest nutritional value without having several clusters to search through.
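As a rough illustration of what K = 3 does, the sketch below fits KMeans on a small synthetic matrix and counts the items in each cluster. The toy data, the `random_state` value, and the variable names are assumptions added for the example, not part of the original analysis.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

# Synthetic stand-in for the nutrition matrix: three loose groups of
# [calories, protein] values. Made-up data, not the Kaggle file.
rng = np.random.default_rng(0)
toy = np.vstack([
    rng.normal(loc=[300, 15], scale=10, size=(20, 2)),
    rng.normal(loc=[600, 30], scale=10, size=(20, 2)),
    rng.normal(loc=[1000, 50], scale=10, size=(20, 2)),
])

# K = 3, as chosen in the post. random_state is an added assumption so
# that repeated runs give the same labels.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(toy)

# Count how many items landed in each cluster.
cluster_sizes = Counter(labels.tolist())
print(cluster_sizes)
```

The cluster numbers themselves (0, 1, 2) are arbitrary labels, so what each one represents has to be read from the items inside it.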

Cluster representation:

I believe each cluster represents a different type of fast food item. Cluster 1 represents the specialty burgers and sandwiches, such as the smokehouse burgers, the Quarter Pounder, and many more. These items have an average amount of nutritional value compared to the rest of the clusters. Cluster 0 represents the classic original items, such as the classic cheeseburger, chicken sandwich, double cheeseburger, and more. This group has simple items with above-average nutritional value. Cluster 2 represents the fried chicken items, such as 20-piece nuggets, chicken tenders, fried chicken, and other chicken items. These are mainly fried items, which means cluster 2 has the lowest nutritional value and the lowest chance of being healthy based on the nutrition.

Data cleaning/software

For this data analysis, I imported KMeans from sklearn.cluster for the Euclidean distance similarity and for constructing the clusters. Then I imported Counter from collections and used the labels from KMeans to build the dataframe for displaying the clusters. I also imported matplotlib.pyplot to create graphs that display the clusters and compare them by average calorie count. For cleaning, I dropped the columns that are irrelevant to the nutritional values, such as the restaurant name and the salad option (boolean). I also filled any null values in the dataframe with 0 and set the index to the fast food item name. After creating the dataframe with the clusters, I merged it with the nutritional value matrix by item name and grouped it by cluster for data visualization.

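# imports needed for the analysis
import pandas as pd
from collections import Counter
from sklearn.cluster import KMeans

# setting up the data from file
df = pd.read_csv("fastfood_calories.csv")
df.drop(columns=['Unnamed: 0', 'restaurant', 'salad'], inplace=True)
df.set_index('item', inplace=True)
df.fillna(0, inplace=True)

# `label` was not defined in the original snippet; this KMeans fit with
# K = 3 is an assumed reconstruction based on the text above
label = KMeans(n_clusters=3).fit_predict(df)

# using the collections package and the labels from KMeans
power = Counter(label)  # number of items in each cluster
create = pd.DataFrame(label, columns=['cluster'])
create = create.set_index(df.index)
combine = pd.merge(create, df, on='item')

Figure: Final product of the dataframe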

Table/finding

Figure: Bar graph of the clusters and their average calories
Figure: Scatterplot of the clusters and their calories
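
The two graphs above could be produced along the following lines. This is only a sketch: the small `combine` dataframe is made up to stand in for the merged data, and the axis choices (cluster on the x-axis, calories on the y-axis) are my assumption about how the plots were laid out.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical merged dataframe: item index, cluster label, calories.
# Values are invented for illustration, not taken from the dataset.
combine = pd.DataFrame(
    {"cluster": [0, 0, 1, 1, 2, 2],
     "calories": [300, 340, 600, 640, 900, 1000]},
    index=pd.Index(["a", "b", "c", "d", "e", "f"], name="item"),
)

# Average calories per cluster, the quantity shown in the bar graph.
avg_calories = combine.groupby("cluster")["calories"].mean()

fig, (ax_bar, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Bar graph: one bar per cluster.
ax_bar.bar(avg_calories.index.astype(str), avg_calories.values)
ax_bar.set_xlabel("cluster")
ax_bar.set_ylabel("average calories")

# Scatterplot: one point per item, colored by cluster.
ax_scatter.scatter(combine["cluster"], combine["calories"], c=combine["cluster"])
ax_scatter.set_xlabel("cluster")
ax_scatter.set_ylabel("calories")

fig.tight_layout()
# Call fig.savefig("clusters.png") or plt.show() to view the result.
```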

Limitation

I believe this data is biased because most items listed in the dataset are from McDonald's, and it doesn't include all fast food items from all fast food restaurants. This affects the results: most of the items with higher or lower nutritional value are from McDonald's, which gives other restaurants less opportunity to be included in the clustering. Another limitation we need to address is that the clusters are based on nutritional values and not on ingredients and preparation, which could further the analysis of nutritional value.

Github link:
