Network Analysis of Food Ingredients: A Deep Dive into Grocery Layout Optimization

David Chun
INST414: Data Science Techniques
Oct 31, 2023

Everyone who enjoys cooking their own meals has faced the challenge of finding ingredients scattered throughout the grocery store. If you’re the store manager, you’re probably aware of this issue and want to arrange items in a way that maximizes convenience for shoppers and, in turn, profits. To address this challenge, I used centrality metrics, such as eigenvector centrality, to identify the most important and common ingredients in foods. With this information, managers could prioritize stocking these common ingredients and placing them in convenient locations.

Data Collection and Cleaning Methodology:

I sourced the data for the network from the USDA food API, which lists the ingredients of common foods. I compiled a list of approximately 75 common menu items. The number of API calls was limited, which required a pause of about 10 seconds between each one. Gathering data beyond these 75 items would have been excessively time-consuming, considering I waited nearly 10 minutes to retrieve the JSON file with the ingredients.
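The collection loop described above can be sketched roughly as follows. This is a minimal illustration, not the code from the project: the endpoint shown is the FoodData Central search endpoint, and the `API_KEY` placeholder, the helper names `fetch_ingredients` and `collect`, and the choice to take only the first search hit are all assumptions.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed endpoint (FoodData Central search); the key is a hypothetical placeholder.
SEARCH_URL = "https://api.nal.usda.gov/fdc/v1/foods/search"
API_KEY = "YOUR_API_KEY"


def fetch_ingredients(food):
    """Return the raw ingredient string of the first search hit, or None."""
    query = urllib.parse.urlencode({"query": food, "pageSize": 1, "api_key": API_KEY})
    with urllib.request.urlopen(f"{SEARCH_URL}?{query}", timeout=30) as response:
        payload = json.load(response)
    hits = payload.get("foods", [])
    return hits[0].get("ingredients") if hits else None


def collect(foods, fetch=fetch_ingredients, pause_seconds=10, sleep=time.sleep):
    """Query each food in turn, pausing between calls to respect the rate limit."""
    results = {}
    for index, food in enumerate(foods):
        if index > 0:
            sleep(pause_seconds)  # wait before every call except the first
        results[food] = fetch(food)
    return results
```

Passing `fetch` and `sleep` as parameters is a design choice for this sketch: it lets the loop be exercised with a stub instead of real network calls and real waiting.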

Upon obtaining the data, I encountered an issue: instead of the JSON values being lists, each food’s ingredients were consolidated into one extensive string. Additionally, numerous entries were variations of each other, like “Mozzarella cheese milk” and “Asiago cheese milk.”

To address this, I transformed the long strings into lists by splitting the ingredients on commas and trimming any excess white space. After converting the strings to lists, I removed words within parentheses, non-letter characters, and spaces to somewhat standardize the ingredients. Yet this wasn’t sufficient for network analysis, so I turned to two tools that were new to me: fuzzy matching and spaCy.
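A minimal sketch of this cleaning step, using only Python's `re` module. The function name `clean_ingredients` is hypothetical, lowercasing is an added assumption, and the sketch strips parenthetical notes before splitting (rather than after) so that commas inside parentheses do not create extra entries.

```python
import re


def clean_ingredients(raw):
    """Turn one comma-separated ingredient string into a list of normalized names."""
    # Drop simple (non-nested) parenthetical notes first.
    without_parens = re.sub(r"\([^)]*\)", "", raw)
    cleaned = []
    for part in without_parens.split(","):
        # Keep letters only: this removes digits, punctuation, and spaces.
        name = re.sub(r"[^a-z]", "", part.lower())
        if name:  # skip entries that end up empty
            cleaned.append(name)
    return cleaned


print(clean_ingredients("Mozzarella Cheese (milk, salt), Tomato Paste, Salt. "))
```

Because spaces are removed, multi-word ingredients collapse into single tokens such as `tomatopaste`, which is consistent with forms like “cheesemilk” appearing later in the pipeline.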

I first employed fuzzy matching to consolidate ingredients with slight spelling differences (e.g., “carrots” vs. “carrot”). The more challenging task was to consolidate ingredients that were variations of each other. Here, the spaCy library was invaluable. Using lemmatization, I reduced words to their base form, focusing on nouns. Consequently, ingredients like “Asiago cheesemilk” and “Mozzarella cheesemilk” were simplified to “cheesemilk.”
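The fuzzy-matching idea can be illustrated with the standard library's `difflib`, used here as a stand-in because the article does not name the fuzzy-matching library. The function name `consolidate` and the 0.85 similarity cutoff are assumptions for this sketch, and the spaCy lemmatization step is not reproduced.

```python
import difflib


def consolidate(names, cutoff=0.85):
    """Map each name to the first previously seen name that is a close match."""
    canonical = []  # representative spellings, in order of first appearance
    mapping = {}
    for name in names:
        if name in mapping:
            continue  # already handled this exact spelling
        match = difflib.get_close_matches(name, canonical, n=1, cutoff=cutoff)
        if match:
            mapping[name] = match[0]  # fold into the existing spelling
        else:
            canonical.append(name)  # first time we see anything like it
            mapping[name] = name
    return mapping


print(consolidate(["carrot", "carrots", "salt", "onion", "onions"]))
```

The cutoff controls how aggressive the merging is: a lower value folds more variants together but raises the risk of merging ingredients that are genuinely different.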

With the data finally cleaned, I constructed the network for analysis, representing ingredients as nodes and foods as edges: two ingredients are linked when they appear in the same food.
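One way to picture this construction is as a co-occurrence graph stored in a plain dictionary. The sketch below assumes that "foods as edges" means every pair of ingredients in the same food gets an edge; the `build_graph` name and the toy `recipes` data are hypothetical and are not the article's dataset.

```python
from itertools import combinations


def build_graph(recipes):
    """Return {ingredient: set of ingredients it shares at least one food with}."""
    graph = {}
    for ingredients in recipes.values():
        unique = sorted(set(ingredients))  # ignore duplicates within one food
        for name in unique:
            graph.setdefault(name, set())
        # Every pair of ingredients in the same food becomes an undirected edge.
        for first, second in combinations(unique, 2):
            graph[first].add(second)
            graph[second].add(first)
    return graph


# Toy data for illustration only.
recipes = {
    "pizza": ["salt", "cheese", "tomato"],
    "soup": ["salt", "onion"],
}
graph = build_graph(recipes)
print(graph["salt"])
```

A graph library such as NetworkX could hold the same structure; the dictionary form is used here only to keep the example dependency-free.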

Analysis:

Using degree centrality and eigenvector centrality, I pinpointed the three most central ingredients (nodes): salt, spice, and onions. These three ingredients consistently ranked highest in both centrality metrics. While the prominence of salt and spice was anticipated, the significance of onions in cooking was intriguing. Store managers can use this insight to stock these items abundantly and place them at strategic, convenient locations, such as the store’s entrance.
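For readers unfamiliar with the two metrics, here is a small dependency-free sketch. Degree centrality is the fraction of other nodes a node touches; eigenvector centrality is approximated by power iteration. The toy graph and function names are assumptions, and the iteration adds each node's own score (equivalent to using the adjacency matrix plus the identity), which keeps the same leading eigenvector while avoiding oscillation.

```python
import math


def degree_centrality(graph):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(graph)
    if n <= 1:
        return {node: 0.0 for node in graph}
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


def eigenvector_centrality(graph, iterations=200, tolerance=1e-9):
    """Approximate eigenvector centrality by power iteration on (A + I)."""
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        # Each node collects its own score plus the scores of its neighbors.
        updated = {
            node: scores[node] + sum(scores[nbr] for nbr in graph[node])
            for node in graph
        }
        norm = math.sqrt(sum(value * value for value in updated.values()))
        updated = {node: value / norm for node, value in updated.items()}
        if max(abs(updated[node] - scores[node]) for node in graph) < tolerance:
            return updated
        scores = updated
    return scores


# Toy graph for illustration only: salt co-occurs with everything else.
toy_graph = {
    "salt": {"cheese", "tomato", "onion"},
    "cheese": {"salt", "tomato"},
    "tomato": {"salt", "cheese"},
    "onion": {"salt"},
}
degree = degree_centrality(toy_graph)
eigen = eigenvector_centrality(toy_graph)
print(max(degree, key=degree.get), max(eigen, key=eigen.get))
```

In this toy graph both metrics rank salt first, which mirrors the kind of agreement between the two metrics reported above.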

Limitations:

A notable limitation of this network analysis is the modest sample size: only 75 distinct foods were analyzed. This limitation arose primarily because making numerous API calls for a larger set of foods proved too time-intensive.

https://github.com/dvc0310/ingredient_network_analysis
