Preserved Menus Network Exploration

Simon Miller
INST414: Data Science Techniques
7 min read · Feb 23, 2024

The Questions

Food has always been a driving force in industrialization and culture. Restaurants across the world specialize in serving a variety of dishes to different cultural palates. It is reasonable to think that network analysis can be used to identify different varieties of restaurants based on the type of food they serve (e.g., seafood, American, Italian). By creating edges based on dishes shared between restaurants, centrality could be used to identify how unique each restaurant is. Based on this, three questions can be asked:

1. What communities will emerge in a network analysis of restaurants?

2. What restaurants are the most unique based on centrality?

3. What restaurants are the least unique based on centrality?

Answering these questions could be useful for several different groups. The owner of the data could use the results as metadata to categorize restaurants by the kind of food they serve. They could also be useful for organizations with a focus on history, helping them identify which restaurants specialized in which cuisine.

Data Subset & Cleaning

The New York Public Library hosts a significant, historic collection of menus from the 1800s to the 2000s. An initiative named “What’s on the menu?” has seen this collection digitally transcribed by crowdsourcing, formatted, and placed in several CSV files. These files (Dish.csv, Menu.csv, MenuItem.csv, and MenuPage.csv) were provided to the author via email. For the purposes of this exercise, the sponsor (restaurant) in Menu.csv and the dish_id in MenuItem.csv will be used to create the network. Since the dataset is normalized, the CSV files must be joined to connect sponsors to dishes; notably, MenuPage.csv provides the link between Menu.csv and MenuItem.csv. To retrieve which sponsors serve which dishes, the following Python code is executed.

import re
import pandas as pd

dfMenu = pd.read_csv("Menu.csv")
dfMenuPage = pd.read_csv("MenuPage.csv")
dfMenuItem = pd.read_csv("MenuItem.csv")

menuDict = {}  # sponsor -> list of dish ids

for index, row in dfMenu.iterrows():  # iterate through the rows
    menuId = row["id"]        # save the menu id
    sponsor = row["sponsor"]  # save the sponsor name

    # stop once the dictionary holds 100 sponsors
    if len(menuDict) == 100:
        break

    # use regex to find menus whose event column mentions dinner
    if re.search(r'\bDINNER\b', str(row["event"])):
        dishList = []  # empty list to store this menu's dishes

        # find the page ids (menu_page_id in MenuItem.csv) for this menu
        seriesPageID = dfMenuPage.query(f'menu_id == {menuId}')['id']

        # for each menu page id...
        for menuPageID in seriesPageID:
            # ...find the dish ids on that page and drop NaN values
            seriesDishID = dfMenuItem.query(f'menu_page_id == {menuPageID}')['dish_id'].dropna()

            for dishID in seriesDishID:
                dishList.append(int(dishID))  # append the dish to the dish list

        # add the dishes to the dictionary, appending if the sponsor already exists
        if sponsor in menuDict:
            menuDict[sponsor] += dishList
        else:
            menuDict[sponsor] = dishList

This piece of code returns a dictionary mapping each sponsor to the list of dishes that sponsor has served. The analysis will be performed on this dictionary. Due to the large scale of the files, the dictionary is limited to 100 entries: once it reaches a length of 100, the for-loop breaks. Not only does this save time and processing power in this code block, but also in the network-creation code block discussed later.

Regex is used to search for mentions of the word “DINNER” in the event column of Menu.csv. The idea is to prevent communities from forming around breakfast, lunch, and dinner instead of around cuisine. Note that Dish.csv, the file that contains dish names, is not used. While it would be useful for a more in-depth analysis, it is not necessary for creating the network. Using the dish_id keeps the code lean and avoids loading and parsing another large file. If needed in the future, an additional function can look up the names of dishes by id. Other unneeded metadata within the used files, such as the date of the menu, page numbers, physical description, and occasion, is also left out.
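Such a lookup could be a small helper like the following sketch. The function name and the assumption that Dish.csv has "id" and "name" columns are the author's guesses here, not confirmed by the source.

```python
import pandas as pd

def dish_name(dfDish, dish_id):
    """Return the name recorded for dish_id in the Dish.csv DataFrame, or None."""
    match = dfDish.loc[dfDish["id"] == dish_id, "name"]
    return match.iloc[0] if not match.empty else None
```

With `dfDish = pd.read_csv("Dish.csv")` loaded once, `dish_name(dfDish, 96)` would map an id back to a human-readable name on demand.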

Network Analysis

1. What communities will emerge in a network analysis of restaurants?

2. What restaurants are the most unique based on centrality?

3. What restaurants are the least unique based on centrality?

To answer all three of these questions, a network needs to be created. Utilizing networkx, each sponsor in the dictionary is made into a node. In a series of nested loops, each dish from each sponsor is compared against each dish from every other sponsor. If two dish ids match, the weight of the edge between the two corresponding sponsors is incremented. The comparison is quadratic in the number of sponsors (and in the lengths of their dish lists), which is part of the reason the dictionary is limited to 100 entries. The following code takes from several seconds to several minutes to run.

import networkx as nx

g = nx.Graph()

# create a node for every sponsor
for sponsor in menuDict:
    g.add_node(str(sponsor))

menuItems = list(menuDict.items())

# double-iterate through the sponsors and their dish lists
for i, (sponsor1, dishlist1) in enumerate(menuItems):
    for sponsor2, dishlist2 in menuItems[i + 1:]:

        # double-iterate through the dish lists
        for dishID1 in dishlist1:
            for dishID2 in dishlist2:
                if dishID1 == dishID2:
                    # get the current weight if the edge exists
                    current_weight = g.get_edge_data(str(sponsor1), str(sponsor2), default={"weight": 0})["weight"]
                    # add (or update) the edge
                    g.add_edge(str(sponsor1), str(sponsor2), weight=current_weight + 1)
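The dish-by-dish comparison above does len(dishlist1) × len(dishlist2) work per sponsor pair. A faster but equivalent computation, sketched here under the assumption that the same menuDict is available, counts each sponsor's dishes once and multiplies counts for shared ids (the function name edge_weights is the author's choice, not from the original code):

```python
from collections import Counter
from itertools import combinations

def edge_weights(menuDict):
    """Weight for every sponsor pair: the number of matching (dish, dish)
    comparisons, i.e. the sum over shared dish ids of count1 * count2,
    which is exactly what the nested loops compute."""
    counts = {s: Counter(d) for s, d in menuDict.items()}
    weights = {}
    for s1, s2 in combinations(menuDict, 2):
        c1, c2 = counts[s1], counts[s2]
        w = sum(c1[d] * c2[d] for d in c1.keys() & c2.keys())
        if w:
            weights[(str(s1), str(s2))] = w
    return weights
```

The resulting dictionary can be fed straight into `g.add_edge(s1, s2, weight=w)`, touching each dish list only once instead of once per pair.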

Networkx is once again used, this time to write a GraphML file containing the network. This file is imported into Gephi to create a graph. Within Gephi, the “Yifan Hu” layout is applied. Edges are filtered by weight so that edges with insignificant weight are excluded (the weight filter is set to 835.8–28400.0). Thicker lines indicate greater weight, and the size of each node represents its degree. What is returned is a graph that shows little to no significant community structure.
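The export step itself is a single networkx call. A minimal sketch, with a toy two-node graph standing in for the real g and a guessed filename:

```python
import networkx as nx

# toy graph in place of the sponsor network built earlier
g = nx.Graph()
g.add_edge("Cunard Line", "Hotel Savoy", weight=3)

# write the network to a GraphML file that Gephi can open
nx.write_graphml(g, "menus.graphml")
```

GraphML preserves the typed `weight` attribute, so Gephi can filter and size edges by it directly after import.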

To find the most and least unique sponsors, the following code calculates the degree centrality.

# calculate degree centrality for all nodes
centrality_degree = nx.degree_centrality(g)

# sort the node-centrality dictionary descending to get the top elements first
for i, u in enumerate(sorted(centrality_degree, key=centrality_degree.get, reverse=True)[:3], start=1):
    print(f"{i}. {u} has a centrality of {centrality_degree[u]}")

print("===================================================")

# sort the node-centrality dictionary ascending to get the bottom elements
for i, u in enumerate(sorted(centrality_degree, key=centrality_degree.get)[:3], start=1):
    print(f"{i}. {u} has a centrality of {centrality_degree[u]}")

The least unique sponsors, those with the highest centrality, are:

1. Cunard Line with a centrality of 0.92

2. Hotel Savoy with a centrality of 0.90

3. Maxwell House with a centrality of 0.88

The most unique:

1. Mr. S.R. Bloomfield with a centrality of 0.0

2. Legation Des Etat-unis D’ Amerique with a centrality of 0.01

3. Timeo Hotel with a centrality of 0.2

Discussion & Potential Bias

Why did no significant communities form? Although more analysis and research should be performed, one hypothesis is that the preserved menus overwhelmingly focus on one type of food: American. Most preserved menus may reflect a very American and European view. However, this begins to go beyond the scope of this post and the knowledge of the author. Further analysis and research are needed.

The three most important nodes in the graph are Cunard Line (red), Hotel Savoy (yellow), and Maxwell House (blue).

Cunard Line is a shipping company that operated first-generation luxury ocean liners, such as the RMS Lusitania and RMS Mauretania. Hotel Savoy is a luxury hotel that was one of the first to define the concept. Both are luxury companies fondly remembered by those who got to experience them. Maxwell House, on the other hand, is a well-known coffee brand. While troubleshooting the cleaning code block, a print statement listing all the dish ids for each sponsor revealed that a significant majority of sponsors served coffee (id 96). This suggests that Maxwell House is such a significant node because it primarily serves coffee.
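That troubleshooting observation can be turned into a small check. The helper below is a sketch by the author of this rewrite, assuming the menuDict built earlier; the dish id 96 for coffee comes from the source text.

```python
def sponsors_serving(menuDict, dish_id):
    """Return the sponsors whose dish list contains dish_id."""
    return [s for s, dishes in menuDict.items() if dish_id in dishes]
```

Comparing `len(sponsors_serving(menuDict, 96))` against `len(menuDict)` quantifies how widespread coffee is, rather than eyeballing printed id lists.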

All three of these companies still exist today, were large, and were influential in what they did. On the other hand, the companies with the lowest centrality (Mr. S.R. Bloomfield, Legation Des Etat-unis D’Amerique, Timeo Hotel) are not nearly as influential. This suggests a bias in which menus were preserved, and that the number of menus preserved per sponsor is skewing the data. The more dishes a sponsor has listed in the dictionary, the greater the chance an edge is created, because its dishes can be matched against what other sponsors list on their menus. This is supported by the lengths of the dish lists in the dictionary.

High Centrality:

1. Cunard Line: 603

2. Hotel Savoy: 107

3. Maxwell House: 61

Low Centrality:

1. Mr. S.R. Bloomfield: 10

2. Legation Des Etat-unis D’ Amerique: 12

3. Timeo Hotel: 5

This supports the conclusion that the analysis is biased toward sponsors with more dishes. Even so, reasonable conclusions can still be made.

Conclusions

Although the questions presented at the beginning of this post are largely left unanswered due to bias in what menus were preserved, the following conclusions can be made.

Cunard Line, Hotel Savoy, and Maxwell House are all significant sponsors within the dataset.

Mr. S.R. Bloomfield, Legation Des Etat-unis D’ Amerique, and Timeo Hotel are not significant sponsors within the dataset.

Maxwell House is not a sponsor that serves unique dishes because they are a coffee company, and most sponsors serve coffee.

More research and analysis should be performed on this topic to better understand the New York Public Library’s database of menus. Furthermore, including all events instead of just using dinner menus could help to define communities better in a network graph. Additionally, this analysis could be continued by finding which dishes are the most common and which sponsors served them.
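The proposed follow-up, finding the most common dishes, could start from the existing menuDict with a small sketch like this (the function name and the choice to count distinct sponsors rather than total servings are assumptions by this rewrite):

```python
from collections import Counter

def most_common_dishes(menuDict, n=3):
    """Count how many distinct sponsors served each dish id."""
    c = Counter()
    for dishes in menuDict.values():
        c.update(set(dishes))  # count each sponsor at most once per dish
    return c.most_common(n)
```

Combined with a Dish.csv name lookup, this would directly list the dishes (likely coffee among them) driving the densest parts of the network.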

Resources

New York Public Library “What’s on the menu?”: https://menus.nypl.org/

GitHub Repository: https://github.com/smiller1551/MenuAnalysis

This Medium post was created by Simon Miller at the University of Maryland — College Park for INST414: Data Science Techniques under Professor Cody Buntain.
