Attempting to Cluster Historic Menus

Simon Miller
INST414: Data Science Techniques
May 15, 2024

Author’s Note

This post is a continuation of “Similarities Between Ocean Liner Menus and More,” which itself is a continuation of “Preserved Menus Network Exploration.” Both posts will be referenced.

The Question

Throughout my last semester at the University of Maryland, I have performed a variety of analyses on the New York Public Library’s “What’s on the menu?” database, which contains data on historic menus. My interest in this database grew out of my passion for history: it contains a wealth of ocean liner menus and railroad dining car menus, both of which are of interest to me. Since the beginning of the class this Medium post is written for (Data Science Techniques), I have theorized about grouping restaurants based on the type of food they serve. For example, sponsors (in the dataset, a sponsor can be a restaurant, ship, hotel, etc.) could be categorized as general fare, seafood restaurants, Italian restaurants, and so on. I attempted to create clusters in “Preserved Menus Network Exploration,” but ultimately failed and concluded that the dishes served by sponsors were too similar. In “Similarities Between Ocean Liner Menus and More,” some similarities were shown between sponsors that offered both overnight accommodations and food. This restored some hope in the theory of categorizing sponsors based on the type of food being served. It wasn’t until Module 4 of the class, where we learned about unsupervised machine learning, that I once again hypothesized that clustering by dishes could be done. With this story in mind, this post attempts to answer the following question:

Can unsupervised learning accurately group together sponsors in the New York Public Library’s “What’s on the menu?” database in a meaningful way, and if so, what are the groups?

The specific stakeholder for this question is the New York Public Library; the answer would help them categorize sponsors in their database based on the dishes they served. A new column could be created in the Menu file describing what kind of dishes were served by each sponsor.

An example of a menu from the New York Public Library’s “What’s on the menu?” database. This menu was served by Healy’s Forty-second Street Restaurants, a sponsor that sold a significant number of oysters.

The Data

The ideal data for this analysis comes from the New York Public Library’s “What’s on the menu?” database; it was retrieved via email from the library. The “What’s on the menu?” program oversees the digitization and storage of images and data from the library’s collection of menus, which ranges from 1851 to 2015. The data is normalized and stored in several .csv files: Dishes.csv, Menu.csv, MenuItem.csv, and MenuPage.csv. To determine which sponsors served which dishes, IDs must be used to navigate through the various files and connect sponsors to dishes. In “Preserved Menus Network Exploration” and “Similarities Between Ocean Liner Menus and More,” this was done with a series of for loops. That approach was cumbersome and slow, allowing only 250 sponsors to be loaded before the block’s execution time became too long. For this analysis, the loops were replaced with the Pandas merge function, which allowed the entire dataset to be used, resulting in 2,593 sponsors. The Python code for this is shown below:

menu_to_menupage = pd.merge(dfMenu, dfMenuPage, left_on='id', right_on='menu_id') #join Menu.id to MenuPage.menu_id
menupage_to_menuitem = pd.merge(menu_to_menupage, dfMenuItem, left_on='id_y', right_on='menu_page_id') #join MenuPage.id (suffixed to id_y by the first merge) to MenuItem.menu_page_id

menu_groups = menupage_to_menuitem.groupby('sponsor')['dish_id'].agg(list) #collect each sponsor's dish_ids into a list
menu_groups = menu_groups.reset_index() #reset the index

menu_groups

To fix compatibility issues with blocks reused from the previous analyses, the DataFrame returned by this block was converted into a dictionary:

menuDict = menu_groups.set_index('sponsor')['dish_id'].to_dict() #convert dataframe into a dictionary
menuDict

Similarity Analysis

Code from “Similarities Between Ocean Liner Menus and More” was then used to create matrices. The rows contain each sponsor, the columns contain each unique dish_id, and each cell counts how many times a sponsor served that dish. The two blocks of code below create the matrix. Please note that these blocks take some time to execute, as they still use for loops; future work could improve their speed and efficiency.

uniqueList = [] #create an empty list

#iterate through each list in the menu dictionary
for ilist in menuDict.values():

    #iterate through each dish in the list
    for dish in ilist:

        #if the dish is not in the unique list, add it
        if dish not in uniqueList:
            uniqueList.append(dish)

print(len(uniqueList))
print(uniqueList)
#create a dataframe of zeros, with the unique list as columns and the dictionary keys as the index
df = pd.DataFrame(0, columns=uniqueList, index=menuDict.keys())

#iterate through each key and value pair in the dictionary
for sponsor, dishes in menuDict.items():

    #iterate through each dish in the list
    for dish in dishes:
        df.loc[sponsor, dish] += 1 #add 1 to that cell in the dataframe

df
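The nested loops above could also be replaced with a vectorized pandas construction: flattening the dictionary into long form and pivoting with pd.crosstab builds the same sponsor-by-dish count matrix in one call. A minimal sketch, using a hypothetical miniature version of menuDict:

```python
import pandas as pd

# Hypothetical miniature version of menuDict: sponsor -> list of dish_ids.
toy_menuDict = {
    "CUNARD LINE": [1, 2, 2, 3],
    "RED STAR LINE": [2, 3],
}

# Flatten the dictionary into long form (one row per served dish), then
# pivot with crosstab; this replaces both nested for loops at once.
long_form = pd.DataFrame(
    [(sponsor, dish) for sponsor, dishes in toy_menuDict.items() for dish in dishes],
    columns=["sponsor", "dish_id"],
)
counts = pd.crosstab(long_form["sponsor"], long_form["dish_id"])
```

Here counts.loc["CUNARD LINE", 2] is 2, matching what the loop-built matrix would contain.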

To avoid magnitude-related issues, L1 normalization was performed. This ensures that larger sponsors do not overpower smaller sponsors when finding similarities. The following code block performs this work:

df_norm = df.divide(df.sum(axis=1), axis=0) #this performs L1 normalization
df_norm.head(10)
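For reference, the same row-wise division can be sanity-checked against scikit-learn’s normalize helper; the toy count matrix below is hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import normalize

# Hypothetical toy count matrix standing in for df: two sponsors, three dishes.
toy = pd.DataFrame([[2, 1, 1], [0, 3, 1]],
                   index=["CUNARD LINE", "RED STAR LINE"],
                   columns=["d1", "d2", "d3"])

# Row-wise L1 normalization: divide each row by its sum so every row sums to 1.
toy_norm = toy.divide(toy.sum(axis=1), axis=0)

# scikit-learn's normalize helper gives the same result.
assert np.allclose(toy_norm.values, normalize(toy, norm="l1"))
```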

Additionally, cosine similarity was computed on the original matrix, much like in “Similarities Between Ocean Liner Menus and More.” This allows us to confirm that potential similarities between sponsors remain despite the significant increase in the number of sponsors. As in the original post, cosine similarity was used because it is based on the angle between two vectors and does not take magnitude into account. The function below, from the original post, performs cosine similarity on the original matrix (before L1 normalization); entering the name of a sponsor returns the top 10 most similar sponsors.

def cos_sim(row):
    """
    This function takes in the name of a row and prints the top 10 most similar sponsors, utilizing cosine distance.
    Heavily inspired by Professor Cody Buntain's code as seen below:
    https://github.com/cbuntain/umd.inst414/blob/main/Module03/02-Similarity.ActorsGenre.Normed.ipynb

    Inputs:
        row (string): A string representing the row
    """

    #gather the dishes for that sponsor
    target_sponsor = df.loc[row]

    #generate cosine distances from that sponsor to all the others
    distances = scipy.spatial.distance.cdist(df, [target_sponsor], metric="cosine")[:,0]

    query_distances = list(zip(df.index, distances))

    #print the top ten most similar sponsors to our target
    i = 1
    for similar_sponsor, similar_dish_score in sorted(query_distances, key=lambda x: x[1], reverse=False)[:10]:
        print(f"{i}.", similar_sponsor, similar_dish_score, df.loc[similar_sponsor].sum())
        i += 1
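On a hypothetical toy matrix, the same cdist call shows why magnitude does not matter here: a sponsor serving the exact same mix of dishes at twice the volume still has a cosine distance of (roughly) zero.

```python
import pandas as pd
import scipy.spatial.distance

# Hypothetical toy matrix: sponsor B serves the same mix as A, just twice as much.
toy = pd.DataFrame([[2, 1, 1], [4, 2, 2], [0, 3, 1]],
                   index=["A", "B", "C"], columns=["d1", "d2", "d3"])

target = toy.loc["A"]
dists = scipy.spatial.distance.cdist(toy, [target], metric="cosine")[:, 0]
# A and B are at distance ~0 despite different magnitudes; C is farther away.
```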

Comparing similarities with the original post, “Similarities Between Ocean Liner Menus and More,” can help identify whether similarities will carry over into clustering. Below are the top 10 most similar sponsors to the maritime company Cunard Line from the original similarity analysis:

1. CUNARD LINE

2. USMS

3. USMS ST Louis

4. U.S.M.S.

5. Headquarters 47th Infantry US Volunteers

6. U.S.M.S

7. RED STAR LINE

8. OCEAN STEAMSHIP CO.

9. D&H DINING CAR SERVICE

10. BATTERY PARK HOTEL

The following are the top 10 most similar sponsors from the current analysis.

1. CUNARD LINE

2. LAKEWOOD COUNTRY CLUB

3. TAMPA BAY HOTEL

4. RED STAR LINE

5. HAMBURG-AMERIKA LINIE

6. NORDDETSCHER LLOYD BREMEN

7. NIPPON YUSEN KAISHA

8. PRINCESS HOTEL

9. Waldorf Astoria

10. ST. REGIS HOTEL

The two similarity analyses show one common trend: menus from Cunard Line are similar to menus served by high-class sponsors that provided overnight accommodations. Although not shown, this is also true when comparing the top 10 most similar sponsors for Norddeutscher Lloyd Bremen and the Occidental & Oriental Steamship Company (these can be compared in the Jupyter Notebook files available on GitHub). Since the original similarity analysis came to a similar conclusion, it can be concluded that adding additional sponsors did not change the conclusions of the original analysis. This also begins to paint a picture of a potential cluster: high-class sponsors providing overnight accommodations.

Attempting to Cluster

The clustering for this analysis uses sklearn’s clustering modules. The following two blocks of code, modified from Professor Cody Buntain’s GitHub, attempt to find a proper k value utilizing the elbow method. Please note that because of the large amount of data, these blocks take longer to execute.

df_norm.columns = df_norm.columns.astype(str) #ensure columns are strings

# Let us test different values of k
inertia_scores = []

for test_k in sorted(set(np.random.randint(2, 25, 20))):
    print("Test k:", test_k)

    tmp_model = MiniBatchKMeans(
        n_clusters=test_k,
        n_init=16, max_iter=2048, tol=0.5, reassignment_ratio=0.5
    )
    tmp_model.fit(df_norm)

    score = tmp_model.inertia_
    inertia_scores.append((test_k, score))

inertia_df = pd.DataFrame(inertia_scores, columns=["k", "score"])

fig = plt.figure(figsize=(10,5))
ax = fig.add_subplot(1,1,1)

inertia_df.sort_values(by="k").plot("k", "score", ax=ax)

ax.set_ylabel("Inertia")

plt.show()

The graph produced by the block of code above.

Several iterations of this code were run before this graph was produced. Parameter values in the first block of code had to be reduced because producing the test k values took far too long. Additionally, larger values produced straighter lines in the graph, while smaller values produced a shape closer to an elbow. Although the graph does not have an ideal elbow shape, a k value of 18 was chosen. The following blocks of code perform the clustering.

k = 18
cluster_model = KMeans(n_clusters=k)
cluster_model.fit(df_norm)
cluster_labels = cluster_model.predict(df_norm)
sponsor_cluster_df = pd.DataFrame(cluster_labels, index=df_norm.index, columns=["cluster"])
sponsor_cluster_df["cluster"].value_counts()

for cluster, sponsors in sponsor_cluster_df.groupby("cluster"):
    print("Cluster:", cluster, "Size:", sponsors.shape[0])

    for sponsor in sponsors.index:
        print(sponsor)
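Since the elbow plot was ambiguous, the silhouette score (higher is better, with a maximum of 1) could offer a complementary check on the choice of k in future iterations. A minimal sketch, using hypothetical toy blobs in place of df_norm:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical toy data with two obvious groups, standing in for df_norm.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(1, 0.1, (20, 2))])

# Score each candidate k by the average silhouette of its clustering.
scores = {}
for test_k in range(2, 6):
    labels = KMeans(n_clusters=test_k, n_init=10, random_state=0).fit_predict(X)
    scores[test_k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

On this toy data the two planted groups make k = 2 score highest; on the real menu matrix the scores would likely be far less decisive.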

The following cluster sizes emerged. A description of the sponsors in each cluster follows.

· Cluster 4: 1638

This cluster contains a wide variety of seemingly random sponsors, including many hotels, restaurants, maritime organizations, and railroads.

· Cluster 11: 512

Much like Cluster 4, this cluster contains a wide variety of seemingly random sponsors, with no apparent trend.

· Cluster 7: 344

No trend

· Cluster 3: 73

No trend

· Cluster 9: 7

No trend

· Cluster 14: 5

No trend

Note: The remaining clusters had either one or two sponsors each. You can see the full list of clusters, and the sponsors within each, in the GitHub repository for this project, linked at the bottom of this Medium post.

Discussion

Can unsupervised learning accurately group together sponsors in the New York Public Library’s “What’s on the menu?” database in a meaningful way, and if so, what are the groups?

Unsupervised learning can cluster sponsors in the New York Public Library’s “What’s on the menu?” database, but as described above, the clusters do not seem to be meaningful in any way. So, to answer the question: no, unsupervised learning cannot group sponsors in the database in a meaningful way. The rationale behind using unsupervised learning in this analysis was that similar dishes would “automatically” reveal connections between sponsors. However, it seems the dishes are even more similar across sponsors than was assumed during “Similarities Between Ocean Liner Menus and More.” Unfortunately, this analysis cannot provide the New York Public Library with categories of sponsors.

Limitations & Future

Time restrictions did not allow for an exploration of dishes in this analysis. A basic overview of dishes common between sponsors could assist in drawing conclusions, so future work could analyze the most common dishes in the database. This could help create a supervised method for categorizing sponsors based on their dishes. For example, a decision tree could be used to sort through the different dishes: if a restaurant serves more seafood than burgers and fries, it could be categorized as a seafood restaurant. Repeating this for a variety of dishes could potentially create more meaningful groups than the clusters this analysis produced.
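The rule described above could be sketched as a simple decision function; the keyword sets and category names below are purely hypothetical placeholders for curated dish lists:

```python
# Hypothetical keyword sets; real work would need curated lists of dish names.
SEAFOOD = {"oysters", "lobster", "clam chowder"}
GENERAL = {"hamburger", "french fries", "roast beef"}

def categorize(dishes):
    """Label a sponsor by comparing counts of seafood vs. general-fare dishes."""
    seafood_count = sum(1 for d in dishes if d in SEAFOOD)
    general_count = sum(1 for d in dishes if d in GENERAL)
    if seafood_count > general_count:
        return "seafood restaurant"
    if general_count > seafood_count:
        return "general fare"
    return "uncategorized"
```

For example, categorize(["oysters", "lobster", "hamburger"]) would return "seafood restaurant". Chaining rules like this for many dish types would approximate the decision tree idea.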

Conclusion

This is my final Medium post for INST414: Data Science Techniques at the University of Maryland. Several techniques from the class were used: L1 normalization, cosine similarity, and clustering with unsupervised learning. Although this analysis did not provide any actionable insight, it does showcase key concepts learned in the class. Future work exploring dishes with a supervised method could produce more meaningful clusters than this analysis created.

Resources

New York Public Library “What’s on the menu?”: https://menus.nypl.org/

GitHub Repository: https://github.com/smiller1551/MenuAnalysis3

Professor Cody Buntain’s Github: https://github.com/cbuntain

This Medium post was created by Simon Miller at the University of Maryland — College Park for INST414: Data Science Techniques under Professor Cody Buntain.
