A Recommendation Engine for Recipes Using Collaborative Filtering in Python

Julia Wu
Web Mining [IS688, Spring 2021]
Apr 21, 2021

One of my hobbies is cooking, and I came across an awesome website, Food.com, that offers tons of recipes for all kinds of dishes. Its search engine helps me find recipes for a specific dish and suggests related dishes that use similar ingredients or fall in the same category. These features are great, but sometimes I want to try dishes in entirely different styles. Since no specific keywords come to mind, it often takes me a while to find something interesting among tons of recipes. In this situation, a recommendation engine based on other users' ratings could help me find interesting recipes more quickly. Therefore, I am wondering:

How can I build an engine that recommends recipes based on other users' choices using memory-based collaborative filtering? And how do I evaluate the performance of the recommendation engine?

Source code is on my GitHub!

Collect the data

This time I download a pre-existing dataset for Food.com from Kaggle, which includes two CSV files. The interactions_train CSV file contains around 699,000 rating records covering about 160,000 recipe IDs rated by 25,000 user IDs, and the RAW_recipes CSV file contains about 230,000 recipes with names, ingredients, descriptions, steps, and so on. A bar chart of the rating distribution shows that most ratings are five-star.

import numpy as np
import pandas as pd

# Load the ratings (interactions) and the recipe metadata
I = pd.read_csv('recipe/interactions_train.csv')
R = pd.read_csv('recipe/RAW_recipes.csv')
I.info()
R.info()
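
The bar chart of the rating distribution is not reproduced here; below is a minimal sketch that would draw it from the interactions data frame, assuming a "rating" column as in the Kaggle file.

import matplotlib.pyplot as plt

# Count how many reviews fall in each star rating and plot them
I['rating'].value_counts().sort_index().plot(kind='bar')
plt.xlabel('rating')
plt.ylabel('number of reviews')
plt.show()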

Clean up data

For ease of reading, I load the two files as data frames and clean up the data by keeping only the columns I need for building this recommendation engine. From the interactions_train data frame, I keep the user_id, recipe_id, and rating columns.

Bug Attention!

When I try to apply cosine similarity to the interactions_train data frame in the next steps, I receive an error message showing that my machine does not have enough memory.

To solve this issue, I decide to keep only the rows where the user is among the top 7,500 users who give the most reviews and the recipe is among the top 7,500 recipes that receive the most reviews. To get the top 7,500 users, I group the data frame by "user_id" and aggregate by the count of "recipe_id" to get a new data frame, grouped_1. To get the top 7,500 recipes, I group the data frame by "recipe_id" and aggregate by the count of "user_id" to get a new data frame, grouped_2. Finally, I inner join the interactions_train data frame with grouped_1 and then with grouped_2 to get a data frame "_part" containing all qualified rows, around 222,000 in total. In "_part," there are 7,481 unique users and 7,500 unique recipes. This data seems to be sparse, since the users rate only about 30 recipes on average and the recipes receive only about 30 ratings on average. We can also see that the distribution of ratings in this sample is similar to the distribution in the full dataset. A sketch of this filtering step is shown below.
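
This is a minimal sketch of the filtering, assuming the column names above; the intermediate variable "ratings" is introduced here for illustration and the article's own code is not shown.

# Keep only the rating columns selected earlier
ratings = I[['user_id', 'recipe_id', 'rating']]

# Top 7,500 users by number of reviews given
grouped_1 = ratings.groupby('user_id', as_index=False) \
                   .agg(n=('recipe_id', 'count')) \
                   .nlargest(7500, 'n')[['user_id']]

# Top 7,500 recipes by number of reviews received
grouped_2 = ratings.groupby('recipe_id', as_index=False) \
                   .agg(n=('user_id', 'count')) \
                   .nlargest(7500, 'n')[['recipe_id']]

# Inner joins keep only rows whose user AND recipe are both in the top lists
_part = ratings.merge(grouped_1, on='user_id').merge(grouped_2, on='recipe_id')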

print('unique users:',len(_part.user_id.unique()))
print('unique recipes:',len(_part.recipe_id.unique()))

Bug Attention!

In the next steps, when I try to map the rating values into the train and test data sets, I receive an error message saying that the index is out of bounds. To solve this issue, I use dictionaries to assign new IDs to the qualified users and recipes, in the range of the number of unique values for each. The clean data frame for "interactions_train" is "df."
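
The article builds these dictionaries but does not show the code; a minimal sketch follows, assuming the new IDs run from 1 to N, consistent with the row[...]-1 indexing used later.

# Assumed construction: map each original ID to a compact ID from 1 to N
new_userID = {uid: i for i, uid in enumerate(_part.user_id.unique(), start=1)}
new_recipeID = {rid: i for i, rid in enumerate(_part.recipe_id.unique(), start=1)}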

df = _part.replace({'user_id': new_userID, 'recipe_id': new_recipeID})

From the RAW_recipes data frame, I keep the name, ingredients, and recipe ID columns, and keep only the rows that also appear in the _part data frame by right joining "_part" onto "RAW_recipes." The clean data frame for "RAW_recipes" is "recipe."
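
A sketch of this join, assuming the recipe ID column in RAW_recipes.csv is named "id" (it is "recipe_id" in the interactions file); the article's own code is not shown.

# Assumed join: rename "id" so the key matches the interactions frame
recipe = R[['name', 'ingredients', 'id']].rename(columns={'id': 'recipe_id'})
recipe = recipe.merge(_part[['recipe_id']].drop_duplicates(),
                      on='recipe_id', how='right')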

Although there is one recipe without a name in "RAW_recipes," that recipe does not appear in the interactions_train data frame, so nothing needs to be done about it.

Adjust the ratings

I apply centered cosine to "df," which means subtracting each user's mean rating from that user's ratings. This has two advantages. First, it minimizes the differences between "hard raters" and "easy raters." Second, it makes it sensible to fill all unrated recipes with "0," since after centering the mean rating for each qualified user is "0."
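
One way to perform this centering, as a sketch assuming the "df" columns above:

# Subtract each user's mean rating so every user's centered mean is 0;
# unrated recipes can then be filled with 0 without bias
df['rating'] = df['rating'] - df.groupby('user_id')['rating'].transform('mean')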

Split data sets

To measure the performance of this recommendation engine later, I split "df" into a ¾ train set and a ¼ test set. Since the dictionaries assign new IDs in the range of the number of unique users and recipes, I can map the rating values into the train and test matrices by their indexes.

from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(df, test_size=0.25)
n_users = df.user_id.unique()
n_items = df.recipe_id.unique()

# Build a user x recipe matrix; new IDs run from 1 to N, hence the -1
train_data_matrix = np.zeros((n_users.shape[0], n_items.shape[0]))
for row in train_data.itertuples():
    train_data_matrix[row[1]-1, row[2]-1] = row[3]
display(train_data_matrix.shape)
display(train_data_matrix)

test_data_matrix = np.zeros((n_users.shape[0], n_items.shape[0]))
for row in test_data.itertuples():
    test_data_matrix[row[1]-1, row[2]-1] = row[3]
display(test_data_matrix.shape)
display(test_data_matrix)

Centered cosine similarity

There are two major types of memory-based collaborative filtering.

User-based: “Users who are similar to you also like …”

Item-based: “Users who like this item also like …”

Since I am not sure which type gives better predictions for this data set, I calculate the cosine similarity for both. I use pairwise cosine distances from scikit-learn to compute the similarity between all qualified users (user_similarity) and between all qualified recipes (item_similarity). For user-based collaborative filtering, the similarity is computed along the rows of the matrix, whereas for item-based collaborative filtering, it is computed along the columns. The idea is to measure the cosine of the angle between two vectors in a multi-dimensional space. Since each qualified user's ratings are centered at "0," the cosine lies between -1 and 1: the smaller the cosine, the less similar two users or recipes are.

from sklearn.metrics.pairwise import pairwise_distances

# Similarity = 1 - cosine distance, along rows (users) and columns (recipes)
user_similarity = 1 - pairwise_distances(train_data_matrix, metric='cosine')
display(user_similarity.shape)
display(user_similarity)

item_similarity = 1 - pairwise_distances(train_data_matrix.T, metric='cosine')
display(item_similarity.shape)
display(item_similarity)

Prediction

Since I only want the recommended recipes to have relatively high predicted ratings, I calculate relative ratings instead of real ratings. For the user-based prediction, I take the dot product of the user-similarity matrix and the train matrix to get a weighted sum: the prediction for item "i" and user "u" is the sum of the adjusted ratings given on "i" by users similar to "u," each weighted by the similarity between "u" and that user. The weighted sum is then scaled by the sum of the similarities to normalize the prediction. This gives a matrix of relative ratings for user-based collaborative filtering, meaning the engine recommends recipes liked by users who are similar to me. For ease of reading, I convert the matrix into a data frame.
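
The predict helper used below is not defined in the article; here is a minimal sketch consistent with the description above (it also covers the item-based case used later).

def predict(ratings, similarity, _type='user'):
    # Weighted sum of adjusted ratings, normalized by the total similarity mass
    if _type == 'user':
        # For each user, sum similar users' ratings of each item
        return similarity.dot(ratings) / np.abs(similarity).sum(axis=1)[:, np.newaxis]
    elif _type == 'item':
        # For each item, sum the user's ratings of similar items
        return ratings.dot(similarity) / np.abs(similarity).sum(axis=1)[np.newaxis, :]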

user_pred = predict(train_data_matrix, user_similarity, _type = 'user')
display(user_pred.shape)
display(user_pred)
user_pred_df = pd.DataFrame(user_pred, columns = list(n_items))
user_pred_df.insert(0, 'user_id', list(n_users))

For the item-based prediction, I take the dot product of the train matrix and the item-similarity matrix to get a weighted sum: the prediction for item "i" and user "u" is the sum of the adjusted ratings "u" gave to items similar to "i," each weighted by the similarity between "i" and that item. The weighted sum is again scaled by the sum of the similarities to normalize the prediction. This gives a matrix of relative ratings for item-based collaborative filtering, meaning the engine recommends recipes liked by users who like a recipe that I like. For ease of reading, I convert the matrix into a data frame.

item_pred = predict(train_data_matrix, item_similarity, _type = 'item')
display(item_pred.shape)
display(item_pred)
item_pred_df = pd.DataFrame(item_pred, columns = list(n_items))
item_pred_df.insert(0, 'user_id', list(n_users))

Evaluation of the predictions

To decide which type of memory-based collaborative filtering to use, I evaluate the predictions with RMSE (root mean square error), the square root of the average squared difference between the predicted and actual values, comparing the prediction matrix against the test matrix. I use "mean_squared_error" from scikit-learn to calculate the MSE and then take its square root to get the RMSE. The lower the RMSE, the better; note that RMSE is on the same scale as the ratings, so since I use centered cosine to adjust the ratings, the RMSEs of both types tend to be small. Comparing the two scores, user-based collaborative filtering seems to perform slightly better than item-based for this data set, so I build the recommendation engine with user-based collaborative filtering.

from sklearn.metrics import mean_squared_error
from math import sqrt

def RMSE(prediction, ground_truth):
    # Compare only the entries that actually exist in the test set
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))

user_RMSE = RMSE(user_pred, test_data_matrix)
item_RMSE = RMSE(item_pred, test_data_matrix)
print('user_RMSE = {}'.format(user_RMSE))
print('item_RMSE = {}'.format(item_RMSE))

The recommendation engine

At this step, I build a recommendation engine that recommends a user the top N unrated recipes, sorted by predicted rating in descending order, based on user-based collaborative filtering. The engine then retrieves the names and ingredients of the top N recipes by using the recipe ID as a foreign key between the "df" and "recipe" data frames.

Bug Attention!

Since I assigned new user and recipe IDs to the qualified users and recipes in "df" at previous steps, I have to reverse the new recipe IDs back to the original ones, using the dictionaries again, before I can use the recipe ID as a foreign key to retrieve the names and ingredients of the recommended recipes from "recipe."
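
getRecommendations_UserBased is not shown in the article either; below is a sketch consistent with the description, assuming the objects built above (user_pred, train_data_matrix, recipe, new_recipeID), a hypothetical reverse dictionary old_recipeID, and a default of N=10 recommendations.

# Assumed reverse mapping from new recipe IDs back to the original ones
old_recipeID = {new: old for old, new in new_recipeID.items()}

def getRecommendations_UserBased(user_id, N=10):
    # Predicted relative ratings for this user; new IDs run from 1 to N,
    # so the user's row index is user_id - 1
    scores = pd.Series(user_pred[user_id - 1],
                       index=range(1, user_pred.shape[1] + 1))
    # Drop recipes the user already rated in the training set
    rated = train_data_matrix[user_id - 1].nonzero()[0] + 1
    scores = scores.drop(labels=rated, errors='ignore')
    # Take the top N unrated recipes and map back to the original IDs
    top_ids = [old_recipeID[i] for i in scores.nlargest(N).index]
    return recipe[recipe['recipe_id'].isin(top_ids)][['name', 'ingredients']]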

R1_UserBased = getRecommendations_UserBased(702)
R2_UserBased = getRecommendations_UserBased(408, 5)
R3_UserBased = getRecommendations_UserBased(204, 7)

Conclusion and limitations

In general, I downloaded a pre-existing dataset for Food.com from Kaggle. After cleaning, there are around 222,000 rating records in which the user is among the top 7,500 users who give the most reviews and the recipe is among the top 7,500 recipes that receive the most reviews. I used dictionaries to assign new user and recipe IDs to the qualified rows and applied centered cosine to minimize the differences between "hard raters" and "easy raters," then mapped the rating values by their indexes and split the data into train and test sets at a ratio of three to one.

I used cosine similarity to calculate the similarities between all qualified users and between all qualified recipes, computed predicted ratings with both user-based and item-based collaborative filtering, and evaluated the predictions with RMSE. Since user-based collaborative filtering had a slightly better RMSE for this data set, I used it to build the recommendation engine. Because the new recipe IDs are reversed back to the original ones before being used as a foreign key, the engine can recommend a user the top N unrated recipes sorted by predicted rating in descending order and retrieve their names and ingredients from the recipe data frame. In this way, qualified users get recommended recipes based on similar users' ratings.

There are two major limitations to this project. First, due to insufficient memory for computing the cosine similarity, I selected around 222,000 records as qualified records, about one-third of the total. Since these records only involve the top 7,500 users who give the most reviews and the top 7,500 recipes that receive the most reviews, there is no cold start in this project, which may be a plus. Second, the data is sparse: among the 7,481 unique users and 7,500 unique recipes in the selected rows, the users rate only about 30 recipes on average, and the recipes receive only about 30 ratings on average. Usually, the sparser the data, the worse the prediction performance.

References

“Item-to-Item Based Collaborative Filtering.” GeeksforGeeks, 16 July 2020, https://www.geeksforgeeks.org/item-to-item-based-collaborative-filtering/.

Li, Shuyang, and Bodhisattwa Majumder. “Food.Com Recipes and Interactions.” Kaggle, 2019, https://www.kaggle.com/shuyangli94/food-com-recipes-and-user-interactions?select=RAW_recipes.csv.

“RMSE — Root Mean Square Error in Python.” AskPython, https://www.askpython.com/python/examples/rmse-root-mean-square-error. Accessed 18 Apr. 2021.

Sarwar, Badrul, et al. “Item-Based Collaborative Filtering Recommendation Algorithms.” Proceedings of The 10th International Conference on World Wide Web, 2001, http://files.grouplens.org/papers/www10_sarwar.pdf.

Spark, Cambridge. “Implementing Your Own Recommender Systems in Python.” CAMBRIDGE SPARK, 23 Jan. 2020, https://blog.cambridgespark.com/nowadays-recommender-systems-are-used-to-personalize-your-experience-on-the-web-telling-you-what-120f39b89c3c.
