What to Drink Next? — A Simple Beer Recommendation System using Collaborative Filtering

Medford Xie
Jun 3 · 9 min read
Typical Beer Section at WholeFoods

If you have ever found yourself staring at a wall of beers at your local supermarket for over 10 minutes, scouring the Internet on your phone for reviews of obscure beer names, you are not alone. As a beer enthusiast and home brewer, I’m always looking for something new to try, but I also dread being disappointed, so I often spend too much time looking up a particular beer across several websites for some reassurance that I’m making a good choice.

That was my motivation for this project: to use machine learning and readily available data to build a beer recommendation engine that can curate a customized list of recommendations in milliseconds. The repo for this project can be found here.


Before getting into the notebook, let’s quickly review collaborative filtering.

Source: researchgate.net

For those who are unfamiliar with Collaborative Filtering (CF), CF uses past transactions or user behavior to find connections or similarities between users and products. In essence, you are trying to find items that are similar to one another based on how your collective user base has rated them, or users who are similar to one another based on the items they have both rated similarly.

In this project I use two different approaches: a neighborhood approach (kNN) and a latent factors approach (SVD). For more information about these two methods, you can check out my three-part blog series.

Neighborhoods Approach using Surprise

The first step was to gather the data. Using Python and BeautifulSoup, I scraped all of my data from BeerAdvocate, a popular beer review website. The data was stored in a MongoDB database hosted on AWS. The models were trained using the scikit-learn and Surprise libraries.

I ended up scraping all of the top-rated beers (top 100 per subcategory) across 7 categories of beer styles. The resulting data covered 56 subcategories of beers; 4,964 unique beers; approximately 88 thousand unique users (reviewers); and approximately 1.4 million user-review pairings (e.g. User1: Rating1).

User-review pairs

To improve the recommender’s results, I removed the beers in the bottom 10 percent by average rating and those in the bottom 10 percent by review count. This removed about 20% of the beers, bringing the unique beer count from 4,964 to 3,959. I want to note that in the Surprise library, the “shrinkage” parameter of the pearson_baseline similarity takes into account how many ratings a pair of items (or users) has in common, and discounts similarities that rest on very few reviews. Therefore this filtering step is not critical.

The Surprise library is a well-documented (if you know where to look) and intuitive library for building recommender systems. In this case, I used the KNNBaseline prediction algorithm. It is called a prediction algorithm because it estimates missing ratings, starting from a baseline estimate (see the Surprise documentation or my blog for more info). I will cover the key elements of the implementation here.

The first step maps each beer and user name to a unique ID. Here the IDs are integers, because the index values of the dataframe are integers.

The read_item_names( ) function is used to convert the input beer names to raw ids, and vice versa.
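The original helper is not reproduced in this text, so here is a minimal sketch of what such a lookup could look like. The signature (taking a list of beer names) and the use of list positions as raw ids are my assumptions for illustration, not necessarily how the project's version works.

```python
def read_item_names(beer_names):
    """Build raw-id <-> name lookups for a sequence of beer names.

    Assumption: raw ids are simply integer positions, mirroring an
    integer dataframe index.
    """
    rid_to_name = {}
    name_to_rid = {}
    # dict.fromkeys drops duplicates while preserving first-seen order
    for raw_id, name in enumerate(dict.fromkeys(beer_names)):
        rid_to_name[raw_id] = name
        name_to_rid[name] = raw_id
    return rid_to_name, name_to_rid


# toy input with one duplicate name
rid_to_name, name_to_rid = read_item_names(
    ["Enjoy By IPA", "Sculpin IPA", "Lunch", "Sculpin IPA"]
)
```

Keeping both dictionaries makes it cheap to go from a beer name typed by a user to the id the model was trained on, and back again when displaying results.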

The get_rec( ) function returns the k nearest recommendations based on item similarity, after the model has been trained. Each raw id is mapped to a unique integer called an inner id, which is the form Surprise works with internally.
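A sketch of such a function, following the pattern in the Surprise FAQ, might look like this. The parameter names and the two lookup dictionaries are assumptions of mine; the calls to_inner_iid, get_neighbors and to_raw_iid are the Surprise methods that the FAQ example relies on, and algo is assumed to be an already fitted kNN model.

```python
def get_rec(algo, beer_name, name_to_rid, rid_to_name, k=10):
    """Return the names of the k nearest neighbors of beer_name.

    algo is assumed to be a fitted Surprise kNN algorithm trained with
    item-item similarity; the two dicts map names to raw ids and back.
    """
    # name -> raw id -> inner id used by Surprise
    raw_id = name_to_rid[beer_name]
    inner_id = algo.trainset.to_inner_iid(raw_id)

    # inner ids of the k most similar items
    neighbor_inner_ids = algo.get_neighbors(inner_id, k=k)

    # inner id -> raw id -> name
    neighbor_raw_ids = [algo.trainset.to_raw_iid(i) for i in neighbor_inner_ids]
    return [rid_to_name[rid] for rid in neighbor_raw_ids]
```

Something like get_rec(algo, 'Enjoy By IPA', name_to_rid, rid_to_name, k=20) would then return a list of 20 beer names.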

The code above was adapted from the Surprise FAQ documentation.

2. Train and Evaluate Model

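from surprise import Dataset, KNNBaseline, Reader
# ratings are on a 1-to-5 scale
reader = Reader(rating_scale=(1, 5))
# load the (user, item, rating) columns from the dataframe
data = Dataset.load_from_df(merged_df2[['userID', 'beerID', 'rating']], reader)
# train on every available rating
trainset = data.build_full_trainset()
# item-item similarity using the Pearson baseline measure
sim_options = {'name': 'pearson_baseline', 'user_based': False}
algo = KNNBaseline(sim_options=sim_options)
algo.fit(trainset)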

Running the code block above trains the model on your data. The Reader class tells Surprise how to interpret the ratings (here, a scale of 1 to 5). Dataset.load_from_df( ) and data.build_full_trainset( ) are built-in Surprise functions that load your entire dataframe and build the training set for you. The sim_options dictionary lets you specify which similarity measure to use, for example Pearson vs. cosine vs. mean squared difference. Setting ‘user_based’ to False indicates item-item similarity rather than user-user similarity.
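To make the idea of a similarity measure concrete, here is a small pure-Python sketch of cosine and Pearson similarity between two beers, computed only over users who rated both. This is a simplification for illustration: the pearson_baseline measure used above centers ratings on baseline estimates rather than plain means, and the beers and ratings below are made up.

```python
from math import sqrt


def common_ratings(item_a, item_b):
    """Return paired ratings from users who rated both items."""
    users = sorted(set(item_a) & set(item_b))
    return [item_a[u] for u in users], [item_b[u] for u in users]


def cosine_sim(item_a, item_b):
    xs, ys = common_ratings(item_a, item_b)
    dot = sum(x * y for x, y in zip(xs, ys))
    norm = sqrt(sum(x * x for x in xs)) * sqrt(sum(y * y for y in ys))
    return dot / norm if norm else 0.0


def pearson_sim(item_a, item_b):
    xs, ys = common_ratings(item_a, item_b)
    n = len(xs)
    if n < 2:
        return 0.0
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y) if std_x and std_y else 0.0


# hypothetical ratings, keyed by user
beer_a = {"u1": 5, "u2": 4, "u3": 2}
beer_b = {"u1": 4, "u2": 3, "u3": 1, "u4": 5}
cos_ab = cosine_sim(beer_a, beer_b)
pearson_ab = pearson_sim(beer_a, beer_b)
```

In this toy case every common user rates the second beer exactly one point lower, so the Pearson similarity is 1 while the cosine similarity is slightly below 1.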

from surprise.model_selection import cross_validate
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Running cross_validate (older Surprise releases exposed this as evaluate( )) performs a 5-fold cross-validation and reports the specified metric(s). This measures how well the algorithm predicts a held-out rating against the actual rating. For this beer recommender, the RMSE (root mean square error) was 0.4 and the MAE (mean absolute error) was 0.28. Not bad given that the ratings are on a scale of 1 to 5.
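For anyone who wants to see what those two metrics compute, here is a minimal sketch in plain Python. The ratings below are made-up numbers, not results from this project.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean square error: penalizes large misses more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: the average size of a miss."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# hypothetical held-out ratings and the model's predictions for them
actual = [4.0, 3.5, 5.0, 2.0]
predicted = [3.5, 3.5, 4.5, 3.0]
toy_rmse = rmse(actual, predicted)
toy_mae = mae(actual, predicted)
```

RMSE is never smaller than MAE on the same predictions, and the gap widens when a few predictions are badly off.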

Essentially, the algorithm takes a sparse ratings matrix (e.g. see below) and estimates the missing values, starting from a baseline estimate (b_ui) for each user-beer pair. In other words, we can now predict how every user would rate every single beer in this matrix. The kNN algorithm then measures the distance between the ratings to determine “closeness” and returns the nearest neighbors.
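For reference, the Surprise documentation describes the item-based KNNBaseline prediction in roughly the following form. The notation is the library's and is not defined elsewhere in this post: mu is the global mean rating, b_u and b_i are user and item biases, and N_u^k(i) is the set of the k items most similar to item i that user u has rated.

```latex
b_{ui} = \mu + b_u + b_i

\hat{r}_{ui} = b_{ui} +
  \frac{\sum_{j \in N^k_u(i)} \mathrm{sim}(i, j)\,(r_{uj} - b_{uj})}
       {\sum_{j \in N^k_u(i)} \mathrm{sim}(i, j)}
```

Read in words: start from the baseline for this user and beer, then adjust it by how much the user's ratings of similar beers deviated from their own baselines, weighted by similarity.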

Figure 1 — Ratings Matrix

Now let’s test out this recommender.

The top 20 nearest neighbors for the “Enjoy By IPA” from Stone Brewing included quite a few other IPAs, but there were also stouts and an amber ale, among others. In this list I recognized 3 IPAs from the same brewery, and “Blazing World” is a unique hoppy amber ale that tastes like an IPA. Not too shabby!

Latent Factors Approach with Truncated SVD

Alternatively, we can use a latent factors approach with a truncated SVD. The Surprise package also has algorithms for SVD and other matrix factorization techniques; however, running a truncated SVD with it would require a deep dive into the code and some modifications, because the default SVD module in Surprise does not allow for one. A truncated SVD is needed because a standard SVD would require too much memory and processing power: with a matrix of about 5K x 88K, it would not be feasible.

First we need to format our data into a pivot table, and fill any NaNs with 0. The truncated SVD does not handle sparse matrices with null values.

user_reviews_df2_pivot = user_reviews_df2.pivot_table(index='user', columns='beer_name', values='rating').fillna(0)

Then we transpose it so that the items become the rows (for item-item similarity).
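Here is a small sketch of the pivot and transpose steps on toy data. The reviews are made up and the variable names are my own, except for T, which is the name the SVD code later in this post expects for the transposed matrix.

```python
import pandas as pd

# hypothetical long-format reviews: one row per (user, beer) rating
reviews = pd.DataFrame({
    "user": ["ann", "ann", "bob", "bob", "cat", "dan"],
    "beer_name": ["Lunch", "Sculpin IPA", "Lunch",
                  "Enjoy By IPA", "Sculpin IPA", "Lunch"],
    "rating": [4.5, 4.0, 5.0, 4.25, 3.5, 4.0],
})

# users as rows, beers as columns, unrated pairs filled with 0
pivot = reviews.pivot_table(
    index="user", columns="beer_name", values="rating"
).fillna(0)

# transpose so that each row is a beer and each column is a user
T = pivot.values.T
```

With 4 users and 3 beers, pivot has shape (4, 3) and T has shape (3, 4). Note that filling with 0 makes an unrated beer look the same as a beer rated 0, which is a known limitation of this simple approach.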

To decide how many components to use for the truncated SVD, I created a simple function that returns the explained variance ratio for each number of components passed in.
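A function along those lines could look like the sketch below. The function name, its parameters and the random toy matrix are my own placeholders; it simply fits one scikit-learn TruncatedSVD per candidate count and sums explained_variance_ratio_.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD


def explained_variance_by_components(matrix, component_counts, random_state=0):
    """Return (n_components, total explained variance ratio) pairs."""
    results = []
    for n in component_counts:
        svd = TruncatedSVD(n_components=n, random_state=random_state)
        svd.fit(matrix)
        results.append((n, float(svd.explained_variance_ratio_.sum())))
    return results


# toy stand-in for the beers-by-users ratings matrix
rng = np.random.default_rng(0)
toy_matrix = rng.integers(0, 6, size=(30, 20)).astype(float)
curve = explained_variance_by_components(toy_matrix, [2, 5, 10])
```

Plotting the second element of each pair against the first gives the kind of curve discussed next. Refitting for every count is wasteful on a large matrix, so in practice you might fit once with the largest count and take a cumulative sum instead.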

Plotting the results shows the relationship between the number of components and the explained variance ratio. The explained variance ratio tells you how much of the variance in the data the model captures.

N-components vs. Explained Variance Ratio

As you can see, the number of components needed to achieve incremental gains in explained variance grows exponentially. In other words, a small number of latent features (components) captures the majority of the variance.

Based on the above, I decided to use 200 components for this truncated SVD. I believe this is fair given that there are only about 3,900 unique beers (after removing the bottom 10 percent by average rating and by review count) across 56 subcategories, and at 200 components the model captures about two-thirds of the variance. Furthermore, I find it hard to imagine the common beer drinker being able to differentiate between 200 different “styles” of beer.


Side Note on Latent Features

Each component is a latent feature that a particular beer or user has an affinity for. Given that there are only about 3,900 beers, that works out to roughly 20 beers per latent factor (3,900 divided by 200).

Another way to think about this is to consider how many different ways you can classify or describe the types of beers that are available. Even within the 56 subcategories, many are very similar to one another, and most people probably wouldn’t be able to tell the difference. However, these latent factors encapsulate features that are hard to explain. For example, a particular latent factor could be characterized as “slightly hoppy + low alcohol content + light bodied + high carbonation” or as a particular subcategory (e.g. West Coast IPA). So you might think there is an infinite number of combinations, but this is not the case.

If you think about the different criteria for characterizing beers, they may be:
* bitterness (or hoppiness)
* sweetness level
* carbonation level
* aroma
* body of the beer

Let’s say there are 3 levels for each criterion (high, mid, low). Then there are 243 combinations (3 to the 5th power), and that assumes the criteria are independent of each other, which is not the case (hoppy beers tend to have a strong aroma, sweeter beers have a fuller body).
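The count is easy to check by enumerating every profile. The level labels here are just the three from the paragraph above.

```python
from itertools import product

criteria = ["bitterness", "sweetness", "carbonation", "aroma", "body"]
levels = ["high", "mid", "low"]

# every possible assignment of one level to each criterion
profiles = list(product(levels, repeat=len(criteria)))
print(len(profiles))  # 243
```

Each element of profiles is a 5-tuple such as ('high', 'mid', 'low', 'low', 'high'), one level per criterion in the order listed.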


The following code block performs the matrix decomposition and dimensionality reduction, and gives us a similarity matrix that we can use to find the “closest” neighbors.

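import numpy as np
from sklearn.decomposition import TruncatedSVD

# create the SVD (T is the transposed beers-by-users ratings matrix from above)
SVD200 = TruncatedSVD(n_components=200, random_state=200)
matrix200 = SVD200.fit_transform(T)
# get the similarity matrix (pairwise Pearson correlation between beers)
corr200 = np.corrcoef(matrix200)
# get list of all beer names
beer_rec_names200 = merged_df2_pivot.columns
beer_rec_list200 = list(beer_rec_names200)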

The following function is used to find the top n neighbors given an input beer name.
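As a rough sketch of how such a lookup can work, the function below takes a correlation matrix and the matching list of beer names and returns the n most correlated beers. The function name and its parameters are my own, and the 4-by-4 matrix is a made-up example rather than output from the model.

```python
import numpy as np


def top_n_similar(beer_name, corr, beer_names, n=10):
    """Return (name, correlation) pairs for the n beers closest to beer_name."""
    idx = beer_names.index(beer_name)
    scores = np.asarray(corr)[idx]
    # sort all beers from most to least correlated
    order = np.argsort(scores)[::-1]
    # drop the query beer itself, then keep the first n
    neighbors = [i for i in order if i != idx][:n]
    return [(beer_names[i], float(scores[i])) for i in neighbors]


# hypothetical correlation matrix for four beers
names = ["A", "B", "C", "D"]
corr = [
    [1.0, 0.9, 0.1, 0.5],
    [0.9, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
]
closest = top_n_similar("A", corr, names, n=2)
```

With the variables from the code block above, the equivalent call would be something like top_n_similar('Enjoy By IPA', corr200, beer_rec_list200, n=20).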

Let’s test this out on the same beer as before.

This list is very different from the one that the kNN method yielded, with only 3 beers showing up in both lists: ‘Lunch’, ‘Sculpin IPA’, and ‘Grapefruit Sculpin IPA’.

To compare the two methods, I took a subset of beers (every tenth beer, sorted from highest to lowest average rating) and plotted the number of overlapping beers returned by both methods in the top 50 recommendations.

As you can see, out of the top 50 recommendations, the overlap between the two methods ranged from 0 to 20 beers, which translates to 0–40%. So the two methods return very different results, and this is expected because an SVD captures relationships that the nearest neighbor approach does not.
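Counting the overlap between two recommendation lists takes only a few lines. In this sketch the function name is my own and the two short lists are placeholders: the three shared beers are the ones mentioned earlier, and the rest are invented.

```python
def overlap(recs_a, recs_b, top_n):
    """Return (count, share) of beers appearing in both top-n lists."""
    shared = set(recs_a[:top_n]) & set(recs_b[:top_n])
    return len(shared), len(shared) / top_n


# placeholder lists standing in for the kNN and SVD recommendations
knn_recs = ["Lunch", "Sculpin IPA", "Grapefruit Sculpin IPA", "Beer A", "Beer B"]
svd_recs = ["Sculpin IPA", "Beer C", "Lunch", "Beer D", "Grapefruit Sculpin IPA"]
count, share = overlap(knn_recs, svd_recs, top_n=5)
```

Here 3 of the 5 beers are shared, a share of 0.6. By the same arithmetic, 20 shared beers out of 50 gives the 40% upper bound quoted above.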

Conclusion:

Building a collaborative filtering recommender system can be quite simple. The Surprise library is a great module to use. You can use a neighbors approach (kNN) or a latent factors approach (SVD). The two methods may yield very different results, but this can actually be a good thing, since you can combine both sets of results to create a sort of hybrid recommender. Collaborative filtering is great if you want your recommendation engine to provide diverse or varied results. Otherwise you can just use a content-based recommender, which in this case would recommend mostly beers within the same subcategory. Combining results from two or all three (content-based, kNN, SVD) would probably be your best bet.

References: