Fetching Better Beer Recommendations with Collie (Part 1)

Nate Jones
10 min read · May 4, 2021


Getting data, training a model, and talking about beer!

Part 1 (you are here) | Part 2 | Part 3

An image of the Collie dog looking at glasses of beers with a question mark thought.

TL;DR — I talk about ShopRunner’s latest open source library, Collie [GitHub, PyPI, Docs], for training and evaluating deep learning recommendations systems. We use Collie to train a model to recommend beers and see some recommendations for a user. But can we do better?

This is the first blog post in a three-part series on Collie. Each blog will build on the previous one, so it’s highly recommended you read them in order (and code along in a Jupyter notebook if you’d like). If you’d like to skip ahead to a subsequent blog post, you can find all the code used in these posts in the GitHub Gist here.

The World of Beer & Acronyms

About a year and a half ago, I finally turned 21, meaning I could now walk into any alcohol-serving establishment and legally order a drink. Feeling completely liberated from the adolescent chains holding me back, I walked up to the bar, looked the bartender in the eyes, opened my mouth, and… had no idea what to order.

It turns out that in the real world, there are many, many kinds of drinks, and even a seemingly safe bet like beer is, in fact, a loaded question, with thousands of different options to choose from.

To truly understand beer means understanding a plethora of acronyms. IPA? ABV? IBU? OG? SRM? Is BEER, itself, an acronym?! I learned soon enough that choosing any beer (let alone a good one) was truly a daunting task.

Luckily for me, we live in the age of technology. Why should I actually spend the time learning about the nuances of beer when I could just leverage a machine learning algorithm to recommend the best beers to me automatically?!

A four-paneled meme with two people talking. The first panel, a person says “I want an ML model to recommend a beer for me.” The second panel, a person replies, asking “Why not just ask the bartender?” while handing the first person a filled manilla envelope. The original person replies while setting the folder on fire, “I don’t want to ask the bartender. I want an ML model to recommend a beer for me.”

Recommendations 101

At ShopRunner, making quality recommendations to our members is incredibly important. We do this in a number of ways, mainly through recommending members new products based on their order or viewing history, or recommending new products based on an item they are currently looking at.

Recommendations for gowns. The seed image is a fancy gown, and we show four recommendations for gowns.
Standard recommendations — if you’re looking at a gown, you might also like to view these other gowns.
Recommendations for pants. The seed image shows casual pants, and the recommendations include more casual pants, with one recommendation being a belt.
Notice the belt recommended here, which is often bought alongside these pants.
Recommendations for sports apparel. The seed item is a hoodie for the Golden State Warriors team. All recommendations are various apparel for the Golden State Warriors, including shorts, jerseys, shirts, and hoodies.
Even without explicitly knowing what team the clothing is for, our Collie model is smart enough to pick up on these patterns and only recommend Golden State Warriors merchandise. Nice!

Today, we are able to do all of this with a single, deep learning recommendations model, using a library we created in-house, Collie, which we have recently open sourced!

To show how powerful Collie is for a wide variety of recommendations tasks, we’ll use it today to solve my existential beer crisis.

The Data

While effective, it wouldn’t be very efficient for me to try to rate every beer ever made to find which ones I like and dislike. Luckily, there exist sites like BeerAdvocate and RateBeer that host thousands of reviews of people drinking and rating different beers. We can use data scraped from these sites to learn from other people’s reviews and generalize this to make new recommendations for users.

An image of our dataset in a text format. Fields include beer name, beer ID, brewer ID, ABV, style, appearance, aroma rating, palate rating, taste rating, overall rating, time, profile name, and review text.
Here are two example reviews from our dataset before any preprocessing.

While we could just use each assigned numerical rating from 1–10 at face value, there usually aren’t many data sources like this in the “real world.” At ShopRunner, for example, our data does not show whether a user viewed a product and loved it, or viewed a product and hated it; we only know that they viewed it. This general idea extends to other data sources as well: the systems behind most Google search results, iTunes film suggestions, and Instagram post orderings don’t really know if the user loved, liked, disliked, or hated the results. For these data sources, we mostly have to rely on a binary label (interacted with / did not interact with) known as the indicator. It’s a lot easier to build a recommendations model when we have ratings, which is why so many recommendations blog posts and papers assume this data format.

When we don’t have the explicit rating a user gives an item, this is known as implicit data, which Collie just so happens to excel in! Thus, for this blog post, we’ll go an extra step and challenge ourselves not to rely on the numerical rating in the model, but only the indicator of whether or not this user tried this beer.

A GIF of Princess Leia from Star Wars shaking her head, saying “You make it so difficult sometimes.”

Preprocessing the data is simple enough; the only challenge comes with parsing the raw text format into something less… bad.
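The original parsing snippet lives in the Gist linked above. As a rough sketch of the idea, assuming each review is a run of `key: value` lines and a blank line separates reviews (the key names and sample rows below are made up for illustration, not copied from the dataset):

```python
from typing import Dict, Iterable, List


def parse_reviews(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Group ``key: value`` lines into one dictionary per review.

    Assumes a blank line separates consecutive reviews.
    """
    reviews: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            # A blank line closes the review we have been building.
            if current:
                reviews.append(current)
                current = {}
            continue
        # Split on the first ": " only, since review text may contain colons.
        key, separator, value = line.partition(": ")
        if separator:
            current[key] = value
        # Lines without a ": " separator are skipped.
    if current:
        # The input may not end with a trailing blank line.
        reviews.append(current)
    return reviews


# Hypothetical sample in the assumed layout (not real rows from the dataset).
sample = """\
beer/name: Example Stout
beer/beerId: 1
review/overall: 8/10
review/profileName: user_a
review/text: Roasty: would drink again.

beer/name: Example IPA
beer/beerId: 2
review/overall: 6/10
review/profileName: user_b
review/text: Very hoppy.
"""

reviews = parse_reviews(sample.splitlines())
print(len(reviews), reviews[0]["beer/name"])  # prints: 2 Example Stout
```

The result is a plain list of dictionaries, one per review, which is easy to inspect before doing anything fancier with it.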

And then turning that list of dictionaries into something even more less… bad.

Once the data is in a familiar Pandas DataFrame format, we can use some Collie helper functions to convert the explicit data to implicit, then drop any users who made fewer than two reviews (since we can’t really learn a user’s preference from a user who only reviewed a single beer, and we’re going to create two datasets, a train and validation, to ensure our model generalizes well).
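Collie ships helper functions for both of these steps. To show what they amount to conceptually, here is a library-free sketch that works on plain tuples rather than a DataFrame. The function names and sample data are hypothetical, and keeping every review regardless of its rating is an assumption based on the "tried this beer or not" indicator described above.

```python
from collections import Counter
from typing import List, Tuple

Review = Tuple[str, str, float]  # (user, beer, rating)
Interaction = Tuple[str, str]    # (user, beer)


def to_implicit(explicit: List[Review]) -> List[Interaction]:
    """Drop the rating and keep one (user, beer) pair per unique combination."""
    seen = set()
    implicit: List[Interaction] = []
    for user, beer, _rating in explicit:
        if (user, beer) not in seen:
            seen.add((user, beer))
            implicit.append((user, beer))
    return implicit


def drop_sparse_users(
    pairs: List[Interaction], min_interactions: int = 2
) -> List[Interaction]:
    """Keep only users with at least ``min_interactions`` interactions."""
    counts = Counter(user for user, _beer in pairs)
    return [(user, beer) for user, beer in pairs if counts[user] >= min_interactions]


# Made-up reviews: "ann" reviewed the stout twice, "bob" reviewed one beer.
explicit_reviews = [
    ("ann", "stout", 9.0),
    ("ann", "ipa", 3.0),
    ("ann", "stout", 8.0),
    ("bob", "ipa", 7.0),
]

implicit_pairs = drop_sparse_users(to_implicit(explicit_reviews))
print(implicit_pairs)  # prints: [('ann', 'stout'), ('ann', 'ipa')]
```

Note that the low rating "ann" gave the IPA is gone: in the implicit setting, all the model sees is that she tried it.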

At this point, we have a dataset with 3,679,058 reviews between 60,786 users and 160,803 beers.

Now, we can easily use Collie to create an Interactions object, which is the core of how data loading and retrieval works in Collie models. In short, Interactions is an efficient, simple, data-quality-checking PyTorch Dataset. You can read more about Collie’s Interactions objects here.

This is also the expected format for Collie’s data splits, models, and evaluation metrics, so it’s important we create this now.

Since the data is now in an Interactions format, we can easily split the data using a stratified split such that every user appears in both the train and validation datasets at least once. And luckily for us, this function is built right into Collie!
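Collie's built-in function handles this for Interactions objects. The underlying idea can be sketched without the library: split each user's interactions separately, so that nobody ends up entirely on one side. The function name, the default fraction, and the sample data below are hypothetical, and this is a sketch of the concept rather than Collie's implementation.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Interaction = Tuple[str, str]  # (user, item)


def stratified_split_sketch(
    pairs: List[Interaction], holdout_fraction: float = 0.1, seed: int = 42
) -> Tuple[List[Interaction], List[Interaction]]:
    """Split per user so every user lands in both resulting datasets."""
    rng = random.Random(seed)
    by_user: Dict[str, List[str]] = defaultdict(list)
    for user, item in pairs:
        by_user[user].append(item)

    train: List[Interaction] = []
    holdout: List[Interaction] = []
    for user, items in by_user.items():
        if len(items) < 2:
            raise ValueError(f"user {user!r} needs at least two interactions")
        shuffled = items[:]
        rng.shuffle(shuffled)
        # Hold out at least one item, but always leave at least one for training.
        n_holdout = min(
            max(1, round(len(shuffled) * holdout_fraction)), len(shuffled) - 1
        )
        holdout.extend((user, item) for item in shuffled[:n_holdout])
        train.extend((user, item) for item in shuffled[n_holdout:])
    return train, holdout


# Made-up interactions for three users with 2, 5, and 20 beers each.
pairs = (
    [("ann", f"beer_{i}") for i in range(2)]
    + [("bob", f"beer_{i}") for i in range(5)]
    + [("cat", f"beer_{i}") for i in range(20)]
)
train_pairs, holdout_pairs = stratified_split_sketch(pairs)
print(len(train_pairs), len(holdout_pairs))  # prints: 23 4
```

This is also why users with a single review were dropped earlier: with only one interaction, there is nothing left to put on the other side of the split.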

And… that’s it! Wasn’t too bad, right?! Let’s go ahead and finally train a model!

A GIF from the show Kim’s Convenience. Two men sit at a counter and look out the window, saying “And so it begins.”

Training an Implicit Recommendations Model

We’ll start here with one of the simplest collaborative filtering model architectures: matrix factorization using a dot product. This model is incredibly simple, with only four layers in the entire model. For this example, every user and every item will get its own length-30 vector of numbers known as an embedding (the length is an arbitrary decision and could be bigger or smaller if you’d like) and a single number known as the bias term. Over time, the model will learn each user’s taste preferences and each item’s appeal to different users, and represent that information through the embeddings. It then adjusts the final recommendation scores slightly with the bias terms: if an item has high mass appeal to many users, our model might learn to give this item a higher bias so its recommendation score is boosted a bit in most situations.

An architecture diagram showing how matrix factorization with a dot product works. We have a table for user embeddings, a table for user biases, a table for item embeddings, and a table for item biases. We take a user’s row and an item’s row and dot product the two embeddings together and add the biases to get the final recommendations score.
This is actually the second draft of this image. The first one was completely drawn by me, which my partner said looked like something a three-year-old drew. So thank you to Jen for illustrating this second draft.
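In code, the scoring step in the diagram boils down to a dot product plus two biases. Here is a tiny, library-free sketch with made-up numbers and length-3 embeddings instead of length-30 ones to keep it readable (the function name is hypothetical and this is not Collie's implementation):

```python
from typing import Sequence


def recommendation_score(
    user_embedding: Sequence[float],
    item_embedding: Sequence[float],
    user_bias: float,
    item_bias: float,
) -> float:
    """Dot product of the two embeddings, plus the user and item biases."""
    if len(user_embedding) != len(item_embedding):
        raise ValueError("embeddings must have the same length")
    dot = sum(u * i for u, i in zip(user_embedding, item_embedding))
    return dot + user_bias + item_bias


# Made-up values for a single user and a single beer.
example_score = recommendation_score(
    user_embedding=[1.0, 0.5, -2.0],
    item_embedding=[0.2, 0.4, 0.1],
    user_bias=0.1,
    item_bias=0.3,
)
print(round(example_score, 3))  # prints: 0.6
```

Here the dot product contributes 0.2 + 0.2 - 0.2 = 0.2 and the biases add another 0.4. Training is just the process of nudging all of these numbers so that beers a user has tried score higher than beers they have not.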

We can instantiate and train Collie models with a few lines of code:

Yup — that’s it! Collie utilizes PyTorch Lightning to train all models, meaning that with a few arguments alone, we can specify the device we train on (CPU, GPU, or TPU), how we log model results, how we checkpoint the model, and so much more (see PyTorch Lightning docs here for all supported options we can use).
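To make concrete what a training run like this is optimizing, here is a toy, library-free sketch of matrix factorization on implicit data. It is not Collie's implementation: Collie trains through PyTorch Lightning with its own loss functions, while this sketch swaps in plain stochastic gradient descent with a logistic loss and one uniformly sampled negative per observed interaction. All names (`TinyMF`, `train_sketch`) and the toy data are hypothetical.

```python
import math
import random
from typing import Dict, Set


def sigmoid(x: float) -> float:
    # Clamp the input so math.exp cannot overflow on extreme scores.
    x = max(-30.0, min(30.0, x))
    return 1.0 / (1.0 + math.exp(-x))


class TinyMF:
    """Toy matrix factorization: embeddings plus biases, scored by dot product."""

    def __init__(self, n_users: int, n_items: int, dim: int = 4, seed: int = 0):
        rng = random.Random(seed)
        self.user_emb = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_users)]
        self.item_emb = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_items)]
        self.user_bias = [0.0] * n_users
        self.item_bias = [0.0] * n_items

    def score(self, user: int, item: int) -> float:
        dot = sum(u * i for u, i in zip(self.user_emb[user], self.item_emb[item]))
        return dot + self.user_bias[user] + self.item_bias[item]

    def sgd_step(self, user: int, item: int, label: float, lr: float) -> None:
        # For the logistic loss, d(loss)/d(score) = sigmoid(score) - label.
        error = sigmoid(self.score(user, item)) - label
        u_vec, i_vec = self.user_emb[user], self.item_emb[item]
        for d in range(len(u_vec)):
            # Tuple assignment so both updates use the pre-update values.
            u_vec[d], i_vec[d] = (
                u_vec[d] - lr * error * i_vec[d],
                i_vec[d] - lr * error * u_vec[d],
            )
        self.user_bias[user] -= lr * error
        self.item_bias[item] -= lr * error


def train_sketch(
    model: TinyMF,
    seen: Dict[int, Set[int]],
    n_items: int,
    epochs: int = 100,
    lr: float = 0.1,
    seed: int = 0,
) -> None:
    rng = random.Random(seed)
    for _ in range(epochs):
        for user, items in seen.items():
            unseen = [i for i in range(n_items) if i not in items]
            for item in sorted(items):
                model.sgd_step(user, item, 1.0, lr)                # observed pair
                model.sgd_step(user, rng.choice(unseen), 0.0, lr)  # sampled negative


# Toy data: users 0-2 only tried beers 0-2, users 3-5 only tried beers 3-5.
seen = {u: ({0, 1, 2} if u < 3 else {3, 4, 5}) for u in range(6)}
model = TinyMF(n_users=6, n_items=6)
train_sketch(model, seen, n_items=6)

# Compare a beer user 0 tried (1) against one they did not (4).
print(model.score(0, 1) > model.score(0, 4))
```

On this deliberately easy toy data, the beers a user tried should end up scoring higher than the ones they did not. The real thing differs mainly in scale and in everything the library handles for you: batching, devices, logging, and checkpointing.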

Evaluating Our Model Using Numbers

Collie also has some common implicit evaluation metrics built-in, including Mean Average Precision at K (MAP@K), Mean Reciprocal Rank (MRR), and Area Under the ROC Curve (AUC), that support using the GPU to do all the heavy lifting quickly and efficiently. Again, evaluating the model is simple with Collie. Here, we’ll only use MAP@10 for evaluation, but each metric listed prior shares the same API.

Rather than show the gnarly formula for MAP@10, you can think of this number as a proxy for how well our model is performing. MAP@10 is similar to the percentage of time we expect someone has tried at least one of the ten beers our model recommends for that user, but goes a bit deeper than this by considering the order of the recommendations as well. For example, if we have two models correctly recommend a certain beer to a user, but one model presents this as the second recommendation and the other model presents this as the ninth recommendation, our first model would have a higher MAP@10.

It’s not necessary to understand the formula exactly; gaining an intuition for the metric is enough.
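For readers who do want to see the mechanics, here is a minimal, library-free sketch of average precision at K for a single user, averaged over users to get MAP@K. It is not Collie's GPU implementation. Dividing by the smaller of K and the number of relevant items is one common normalization and is an assumption here, and the function names and sample items are made up.

```python
from typing import List, Sequence, Set


def average_precision_at_k(
    recommended: Sequence[str], relevant: Set[str], k: int = 10
) -> float:
    """Average precision over the top-k recommendations for one user."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            # Precision at this rank, counted only where a hit occurs.
            precision_sum += hits / rank
    denominator = min(k, len(relevant))
    return precision_sum / denominator if denominator else 0.0


def map_at_k(
    all_recommended: List[Sequence[str]], all_relevant: List[Set[str]], k: int = 10
) -> float:
    """Mean of the per-user average precision scores."""
    scores = [
        average_precision_at_k(recommended, relevant, k)
        for recommended, relevant in zip(all_recommended, all_relevant)
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Two hypothetical models that both find the one beer this user tried,
# one at rank 2 and the other at rank 9.
relevant = {"target"}
model_a = ["x1", "target", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"]
model_b = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "target", "x9"]

score_a = average_precision_at_k(model_a, relevant)  # 1/2
score_b = average_precision_at_k(model_b, relevant)  # 1/9
print(score_a > score_b)  # prints: True
```

Both models got the same beer right, but the one that ranked it second scores 0.5 while the one that ranked it ninth scores about 0.11, which is exactly the ordering effect described above.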

An image showing example recommendation results and their resulting MAP@10 score. Of particular note, the first row all recommendations are wrong, and the score is 0. The second row has the first seven recommendations wrong and the last three right, and the score is 0.065. The third-to-last row shows the first three recommendations being right and the last seven being wrong, and the score is 0.340. When all ten recommendations are correct, the score is 1.
Note that in both the second and third-to-last rows, our model correctly recommends three out of ten items. However, order matters, so this difference in the order we recommend the items dramatically affects our MAP@10 score.

Some things to note:

  • A perfect model is able to recommend the exact right beers for every single one of our users, starting with the beer that user is most likely to be interested in. This perfect model would get a MAP@10 score of 1. While it’s nice to daydream about perfection, the truth is that recommendations are very, very difficult, because humans are very, very unpredictable. You may have a user who only drinks Pilsners, who just happens to hate Stella Artois (another Pilsner) for seemingly no reason.
A GIF of the character Teddy from Brooklyn 99 at a dinner table saying, “I never need a drink menu. I got the thrills for the pils. ’Cause I’m a pilsner man.”
Am I… turning into Teddy?!?
  • Not only that, but if we recommend a new beer that our user might actually like, but they just haven’t tried yet, our model gets a lower MAP@10 score, since our test dataset only has a certain number of beers for each user. It would be pretty unfeasible to go to each user in our dataset, have them try the beer our model recommends, jot down their review, and then evaluate the model using this instead.
  • Lastly, we have a search space of over 160K beers — narrowing that huge search space down to just ten beers personalized to every user is challenging. A model that is just guessing ends up getting a MAP@10 score of 0.000006 — which is low!

All of this to say, MAP@10 scores tend to look worse than the model really is. To show just how much our model is learning, I’ve included the MAP@10 score for an untrained, randomly initialized model that is essentially guessing recommendations for each user. Remember that higher scores are better!

A table with two columns: one for Model name and one for MAP@10 score. The first row has an untrained model that is randomly initialized with a MAP@10 score of 0.00001. The second row shows a standard matrix factorization model with a MAP@10 score of 0.01364. Going forward, we will refer to this table as the “MAP@10 table.”
WAY better than an untrained, randomly initialized model.

Seeing Some Recommendations

Clearly, our matrix factorization model is doing a lot better than the untrained, randomly initialized model (as evidenced by the nearly 1,400x boost in MAP@10 scores), but a number alone is not enough to ensure that our model is actually learning. For this, let’s find a user in our dataset who has interacted with beers in the past, and do them the favor of recommending new beers for them.

So let’s make a small script to do just that! The code snippet below selects a random user, looks at the beers they have previously interacted with, then makes some new beer recommendations for them using our previously-trained matrix factorization model, free of charge!
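The original script is in the Gist linked above. The core of any script like it is ranking every beer the user has not tried by the model's score. Here is a minimal, library-free sketch of that filtering and ranking step, assuming the scores for one user have already been computed (the function name, beer names, and scores are made up for illustration and are neither Collie's API nor real model output):

```python
from typing import Dict, List, Set


def recommend_new_items(
    scores: Dict[str, float], already_seen: Set[str], k: int = 5
) -> List[str]:
    """Return the k highest-scoring items the user has not interacted with."""
    candidates = [item for item in scores if item not in already_seen]
    candidates.sort(key=lambda item: scores[item], reverse=True)
    return candidates[:k]


# Made-up scores for a single user across a tiny catalog.
user_scores = {
    "beer_a": 2.4,
    "beer_b": 0.3,
    "beer_c": 1.9,
    "beer_d": -0.5,
    "beer_e": 1.1,
}
tried = {"beer_a", "beer_e"}

print(recommend_new_items(user_scores, tried, k=2))  # prints: ['beer_c', 'beer_b']
```

Filtering out beers the user already tried matters: without it, the top of the list would mostly echo their own history back at them ("beer_a" has the highest score above, but they have already had it).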

I randomly sampled user 53399, who has reviewed a total of 82 beers (which is somehow on the low end: one user reviewed 2,309 beers. HOW?!). A random five of these 82 reviewed beers include:

  • Founders Imperial Stout (a Russian Imperial Stout)
  • Founders Breakfast Stout (an American Double / Imperial Stout)
  • Bengali Tiger (an American IPA)
  • Brooklyn BAMboozle Ale (a Belgian Pale Ale)
  • Jai Alai IPA (an American IPA)

Sadly, I’m not enough of a beer expert to have tried any of these beers, but we can tell a few things about this list at first glance. About a third of the 82 beers this user has tried are American beers, the average ABV of the beers is on the higher side at ~7.38%, and this user seems to have an affinity for IPAs.

Trying 82 of anything automatically makes you an expert in that area in my book, so our recommendations need to be top notch for this expert beer taster. Our Collie model says the following new beers should be recommended to this user:

  1. 90 Minute IPA (an American Double / Imperial IPA)
  2. Brooklyn Black Chocolate Stout (a Russian Imperial Stout)
  3. HopDevil Ale (an American IPA)
  4. Sierra Nevada Celebration Ale (another American IPA)
  5. Stone Ruination IPA (another American Double / Imperial IPA)

Notice our recommendations tend to mostly be American beers and IPAs, something we know this user has an affinity for. Also, the #4 beer in the list of beers our user has interacted with is made by the same brewer as our #2 recommended beer. Even though our Collie model has no way to know the beer name, style, or brewer yet, it is able to pick up on some incredible patterns, meaning that our beer-loving user 53399 has a few more beers to review!

What’s Next?

Alright, so we now have a model and it seems to be learning something! But, we can probably do better, right?

A GIF of Kristen Bell saying “Spoiler Alert.”
Side-note: Kristen Bell is amazing.

Well (spoiler alert), we definitely can do better! You might think that the only way we can get better performance is with more data or using side-data in our model, and while we’ll certainly do that in this blog post series, there are still more tricks we can do to cut down on training time and have a significantly better model. Rather than make a novel of a blog post, I decided to split this into different posts, so in the next blog post in this series, we’ll explore some more of these options baked right into Collie to boost model performance.

I’m sure you’re on the edge of your seat wondering when you can read this next post! Well, it’s already posted, and you can read it here.