Fetching Better Beer Recommendations with Collie (Part 2)

Nate Jones
7 min read · May 4, 2021


Saving some time by training better models.

Part 1 | Part 2 (you are here) | Part 3

An image of the Collie dog looking at glasses of beers with a question mark thought.

TL;DR — I talk more about ShopRunner’s latest open source library, Collie [GitHub, PyPI, Docs], for training and evaluating deep learning recommendation systems. We train better models and evaluate them. But still, can we do better?

A Small Recap

By this point in the blog series, we have prepared a dataset of users trying different beers, formatted it for use in a Collie model, and trained a basic matrix factorization model that is learning something, but is not quite great yet. Can we do better?

A Fancier Architecture

After the last blog post, you might be laughing to yourself that our matrix factorization model was destined to underperform because of its simplistic architecture. “Why is a blog post in 2021 talking about an architecture invented in 2006 — the same year when people finally realized Pluto is not a planet?”

So let’s just get this out of the way and try a fancier architecture, one that truly uses deep learning to its full potential.

A GIF from Star Trek, with a character saying, “I’m drunk on power.”

One such architecture implemented in Collie is Neural Collaborative Filtering, an architecture introduced in 2017 that has since been used in many top-scoring Kaggle competition submissions. Notice how similar the API for this new model architecture is to that of the simple matrix factorization model above.
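The original code embed didn’t carry over to this page, but a sketch of the comparison looks roughly like this. Treat it as a hypothetical sketch based on my reading of Collie’s documentation: exact import paths and parameter names may differ in your installed version (early releases shipped under the `collie_recs` name), and `train_interactions`/`val_interactions` are the data splits prepared in Part 1.

```python
# Hypothetical sketch, assuming Collie's documented model classes.
from collie.model import (
    CollieTrainer,
    MatrixFactorizationModel,
    NeuralCollaborativeFiltering,
)

# The two architectures share nearly the same constructor signature...
mf_model = MatrixFactorizationModel(train=train_interactions, val=val_interactions)
ncf_model = NeuralCollaborativeFiltering(train=train_interactions, val=val_interactions)

# ...and train identically; only the model object changes.
trainer = CollieTrainer(model=ncf_model, max_epochs=10)
trainer.fit(ncf_model)
```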

The MAP@10 scores table showing a new row added for Neural Collaborative Filtering, with score 0.00120.

Surprisingly, all this power in the Neural Collaborative Filtering architecture doesn’t outperform even our most basic matrix factorization model. Your mileage may vary greatly with architecture choices, as so much of recommendation performance depends on the data used.

While many of the most popular recommendation model architectures are already implemented in Collie, it’s simple to create a new one without repeating much of the boilerplate used in existing models. Since every Collie model extends the same parent class, it’s simple to swap out model architectures and even try completely novel ones with very few code changes. Collie was built with experimentation in mind, so don’t be afraid to try something new! You can read more about this in the Collie documentation.

For the rest of this blog post, we’ll actually stay with the matrix factorization architecture, since there’s more we can do to significantly improve this model’s results with zero architecture changes!

Penalize the Model More!

While a better technical definition surely exists, we can look at our cost (also known as loss) function as a proxy for measuring how wrong the model is in its recommendations, and thus how much to penalize the model.

Our model learns best when it makes mistakes, and the amount it learns scales with how bad a mistake it makes. The cost function shows how much the model needs to pay for its mistake — you can imagine that a really bad mistake during training will cost the model a lot more.

During training, we want to encounter times when our model ranks an item a user interacted with lower than something the user did not interact with (which is obviously a mistake). When the model ranks the negative item a bit higher, we only fine it a bit, but when it ranks it a lot higher, we can fine it a lot, and thus it learns a lot more.
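To make the “fine” intuition concrete, here is a toy pairwise hinge loss in plain Python. This is a conceptual illustration only, not Collie’s actual implementation:

```python
def pairwise_hinge_loss(positive_score, negative_score, margin=1.0):
    """Fine the model when a negative item scores close to (or above)
    a positive item. The worse the mistake, the larger the fine."""
    return max(0.0, margin - (positive_score - negative_score))

# A small mistake costs a little...
print(pairwise_hinge_loss(positive_score=2.0, negative_score=1.5))  # 0.5
# ...a bad mistake (negative ranked far above positive) costs a lot...
print(pairwise_hinge_loss(positive_score=1.0, negative_score=3.0))  # 3.0
# ...and a clearly correct ranking costs nothing at all.
print(pairwise_hinge_loss(positive_score=5.0, negative_score=1.0))  # 0.0
```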

A GIF from Hey Arnold of kids putting quarters into another kid’s hand.

If we have a model that’s just guessing item ranks, the probability it correctly ranks a positive item higher than a negative item is a coin flip — 50% odds. However, if we wanted to test if the model correctly ranks a positive item higher than ten negative items, our probability is like flipping a coin 10 times and hoping to always get heads — less than 0.1%. With these odds, we are much more likely to run into areas where the model makes mistakes. Ideally, we can use this concept to try to penalize the model a lot, so it learns a lot, and converges on a good recommendations solution, faster.
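The arithmetic behind that claim is quick to verify:

```python
# Probability a randomly-guessing model ranks the positive item above:
one_negative = 0.5        # a single negative item: one coin flip
ten_negatives = 0.5 ** 10  # ten negative items: ten heads in a row

print(f"{ten_negatives:.4%}")  # prints "0.0977%", i.e. less than 0.1%
```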

In Collie, we can achieve this using adaptive losses, where instead of comparing a single item a user has interacted with against a single item the user has not interacted with, we compare it against a bunch of items the user has not interacted with. We can then see which of the negative items is ranked the highest and use that item to calculate the loss for our model, guaranteeing that each time we are maximally penalizing the model during training. While cruel, we have found models learn much more, much faster, when using multiple negative samples.
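In toy form, the only change from a single-negative pairwise loss is taking the hardest (highest-scoring) negative out of the batch before computing the penalty. Again, this is a conceptual sketch, not Collie’s vectorized PyTorch implementation:

```python
def adaptive_hinge_loss(positive_score, negative_scores, margin=1.0):
    """Fine the model based on its *hardest* negative sample."""
    hardest_negative = max(negative_scores)
    return max(0.0, margin - (positive_score - hardest_negative))

# One easy negative yields no fine at all...
print(adaptive_hinge_loss(2.0, [0.5]))  # 0.0
# ...but among ten negatives, the hardest one usually produces a real fine.
negatives = [0.5, -0.25, 1.75, 0.125, 2.5, -1.0, 0.75, 1.25, -0.5, 0.375]
print(adaptive_hinge_loss(2.0, negatives))  # 1.5
```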

In Collie, doing this is simple, as in a two-line-change simple:
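The gist embed didn’t survive the move here, but based on Collie’s documentation the change looks roughly as follows. Treat the exact parameter names (`num_negative_samples`, `loss='adaptive_hinge'`) as assumptions to verify against the docs for your installed version:

```python
# Hypothetical sketch of the two-line change, assuming Collie's documented API.
interactions = Interactions(
    users=users,
    items=items,
    ratings=ratings,
    num_negative_samples=10,  # change 1: sample ten negatives per positive
)

model = MatrixFactorizationModel(
    train=train,
    val=val,
    loss='adaptive_hinge',    # change 2: fine the model using its hardest negative
)
```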

We see evaluating this model leads to better results:

The MAP@10 table with an added row for Adaptive Matrix Factorization with MAP@10 score 0.02115.

Best of all, this added very little time to the model training loop, yet allowed the model to fully train in a third of the epochs of the initial model and still achieve better results! That’s a win in my book!

Every model in Collie uses an adaptive loss by default (unless otherwise specified), so we will use this in every model going forward in this blog post series.

A Simple Optimizer Trick

A meme from SpongeBob showing a smaller, skinnier fish surrounded by two body-building fish. The smaller fish appears to be flexing to the two larger fish. The caption reads “Me making the case that my matrix factorization model with a dot product is better than fancier deep learning recommendation architectures.”

Matrix factorization, despite being so simple, is incredibly effective for recommendation algorithms. However, this simplicity is a double-edged sword, since sometimes our optimizers will push the model in a direction that settles into a local minimum of the loss rather than a more global optimum.

Imagine yourself in the position of the optimizer — you have to tweak a bunch of parameters until you get penalized less than you currently are. In front of you are two kinds of parameters: embeddings (which, in our case, consist of 30 different numbers per user/item) and biases (which consist of only a single number per user/item). You might try tweaking the embeddings some, but have little luck lowering the loss much at all. However, you might then tweak the single bias term and get a much lower loss right away (after all, it’s easier to tweak a single number than a set of 30). With the next batch of data, you’ll probably be inclined to save some time and just tweak the single bias term again rather than fiddle with all the numbers in the embeddings. And you might do this for every batch of data until the loss stops decreasing.

An image showing a stick figure with a control panel on the left labelled “Embedding” and a control panel on the right labelled “Bias Term.” The embedding control panel has 30 levers to adjust, while the bias term control panel has a single lever. The stick figure looks frustrated, saying to themself, “I see why the last person quit…”
Credit to Michael Sugimura for managing to make this illustration while on PTO, supplied only with a single, messy text description from me. Amazing.

The problem with this is that the model is essentially just learning to model these bias terms, meaning we will see that most predictions for our users will just be a near-static list of generally appealing items (since the bias terms will be so much larger than the dot-product of the embeddings). And while most people will probably be okay with these items (I’m sure most people can tolerate a Bud Light), it’s not personalized at all!

In Collie, we solve this issue by allowing a separate optimizer for each parameter type. We have one optimizer that solely tweaks the embeddings and learns fast, and another that solely tweaks the bias terms and learns a bit slower. Both optimizers want to do well at their jobs and end up “cheating” less than a single optimizer would. Now, our model actively tweaks all the parameters that matter, and we end up with a personalized model that learns from its input data as expected and outperforms a single-optimizer model on results and metrics. Sweet!
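Sketched against Collie’s documented API, the setup looks something like the snippet below. The parameter names (`optimizer`, `bias_optimizer`, `lr`, `bias_lr`) and the learning-rate values are assumptions from my reading of the docs, so verify them against your installed version:

```python
# Hypothetical sketch: a fast optimizer for embeddings, a slower one for biases.
model = MatrixFactorizationModel(
    train=train,
    val=val,
    loss='adaptive_hinge',
    optimizer='adam',       # tweaks the embeddings, learns fast
    bias_optimizer='sgd',   # tweaks only the bias terms, learns a bit slower
    lr=1e-2,
    bias_lr=1e-3,
)
```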

An image now showing two stick figures with a control panel on the left labelled “Embedding” and a control panel on the right labelled “Bias Term.” The embedding control panel has 30 levers to adjust, while the bias term control panel has a single lever. Each stick figure is working at its own control panel now.
Michael Sugimura has mastered the art of the stick figure.

And we can see this pays off in the results:

The MAP@10 table with an added row for a multi-optimizer matrix factorization model, with MAP@10 score 0.02178.

At ShopRunner, we have found this optimizer trick results in a model with nearly double the MAP@10 score!

What’s Next?

In this blog post, we’ve been able to get a nearly 60% increase in MAP@10 score using the same data and architecture, which is a pretty incredible feat to me! By understanding how recommendations models learn, we can tweak small parts of the training protocol for seemingly huge increases in metrics and performance. Best of all, our improved model is able to train in less time!

Of course, there is another blog post in this series, which means we can do even better. In the following post, we’ll talk about incorporating side-data directly into the model training to create a hybrid model that outperforms any model we’ve trained above. I’ll also use our best model to make beer recommendations for myself and give a full review. I’m sure your excitement levels are off-the-charts just thinking about that!

A GIF from the show Silicon Valley showing the character Dinesh celebrating to himself, saying, “Ooooooh! This is the best day of my life!”

Even better, it’s already posted, and you can read it here.



