This technical article is aimed at readers with a basic familiarity with machine learning and some of its terminology, but I hope there are insights for anybody. For a more high-level description of the work going on in our team, you can check out this article.
How do you build a session-based recommendation system effectively and accurately?
But before we dive into the details, I’ll start by setting the stage. As an avid browser, I’m always looking for new ways to find content, be it through social media, searching or word-of-mouth. I’m a man of simple tastes — I like locally distilled spirits, and complaining about the Toronto Transit Commission (TTC) despite rarely using it.
I hope to be able to find news that interests me easily, such as when a raccoon tried to catch the Toronto subway or when my home province won big at the World Gin Awards. As a content creator, CBC needs to find new ways of bringing the right content to the right user. This practice is already well established at companies including Netflix, Spotify, and YouTube.
Enter the top-N recommendation system, a class of algorithms that enable us to surface personalized content based on user behaviour. Given a user’s reading history, we want to provide the N most likely articles (in the above image, we have N set to 5, for example) they’d want to read next. The different solutions to this problem would take a while to cover, but I’ll touch on how our team at CBC is tackling it with machine learning.
For those unfamiliar, a recurrent neural network (RNN) performs sequence modelling and classification, given a sufficiently large dataset for training. To mitigate the training instabilities found in a vanilla RNN, such as vanishing gradients, we use the Gated Recurrent Unit (GRU) variant.
You clever folks may have realized that predicting the next article a user reads is a sequence classification problem, like in the above diagram! Given a sequence of articles, attempt to predict the next article, like so:
Given a set A of n articles, a user sequence U for user j, and a GRU that gives a likelihood score for a sequence-article pair, yield the best article by simply taking the argmax:
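In symbols, this looks roughly like the following, where \hat{a} is just a label for the predicted article and GRU(U_j, a_i) stands for the likelihood score the network gives to article a_i after seeing the sequence U_j:

$$
\hat{a} = \underset{a_i \in A}{\arg\max}\; \mathrm{GRU}(U_j, a_i)
$$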
This is fine and dandy for next-item prediction (predicting one article for a user’s next read) and can be trained with cross-entropy loss. In practice, however, learning-to-rank approaches tend to outperform this method, and they require a different loss function.
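To make the next-item setup concrete, here is a minimal sketch in plain Python. It assumes the model has already produced one raw score per article; the article ids and score values are made up for illustration, and the helper names are hypothetical rather than anything from our codebase.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(scores.values())
    exps = {a: math.exp(s - m) for a, s in scores.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

def cross_entropy(scores, target):
    """Negative log-probability assigned to the article the user actually read."""
    return -math.log(softmax(scores)[target])

def top_n(scores, n):
    """The n highest-scoring article ids, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical scores for one user's session (illustrative values only).
scores = {"raccoon-subway": 2.1, "gin-awards": 1.4, "ttc-delays": 0.3, "weather": -0.5}

next_article = max(scores, key=scores.get)   # the argmax prediction
recommendations = top_n(scores, 3)           # a top-N list with N = 3
loss = cross_entropy(scores, "gin-awards")   # loss if the user read "gin-awards"
```

Cross-entropy only cares about the probability given to the one correct article, which is part of why ranking-oriented losses can be a better fit when what you serve is a list.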
Taking cues from this landmark paper, the authors of Session-Based Recommendations with Recurrent Neural Networks derive new pairwise loss functions for this problem. That is, the value of the loss depends on the difference between the score of a target (the right answer) and the score of a negative example (the wrong answer).
Our system uses an implementation of one such loss — defined here — called TOP1-max. It’s written as follows:
Consider the set A of all articles. Let’s take a set of n negative examples from A, i.e. articles the user didn’t read, and call it N. The GRU gives each article j in N a score, call it r_j. This allows us to calculate softmax scores s_j for all negative examples in N, like so:
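Spelled out, that is the standard softmax taken over the negative examples only (k is simply a second index running over N):

$$
s_j = \frac{e^{r_j}}{\sum_{k \in N} e^{r_k}}
$$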
Calculating the loss L, given the score r_t for the target (correct) article, goes as follows:
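In symbols, TOP1-max is usually written as L = Σ_j s_j · (σ(r_j − r_t) + σ(r_j²)), summing over the negatives j in N, with σ the logistic sigmoid. Here is a minimal sketch of that formula in plain Python, assuming the scores have already come out of the GRU. The function names are hypothetical and this is an illustration, not our production code.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def softmax(values):
    """Softmax over a list of scores (shifted by the max for stability)."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def top1_max_loss(target_score, negative_scores):
    """TOP1-max: softmax-weighted pairwise loss against the negative examples.

    target_score    -- r_t, the score of the article the user actually read
    negative_scores -- r_j for each negative example j in N
    """
    weights = softmax(negative_scores)  # s_j
    return sum(
        s_j * (sigmoid(r_j - target_score) + sigmoid(r_j ** 2))
        for s_j, r_j in zip(weights, negative_scores)
    )

# Illustrative values: the same negatives, with a well-ranked and a badly-ranked target.
negatives = [0.1, -0.2, 0.0]
loss_good = top1_max_loss(3.0, negatives)    # target scored far above the negatives
loss_bad = top1_max_loss(-3.0, negatives)    # target scored far below the negatives
```

The first term pushes the target above the negatives, the second acts as a regularizer nudging negative scores toward zero, and the softmax weights focus the loss on the highest-scoring negatives.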
Optimizing the GRU by minimizing this loss, we observe significant performance gains, which brings us to this question — how do we measure performance?
Our database keeps track of the last few days of users’ reading history. Note we say reading instead of clicking — if a user spends too little time on an article, it won’t register. This behaviour functions as a high-pass filter on the user signal (i.e., whether or not they like the article), and also as a clickbait filter.
Given a user’s session history, we chop off the most recent event for training. If you read three articles, we use the first two to predict the third.
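The two steps above (dropping quick clicks, then holding out the last read) can be sketched as follows. The dwell-time threshold is purely an assumption for illustration, since no specific value is given here, and the event format and function names are hypothetical.

```python
MIN_DWELL_SECONDS = 10  # hypothetical threshold, chosen only for this example

def reads_only(events, min_dwell=MIN_DWELL_SECONDS):
    """Keep the article ids whose dwell time meets the threshold, in order."""
    return [article for article, dwell in events if dwell >= min_dwell]

def split_session(articles):
    """Return (input_sequence, target), or None if the session is too short."""
    if len(articles) < 2:
        return None
    return articles[:-1], articles[-1]

# One made-up session as (article_id, seconds_on_page) pairs.
events = [("a", 45), ("b", 2), ("c", 120), ("d", 60)]

session = reads_only(events)        # "b" was only a quick click, so it is dropped
example = split_session(session)    # first reads are the input, the last is the label
```

Sessions with fewer than two registered reads yield no training example, since there is nothing left to predict from.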
“But Jon, you daft duckling,” you may say, “might this introduce a recency bias into the model?” Valid point! It might! Considering only the final element in a session may cause the model to learn more recent user dynamics. Our hypothesis is that this keeps recommendations relevant.
Popular evaluation metrics for recommendation models like this include recall@k (the % of times the correct label is in the top k predictions), MRR@k (the Mean Reciprocal Rank: the average of 1/rank of the correct label, counted as 0 when it falls outside the top k), and catalogue-coverage@k (the % of possible content that you’re recommending when serving k predictions).
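For the curious, here is a minimal sketch of the three metrics in plain Python, following the definitions above. The toy catalogue, ranked lists, and targets are invented for illustration, and in practice these would be computed over a full evaluation set.

```python
def recall_at_k(ranked_lists, targets, k):
    """Fraction of sessions whose true next article is in the top k."""
    hits = sum(1 for ranked, t in zip(ranked_lists, targets) if t in ranked[:k])
    return hits / len(targets)

def mrr_at_k(ranked_lists, targets, k):
    """Mean of 1/rank of the true article, counting 0 if it is outside the top k."""
    total = 0.0
    for ranked, t in zip(ranked_lists, targets):
        top = ranked[:k]
        if t in top:
            total += 1.0 / (top.index(t) + 1)
    return total / len(targets)

def coverage_at_k(ranked_lists, catalogue, k):
    """Fraction of the catalogue that appears in at least one top-k list."""
    recommended = set()
    for ranked in ranked_lists:
        recommended.update(ranked[:k])
    return len(recommended & set(catalogue)) / len(catalogue)

# Toy data: three sessions, a six-article catalogue.
catalogue = ["a", "b", "c", "d", "e", "f"]
ranked_lists = [["a", "b", "c"], ["b", "a", "d"], ["a", "c", "b"]]
targets = ["b", "d", "e"]

recall = recall_at_k(ranked_lists, targets, k=2)
mrr = mrr_at_k(ranked_lists, targets, k=2)
coverage = coverage_at_k(ranked_lists, catalogue, k=2)
```

Recall and MRR reward getting the right article near the top, while coverage is the odd one out: it says nothing about accuracy and only measures how much of the catalogue ever gets shown.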
We use mini-batch stochastic gradient descent to train. All of the above metrics are tracked by our friendly neighbourhood TensorBoard.
As a sanity check, we compare all metrics post-training against a simple baseline — predicting the most popular articles. If we can’t beat that, we’ve got a problem on our hands.
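A popularity baseline is simple enough to sketch in a few lines. This version assumes sessions are plain lists of article ids and recommends the same k most-read articles to everyone; the data and names are made up for illustration.

```python
from collections import Counter

def most_popular(training_sessions, k):
    """The k most frequently read articles across all training sessions."""
    counts = Counter(article for session in training_sessions for article in session)
    return [article for article, _ in counts.most_common(k)]

def baseline_recall(recommended, targets):
    """Recall when every user receives the same list of recommendations."""
    return sum(1 for t in targets if t in recommended) / len(targets)

# Toy training sessions and held-out next reads.
train = [["a", "b"], ["a", "c"], ["b", "a", "d"]]
held_out_targets = ["a", "d", "b"]

popular = most_popular(train, k=2)
popular_recall = baseline_recall(popular, held_out_targets)
```

On a dataset with heavy popularity bias this baseline can be surprisingly hard to beat, which is exactly what makes it a useful sanity check.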
We train a new model like this every hour, using Google’s CloudML service, and serve it via a Flask API, through Google AppEngine. This setup allows us to handle hundreds of requests a second, which is important for major traffic spikes during breaking news. To keep the whole process speedy, we use a small GRU, keeping training times down to 5–10 minutes per model.
Eagle-eyed readers may have noticed that our catalogue coverage for k=5 only makes it to ~20 per cent during training, which sounds like very little (and it is!). Some exploratory analysis might shed some light.
Our dataset, as one might expect from any news dataset, is skewed by popularity bias (anything being rapidly shared or about politics) and position bias (anything selected for the front page of cbc.ca). A look at the long-tail distribution might make this clearer.
The above plot is pretty dense, so I’ll elaborate. Pick an hour of the day, say 12 p.m. (pale green). Find 20.0 on the log x-scale for the line chart, and walk up vertically until you hit the pale green line. The y-value here, ~70 per cent, is how much browsing history is covered by the top 20 per cent of articles at lunchtime. That’s a lot!
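If you want to compute a point on a curve like this yourself, a minimal sketch could look like the following. It assumes the reading history is a flat list of article ids, and the toy data is constructed by hand to mirror the ~70 per cent figure rather than taken from our logs.

```python
from collections import Counter

def share_of_reads(reads, top_percent):
    """Fraction of all reads that go to the top `top_percent` of articles."""
    counts = sorted(Counter(reads).values(), reverse=True)
    # Number of articles in the head of the distribution (at least one).
    n_top = max(1, top_percent * len(counts) // 100)
    return sum(counts[:n_top]) / len(reads)

# Made-up history: 100 reads spread very unevenly over 10 articles.
reads = (
    ["a"] * 50 + ["b"] * 20 + ["c"] * 10 + ["d"] * 5 + ["e"] * 5
    + ["f"] * 4 + ["g"] * 3 + ["h", "i", "j"]
)

head_share = share_of_reads(reads, top_percent=20)
```

Sweeping `top_percent` from 1 to 100 and repeating per hour of the day would trace out one line per hour.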
The GRU achieves a decent recall@5 (~35 per cent), but the bias in the data reveals its weak point: diversity. Work is underway on tackling this issue. Keeps me up at night.
Our foray into recommender systems at CBC has been a fun one so far, and A/B tests are underway to validate our models. Next time you’re checking out the news, look out for your own recommendations! For now, you can find them at MyCBC.
Also, we’re hiring!