Implicit BPR recommender (in Tensorflow)

Victor
Rn Engineering
Published in
11 min readMar 15, 2019

--

This is a summary and Tensorflow implementation of the concepts put forth in the paper BPR: Bayesian Personalized Ranking from Implicit Feedback by Steffen Rendle, Christoph Freudenthaler, Zeno Gantner and Lars Schmidt-Thieme.

Content:

  • Introduction
  • Bayesian Personalized Ranking
  • The Tensorflow model
  • The dataset
  • Ok, let’s write it! (the code)
  • Summary
  • References

Introduction

This post relies heavily on the concepts covered in detail in my story on ALS Implicit Collaborative Filtering. So for the sake of not being super repetitive, I will assume that you are familiar with matrix factorization, the difference between implicit and explicit feedback, the concept of latent features, using latent space vectors to make recommendations etc.

I will instead focus specifically on how the Bayesian Personalized Ranking (BPR) approach works and how to implement it using Tensorflow.

Bayesian Personalized Ranking

The paper puts forth something they call “BRP-Opt”, a generic optimization criterion for optimal personalized ranking. Basically what this means is an approach that can be applied to different types of recommendation models like Matrix Factorization or k-Nearest-Neighbour (that's the “generic” part) and that solves a ranking for a set of items for a given user. In that way, it’s similar to ALS that also uses matrix factorization and tries to arrive at a ranking by using the preference and confidence values in the data.

The implicit problem

As a quick refresher, the core problem of any implicit feedback recommender is how to treat the unknown values in our data. A user might not have played a song because they hate it or because they have not yet discovered it (and it’s our job to make sure they find it), so we can’t just treat all unknown values as negatives. Instead, we need to find a way to learn from the items the user has not interacted with as well as the ones they have.

But while we can't treat all unknown items as something a user does not like, it’s reasonable to assume that in general, we can be more confident that a user has a higher preference for an item that they have consumed than one they have not. If I’ve played songs by Bob Dylan but never a song by Megadeth we can assume that, in general, I have a stronger preference for the former rather than the latter.

This is the strategy BPR uses to come up with its ranking. Whereas most other optimizers just look at if a user interacted with an item, BPR looks at the user, one item the user interacted with and one item the user did not (the unknown item). This gives us a triplet (u, i, j) of a user (u), one known item (i) and one unknown item (j).

We can express the relationship between known and unknown items like this:

Here x̂ui denotes the score for user u and a known item i and x̂uj the score for user u and unknown item j. The above condition is satisfied if the score for the known item is larger than that of the unknown one.

NOTE: One important condition we assume is true is that all users act independently of every other user and that the ordering of any pair of items for any given user is independent of the order of any other pair of items.

Bayesian formulation

BPR uses a Bayesian formulation to find a personalized ranking for a user for all items i in the set of items I ( i I ) by maximizing its posterior probability.

The key word here is “Bayesian”. It comes from Bayes theorem that calculates the probability P that some hypothesis H is true given some event E. You can calculate this probability by taking the prior probability that the hypothesis was true P(H) multiplied by the probability of the event given that the hypothesis is true P(E|H) divided by the total probability of the event occurring P(E):

In this equation P(H|E), the probability of our hypothesis being true given an event is called the posterior probability. It can also be thought of as that the posterior probability is proportional to the likelihood multiplied by the prior probability.

Posterior probability ∝ Likelihood x Prior probability

So what we want to do is to use a formulation to update the probability of our hypothesis as more events/information becomes available, i.e we want to maximize the probability of it being true.

A closer look at the math

So we know the concept of using triplets (u, i, j) rather than pairs for ranking and what Bayesian inference means. Now let's dive a bit deeper and have a look at the specifics of deriving our final optimization criterion. I will be leaving some stuff out so for a comprehensive explanation have a look at the paper.

Using the formula Posterior probability ∝ Likelihood x Prior probability the posterior probability we want to maximize in this case is:

This reads as “the probability of Θ given >u is proportional to the probability of >u given Θ multiplied by the overall probability of Θ”. Here >u is a latent preference structure for user u and Θ represents the parameters of some kind of model, like matrix factorization or kNN.

So basically we want to maximize the probability of parameters Θ given a latent preference structure for user >u. The paper shows us that the product of the likelihood p(>u | Θ) is equal to the product of p(i <u j | Θ):

Next, we define the likelihood that a given user actually prefers the known item i over unknown item j as the following:

Here σ is the sigmoid function and x̂uij(Θ) is some kind of function that models the relationship between a user, a known item and an unknown item given a set of parameters. BPR does not dictate what function this is and can, therefore, be used with a number of different model classes. In our case this function is the difference between the score of u and j subtracted from the score of u and i:

We can then rewrite it to use the sigmoid function for the likelihood. The paper also suggests using ln sigmoid (natural log) based on their MLE (Maximum Likelihood Evaluation):

Remembering the logarithm product rule ( ln(a * b) = ln(a) + ln(b) ) we can now change the whole equation to:

Based on this we then arrive at the final optimization criterion for our model:

  • Θ: Our model parameter vector.
  • x̂uij: Relationship between (u, i, j). Here the score of (u, j) subtracted from the score of (u, j).
  • ln: The natural logarithm
  • σ: The logistic sigmoid
  • λ: Lambda, the regularization hyperparameters for our model.
  • ||Θ||: The L2 norm of our model parameters

The Tensorflow model

I will assume you know the basics of Tensorflow and instead focus on the specifics of this implementation. If you don’t really care about the specifics all you need to know is that Tensorflow is a machine learning framework that, at its core, lets you define matrix and vector operations in a graph and then execute said graph while feeding it some data.

With that said, these are some of the nodes we will be using when building our graph that might not be super obvious to you if you have not written anything in Tensorflow before:

  • Variables: Nodes that maintain their state in the graph across multiple executions. These are the things in our graph we want to train in order to minimize the loss function.
  • Placeholders: As the name suggests these are placeholders for the data we later feed into the graph as we execute it. They contain no data when we create them, just the shape of the data it will later receive.
  • Embedding lookup: Lets us retrieve a row from a tensor given its index. Similar to indexing into a numpy array.
  • Optimizer: There are different types of optimizers but what they do is to calculate the loss function and then update the values in the graph for the next run in order to minimize that loss.

The model

We will implement BPR using matrix factorization which means taking the interaction matrix R of size (all users x all items) and factoring it into matrix U of size (all users x latent features) and V of size (latent features x all items). So when we’re done want R ≈ U x V.

We will use Stochastic Gradient Descent (SGD) to arrive at this approximation. It works by initializing U and V with uniformly random values and then, using the optimization criterion from before, continuously updating them with new values to maximize the posterior probability i.e. getting a higher and higher probability for R = U x V.

Let’s have a look at a simplified illustration of the graph we will be implementing and how data will flow through it.

So we start at the top with our inputs (placeholders) which will be a user id u, a known item id i and unknown item id j. We then also randomly initialize our matrixes U and V (variables).

From here we create three embeddings, one for the user factors, known item factors and unknown item factors. We take these embeddings and multiply them along with a bias to get x̂ui and x̂uj respectively.

Next, we take x̂ui and x̂uj, compute x̂uij and feed it into the negative natural log — of the sigmoid — of x̂uij. So we get -ln σ(x̂uij).

Finally, we add the L2 norm to -ln σ(x̂uij) and arrive at the final loss function:

Let’s expand this out to see it in the form it will actually be implemented in code:

(If you want you can also skip the first negative sign and instead subtract ln σ(x̂ui-x̂uj) from the L2 norm. Not really sure what looks best.)

The dataset

As before we’ll be using the lastfm dataset containing the listening behavior of 360,000 users. It contains the user id, an artist id, the name of the artists and the number of times a user played any given artist. There is also stuff about user ages, genders, countries etc. but we’re not interested in that for now.

Ok, let’s write it!

First, we’ll import the libraries we need, load the dataset and do some wrangling to get it into the shape we need it.

Next, let’s set up some hyperparameters for our models. These are by no means tuned so you should definitely play around with them.

Now to the fun stuff. Let’s define the Tensorflow graph as discussed above. We start by initializing the graph and then define a couple of helper functions to make the code a bit cleaner.

We create one function that just initializes a Tensorflow variable with random values, one that uses that function to create a variable and then embeds it using an embedding lookup and one for getting the value of a variable back given its name.

Then on to the actual graph:

First, we create placeholders for u, i and j that we can later feed data into. We initialize our U and V matrixes and create embeddings for u_factors, i_factors, and j_factors. We also create embeddings for our biases i_bias and j_bias.

Next, we calculate the score for our user and item i as well as for item j to get x̂ui and x̂uj and then we compute x̂uij.

Here we also the AUC score and add it to our Tensorflow summary so we can monitor it during training.

For the L2 norm, all we need to do is to square each parameter multiplied with the regularization value and then add them all together. Taking the negative natural log of the sigmoid of x̂uij, plus the L2 norm gives us our final loss function. Lastly, we use the built-in Adam optimizer to minimize our loss.

Now the thing with Tensorflow is that we have not yet done any actual calculations. You would not be able to run the above code and print some value. All we have done is define a graph we will now feed data over a number of iterations. We’ve built a system of pipes, now it’s time to pour water into it.

For each epoch and batch we defined in our hyperparameters we sample a number of random indices from our dataset and then get user ids (u) and known item ids (i) matching those indices. We then sample 15 000 random unknown items (j).

These items go into our feed_dict which is a dictionary we will feed into our graph. We then run the graph with those values, update the progress bar and display the current loss and AUC values.

And that’s it. Now we can train our model for a number of epochs hopefully watching the loss steadily decreasing towards 0 and the AUC value increasing towards 1.

Finding similar items

Now that we have a trained model we can use it to rank some items. Let’s start by finding some artists similar to Beyoncé. We use the get_variable helper function we used before to grab the trained U and V vectors. Then we use the same code as we did for ALS to calculate the most similar artists.

Running this gives us that the most similar artists to Beyoncé are: Rihanna, Mariah Carey, Britney Spears, Black Eyed Peas, Michael Jackson, Nelly Furtado, Katy Perry, Alicia Keys, and Justin Timberlake. Remember, these are not necessarily similarities in genre or style but rather in listening behavior. Looks like it makes sense though.

Running it again with Bob Dylan gives us the following: The Beatles, Beck, The Rolling Stones, David Bowie, Pink Floyd, Radiohead, Led Zeppelin, The Cure, R.E.M etc. After 50 epochs we get a loss value of 0.051 and an AUC value of 0.997 or 99.7%.

Making recommendations

The same goes for making recommendations for a specific user. Once we get our U and V matrixes we can use the same code as for ALS to return the n highest ranked artists for that user.

Summary

So there we have it, a basic implementation of BPR using Tensorflow. One improvement to the algorithm proposed by the author of the original paper not implemented here is to change the way we sample our (u, i, j) triplets. You can read more about it here.

References

--

--