Using Per-Item Features

Welcome back! We have already seen how recommendations are made using collaborative filtering.

Akshita Guru
Operations Research Bit
6 min read · Jun 16, 2024


Now we are going to discuss this algorithm further.

Image source: https://www.researchgate.net

So here’s the same data set that we had previously, with four users having rated some, but not all, of the five movies. What if we additionally have features of the movies? So here I’ve added two features, X1 and X2, that tell us how much each movie is a romance movie and how much each is an action movie.

So for example, Love at Last is a very romantic movie, so this feature takes on the value 0.9, but it’s not at all an action movie, so that feature takes on 0. But it turns out Nonstop Car Chases has just a little bit of romance in it, so it’s 0.1, but it has a ton of action, so that feature takes on the value 1.0.

So you may recall that I used the notation nu to denote the number of users, which is 4, and m to denote the number of movies, which is 5. I’m also going to introduce n to denote the number of features we have here, which is 2. With these features we have, for example, that the feature vector for movie one, Love at Last, is X(1) = [0.9, 0], and the feature vector for the third movie, Cute Puppies of Love, is X(3) = [0.99, 0].
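
To make this concrete, here is a minimal sketch of the data in code. The feature rows for Love at Last, Cute Puppies of Love, and Nonstop Car Chases use the values quoted above; the other two feature rows and the exact rating values are illustrative placeholders that follow the pattern described in the figure (NaN marks a missing rating):

```python
import numpy as np

# Movie features: column 0 = x1 (romance), column 1 = x2 (action).
X = np.array([
    [0.90, 0.00],   # Love at Last
    [1.00, 0.01],   # Romance Forever       (placeholder values)
    [0.99, 0.00],   # Cute Puppies of Love
    [0.10, 1.00],   # Nonstop Car Chases
    [0.00, 0.90],   # Swords vs Karate      (placeholder values)
])

# Ratings matrix Y (movies x users); np.nan marks "not rated".
# These values are illustrative, matching the pattern described above.
Y = np.array([
    [5,      5,      0,      0     ],   # Love at Last
    [5,      np.nan, np.nan, 0     ],   # Romance Forever
    [np.nan, 4,      0,      np.nan],   # Cute Puppies of Love
    [0,      0,      5,      4     ],   # Nonstop Car Chases
    [0,      0,      5,      np.nan],   # Swords vs Karate
])

# r(i, j) = 1 if user j has rated movie i, else 0.
R = (~np.isnan(Y)).astype(int)

m, nu = Y.shape     # m = 5 movies, nu = 4 users
n = X.shape[1]      # n = 2 features
```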

And let’s start by taking a look at how we might make predictions for Alice’s movie ratings. So for user one, that is Alice, let’s say we predict the rating for movie i as w(1) · X(i) + b(1). So this is just a lot like linear regression.

  • For example, if we end up choosing the parameters w(1) = [5, 0] and b(1) = 0, then the prediction for movie three, whose features are X(3) = [0.99, 0] (first feature 0.99, second feature 0, just copied from above), would be w(1) · X(3) + b(1) = 0.99 × 5 + 0 × 0 + 0 = 4.95. And this rating seems pretty plausible; a worked sketch follows this list.
  • It looks like Alice has given high ratings to Love at Last and Romance Forever, two highly romantic movies, but low ratings to the action movies Nonstop Car Chases and Swords vs Karate. So predicting that she might rate Cute Puppies of Love as 4.95 seems quite plausible, and these parameters w(1) and b(1) look like a reasonable model for predicting her movie ratings.
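
Here is that arithmetic as a small sketch, using the w(1) = [5, 0], b(1) = 0, and X(3) = [0.99, 0] values from the text:

```python
import numpy as np

w1 = np.array([5.0, 0.0])    # parameters chosen for user 1 (Alice)
b1 = 0.0
x3 = np.array([0.99, 0.0])   # features of movie 3, Cute Puppies of Love

prediction = np.dot(w1, x3) + b1   # 0.99 * 5 + 0 * 0 + 0
print(prediction)                  # 4.95
```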

More generally, in this model we can predict user j’s rating for movie i, not just user 1’s, as w(j) · X(i) + b(j). Here the parameters w(j) and b(j) are the ones used to predict user j’s ratings, and X(i) is the feature vector of movie i. This is a lot like linear regression, except that we’re fitting a different linear regression model for each of the 4 users in the dataset.
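
In code, “one linear regression per user” just means keeping one weight vector and one bias per user. A minimal sketch, assuming a matrix W that stacks the per-user weight vectors and a vector b of per-user biases (names of my own choosing):

```python
import numpy as np

def predict_all(X, W, b):
    """Entry (i, j) is user j's predicted rating of movie i: w(j) . X(i) + b(j).

    X is (m, n) movie features; W is (nu, n), one weight row per user;
    b is (nu,), one bias per user.
    """
    return X @ W.T + b
```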

  • So let’s take a look at how we can formulate the cost function for this algorithm. We have already discussed cost functions in an earlier article.

So let’s write out the cost function for learning the parameters w(j) and b(j) for a given user j, and let’s just focus on that one user j for now. I’m going to use the mean squared error criterion. So the cost will be the prediction, w(j) · X(i) + b(j), minus the actual rating that the user gave, y(i,j), squared. And we’re trying to choose parameters w(j) and b(j) to minimize the squared error between the predicted rating and the actual rating that was observed.

But the user hasn’t rated all the movies, so we’re going to sum only over the values of i where r(i,j) = 1, that is, only over the movies i that user j has actually rated. And then finally we can take the usual normalization, 1/(2m(j)), where m(j) is the number of movies user j has rated.

And this is very much like the cost function we have for linear regression with m, or really m(j), training examples, where you sum over the m(j) movies for which you have a rating, take the squared error, and normalize by 1/(2m(j)):

    J(w(j), b(j)) = (1 / (2m(j))) · Σ over i with r(i,j)=1 of ( w(j) · X(i) + b(j) − y(i,j) )²

And if we minimize this as a function of w(j) and b(j), then you should come up with a pretty good choice of parameters w(j) and b(j) for making predictions for user j’s ratings.
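
Putting that together, here is a sketch of the per-user cost, assuming the X, Y, and R arrays defined earlier (R[i, j] = 1 where user j rated movie i):

```python
import numpy as np

def cost_user(w, b, X, Y, R, j):
    """J(w(j), b(j)): squared-error cost over the movies user j actually rated."""
    rated = R[:, j] == 1                  # the movies i with r(i, j) = 1
    m_j = rated.sum()                     # m(j), how many movies user j rated
    err = X[rated] @ w + b - Y[rated, j]  # prediction minus observed rating
    return (err ** 2).sum() / (2 * m_j)
```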

  • Let me add just one more term to this cost function: the regularization term, to prevent overfitting. So here’s our usual regularization parameter, lambda, divided by 2m(j), times the sum of the squared values of the parameters w(j). Here n is the number of numbers in X(i), which is the same as the number of numbers in w(j):

    J(w(j), b(j)) = (1 / (2m(j))) · Σ over i with r(i,j)=1 of ( w(j) · X(i) + b(j) − y(i,j) )² + (λ / (2m(j))) · Σ from k=1 to n of ( w_k(j) )²

    If you minimize this cost function J as a function of w(j) and b(j), you should get a pretty good set of parameters for predicting user j’s ratings for other movies.
  • So, to learn the parameters w(j), b(j) for user j, we minimize this cost function as a function of w(j) and b(j). But instead of focusing on a single user, let’s look at how we learn the parameters for all of the users. To learn the parameters w(1), b(1), w(2), b(2), …, w(nu), b(nu), we take the cost function above and sum it over all nu users.
  • So we would have the sum from j = 1 to nu of the same cost function that we wrote above, and this becomes the cost for learning all the parameters for all of the users. And if we use gradient descent or any other optimization algorithm to minimize this as a function of w(1), b(1) all the way through w(nu), b(nu), then we have a pretty good set of parameters for predicting movie ratings for all the users; a sketch follows this list.
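
As a sketch of those last two bullets, the regularized cost summed over all nu users might look like this, again assuming the X, Y, and R arrays from earlier:

```python
import numpy as np

def total_cost(W, b, X, Y, R, lam):
    """Regularized cost summed over all nu users; W is (nu, n), b is (nu,)."""
    J = 0.0
    for j in range(W.shape[0]):
        rated = R[:, j] == 1
        m_j = rated.sum()
        err = X[rated] @ W[j] + b[j] - Y[rated, j]
        J += (err ** 2).sum() / (2 * m_j)          # squared-error term
        J += lam / (2 * m_j) * (W[j] ** 2).sum()   # regularization on w(j)
    return J
```

Minimizing total_cost over W and b with gradient descent, or any other optimizer, then recovers all the per-user parameters at once.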

And you may notice that this algorithm is a lot like linear regression, where w(j) · X(i) + b(j) plays a role similar to the output f(x) of linear regression, only now we’re training a different linear regression model for each of the nu users. So that’s how you can learn parameters and predict movie ratings, if you have access to features X1 and X2 that tell you how much each movie is a romance movie and how much it is an action movie.

But where do these features come from? And what if you don’t have access to such features that give you enough detail about the movies with which to make these predictions?

In the next article, we’ll look at a modification of this algorithm that will let you make predictions, that is, make recommendations, even without such features. So STAY TUNED!

References

[1] For collaborative filtering click here

[2] You can connect with me on the following: LinkedIn | GitHub | Medium | email: akshitaguru16@gmail.com
