Netflix Million Dollar Challenge: Understanding the algorithm that won a million dollars
Providing recommendations is ubiquitous for a variety of products and services. One prime example is Netflix. In 2006, Netflix made an open challenge to data scientists world wide: improve their existing recommender system by 10%, win a million dollar. They opened a huge training and test data set for data scientists to use. Koren et al. ultimately won the prize with their matrix factorization algorithm.
Background
There are multiple recommender systems i.e. that spits out recommendation for a particular user. One such method is called content based filtering approach where you create profiles for items and users and give recommendation based on those profiles (content of the profiles) i.e. a person who likes action movies will enjoy movies that are tagged as ‘action’ or you can say that a young male teenager (demographic data), on average, might like a fast-paced, action movie. The obvious downside to this is that you need to gather data about the items and the users. An alternative approach, called collaborative filtering, relies on past user actions to predict future behavior.
The two primary areas of collaborative filtering are the neighborhood methods and latent factor models.
Neighborhood methods
Neighborhood methods are centered on computing the relationships between items or, alternatively, between users. The item oriented approach evaluates a user’s preference for an item based on ratings of “neighboring” items by the same user or a group of users related to each other by some factors i.e. friends etc. A product’s neighbors are other products that tend to get similar ratings when rated by the same user or group of users. Figure 1 from the paper gives a visual explication for neighborhood method:
Latent factor model (LFM)
LFM relies solely on user ratings for items and infers factors that affected the user’s decision behind the rating i.e. did the user rate Saving Private Ryan high because they love war movies in general?
Matrix Factorization
Our input is the rating matrix R (n x m), where each row is a user and each column is a movie. R(I,j) is the rating user I has given to movie j. We decompose R into two matrices P(n X k) and Q(k X m). P(n x k) denotes the users’ preferences for certain latent factors and matrix Q(k x m) denotes the presence/absence of the aforementioned latent factors in the items (movies and shows for Netflix). In other words, for a given user u, the elements of P(u) measure the extent of interest the user has in items that are high on the corresponding factors. For a given item i, the elements of Q(i) measure the extent to which the item possesses those factors. We do not know what these factors are, unfortunately. But, once we have generated P and Q, we can take educated guesses on what they are. For example, after generating P and Q, we see the Saving Private Ryan, Pearl Harbor and Apocalypse Now has high correlation with Factor 1. We also see that User A, who has rated Band of Brother very highly, has a high correlation with Factor 1. Our educated guess, at this point, can be that Factor 1 points to the genre “war shows”.
Decomposing R into latent factor matrices
Given, P and Q, we can calculate R(u,i) by the following formula:
Therefore, if we can populate P(u) for all users and Q(i) for all items, then we can predict ratings by users, which, in turn, can be used to give recommendations. So how do we populate P(u) and Q(i)?
Initially, all entries in the matrices P and Q are unknown. We build P and Q through an optimization process.
How many unknowns do we have? All the entries in P and Q, i.e. we have (nk + km) unknown real numbers. In the optimization process, we will initialize P and Q randomly, and start adjusting all their entries, such that whenever there is a non-missing rating r(i,j) in R, the inner product of corresponding k-dimensional vectors P(i) and Q(j) will approximate the rating r(i,j). I’m using P(i) to denote the i-th row of P and Q(j) to denote the j-th column of Q. Since we know this value, we calculate how good the estimation is by taking the difference between the user given rating and the algorithm approximated rating. Then we adjust P(i) and Q(j) accordingly to minimize this difference a.k.a the error. This is the learning step and the process used by Koren et al. is called stochastic gradient descent.
When some rating r(u,v) is missing in R (user u didn’t rate item v), we just don’t know how to constrain the inner product of P(u) and Q(v), so we will just “let them be” without worrying about them.
But, the user u rated other items, the entries of P(u) will surely be adjusted elsewhere; and similarly, because the item v was rated by other users, the entries of Q(v) will be adjusted elsewhere. So, P(u) and Q(v) are NOT left untouched in the optimization process.
Koren et al. also takes into account user biases, for example, the user being a serious or casual movie reviewer and how they vary over time. They project these biases as functions of time, and use them to learn about user biases, so that they can give the best recommendation based on the user’s current bias. For example, a user might have started their reviewing journey as a casual reviewer but became more critical as life forced them to be more cynical (all hypothetical!). So their initial movie ratings are highly but their recent ratings took a sad, downward turn. A rating of 6 might seem low for an average user, but for this new cynical user, its decent. We need to be able to recognize and utilize these biases.
Moreover, how do you take hype into consideration? We sometimes give higher ratings to a movie not because we absolutely love it but because everyone else loves it. Or, it might be because of extravagant marketing. Koren et al. takes this into consideration by adding a confidence variable to their equation.
The ultimate equation looks like this:
Here, c(u,i) is the confidence, r(u,i) is the rating given by the user, mu is average rating for all movies, b(u) is the user bias, b(i) is the movie bias and p(u)T.q(i) is the algorithm approximated rating.
Hopefully, this post sheds some light on this amazing paper! It’s simply astounding to see the data they have been able to extrapolate just from user ratings.