How do Recommendation Systems work?
If I ask you to name one application of Machine Learning that you use every day, what comes to mind? No points for guessing! Recommendation engines are so common in the products we use daily that we don’t even consciously think about them. From YouTube suggesting the next video, to Amazon showing you “People who bought this also bought..”, to Medium suggesting which articles to read next, to Netflix recommending which movie you should watch next. It’s everywhere! Let’s understand the approaches to building recommendation systems.
Let’s take the example of the Netflix recommendation system. The problem statement is simple: recommend movies that your users might watch next.
Take 1: Recommend the most popular movies
This can be observed on many news sites, which display the most read/most emailed/most shared stories segmented by news category. You could do the same here: recommend the most watched/most reviewed/most starred movies across various genres.
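As a rough illustration, here is a minimal sketch of a popularity-based recommender, assuming a pandas DataFrame of watch events with hypothetical columns user_id, movie_id and genre:

```python
import pandas as pd

def most_popular(watches: pd.DataFrame, genre: str = None, k: int = 10):
    """Return the k most-watched movie ids, optionally within a single genre."""
    if genre is not None:
        watches = watches[watches["genre"] == genre]
    # Count how many distinct users watched each movie and take the top-k.
    counts = watches.groupby("movie_id")["user_id"].nunique()
    return counts.sort_values(ascending=False).head(k).index.tolist()
```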
Pros:
Very simple to implement.
Cold start problem is not encountered here.
Cons:
It puts a lot of work on the user. We might hate to admit it, but we like things handed to us on a plate, don’t we? Imagine having to decide which movie to watch next every single time by going through the synopsis, the reviews, the ratings, etc. Painful, isn’t it? No one likes browsing, especially not for entertainment. This is the era of personalization: people want to discover things they haven’t even thought of or heard of.
Take 2: Build a classifier based on user demographics and user actions
Another approach could be to predict whether a user would like or dislike a movie. So, build a model using some or all of the following features (a minimal sketch follows the list):
+ User’s demographics (age, gender, country, etc.)
+ Features (genre, length, lead actor, etc.) of the movie for which the prediction is being made
+ Features (genre, length, lead actor) of recently watched movies
+ Frequency/recency/duration of movies watched
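A minimal sketch of this idea, assuming the features above have already been assembled into a numeric table with a hypothetical “liked” column as the label; scikit-learn’s RandomForestClassifier is just one reasonable choice here:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical feature table: one row per (user, movie) pair,
# categorical features assumed to be already encoded as numbers.
data = pd.read_csv("user_movie_features.csv")  # assumed file name
X = data.drop(columns=["liked"])               # demographics + movie features
y = data["liked"]                              # 1 = liked, 0 = disliked

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))
```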
Pros:
Features are personalized. The user’s history and movie-watching habits are captured in this approach.
Cons:
Enough features might not be available to predict accurately.
Take 3: Recommend movies based on other people’s tastes
Also seen as “People who watched X also watched..” or “People who purchased this also bought..”. This approach is known as collaborative filtering. The implementation is simple. Create an N x N matrix where both the rows and the columns represent the movies available on Netflix. Cell (i, j) holds the number of users who watched both movie i and movie j. This matrix is also known as the “co-occurrence matrix”. Now, making recommendations is really simple: if a user has watched movie i, just look up the i-th row of the matrix and recommend the top-k movies in that row.
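A rough sketch of building such a co-occurrence matrix from a watch log, assuming a hypothetical list of (user_id, movie_id) pairs with movie ids numbered 0 to N-1:

```python
import numpy as np
from collections import defaultdict

def cooccurrence_matrix(watches, num_movies):
    """watches: iterable of (user_id, movie_id) pairs; movie ids in [0, num_movies)."""
    movies_by_user = defaultdict(set)
    for user, movie in watches:
        movies_by_user[user].add(movie)

    C = np.zeros((num_movies, num_movies), dtype=int)
    for movies in movies_by_user.values():
        # Every pair of movies watched by the same user co-occurs once;
        # the diagonal C[i, i] ends up counting the watchers of movie i.
        for a in movies:
            for b in movies:
                C[a, b] += 1
    return C
```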
What if there are very popular movies? Say the top-3 movies in every row are the same. Would you then recommend the same list to everyone? In that case, it wouldn’t be any different from our popularity-based approach. Any solutions?
Well, one approach would be to normalize the co-occurrence matrix. So what are we trying to do here? For a movie i, we are trying to find similar movies by looking at every other movie j in row i of the co-occurrence matrix. We can use a similarity metric called Jaccard similarity to normalize the co-occurrences.
So, for each i, j:
S[i, j] = (# of users who watched both movies i and j) / (# of users who watched i or j)
Here S is the similarity matrix computed by applying Jaccard similarity to the co-occurrence matrix.
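Building on the hypothetical matrix C from the sketch above (its diagonal C[i, i] counts the users who watched movie i), the Jaccard normalization can be sketched as:

```python
import numpy as np

def jaccard_similarity(C):
    """Normalize a co-occurrence matrix C with the Jaccard index."""
    num_movies = C.shape[0]
    watchers = np.diag(C)                    # C[i, i] = users who watched movie i
    S = np.zeros_like(C, dtype=float)
    for i in range(num_movies):
        for j in range(num_movies):
            # |A or B| = |A| + |B| - |A and B|
            union = watchers[i] + watchers[j] - C[i, j]
            S[i, j] = C[i, j] / union if union > 0 else 0.0
    return S
```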
But recommendations are still based only on the current movie watched. To get more relevant suggestions, the watch history should be considered as well, right? Alrighty, one way to solve this would be to average the similarities between each of my recently watched movies and a candidate movie up for recommendation. So, let’s say I have watched 3 movies X, Y, Z recently, and the candidate movies for recommendation are 1 to n. For each candidate movie i:
Score(i) = (S[X, i] + S[Y, i] + S[Z, i]) / 3
So, we calculate Score(i) for every candidate movie and recommend the top-k movies.
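A minimal sketch of this scoring step, reusing the hypothetical similarity matrix S from above:

```python
import numpy as np

def recommend(S, recently_watched, k=10):
    """Score every movie by its average similarity to the recently watched ones."""
    scores = S[recently_watched].mean(axis=0)   # average the relevant rows of S
    scores[recently_watched] = -1.0             # don't re-recommend what was just watched
    return np.argsort(scores)[::-1][:k]         # indices of the top-k scores
```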
But..
- We do not consider context (time of day), user-specific features, or product-specific features (director, actor, genre, etc.)
- There’s still the cold-start problem: what if a new user or a new movie arrives?
Take 4: Matrix Factorization
Let’s say that we have a dataset of users, movies, and ratings. So, for each user we know the rating s/he gave to a subset of movies. Now, we can create an m x n matrix Rating where Rating[u, v] is the rating given by user u to movie v. In practice, the matrix Rating is sparsely populated, since each user has watched only a small subset of the movies available. The goal is to predict the ratings for all the movies that a user hasn’t watched. How do we do that?
We have to factorize the matrix Rating into two matrices L and R such that:
L is an m x k matrix
R is a k x n matrix
L[u] consists of user-specific features: how much s/he likes action, drama, thriller, etc.
R[v] consists of movie-specific features: how much of it is action, drama, thriller, etc.
So, Rating[u, v] is approximately the dot product of the vectors L[u] and R[v].
Now, we could start making recommendations by sorting movies on the basis of Rating[u,v] for all the movies v that user u hasn’t watched. But..
The catch is that we don’t actually know L and R. How do we find them? Many algorithms already exist that do this. Let’s understand it intuitively.
Do you remember linear regression? It’s nothing but fitting a line through a set of points. Let’s take a simple example. We are given training data (x, y) where x is the independent variable and y is the dependent variable (the variable we want to predict). What linear regression does is figure out the line that best fits the data points (x_i, y_i). A line y = m.x + c is determined by its parameters (m, c). So, linear regression tries to find the optimal parameters (m, c) such that the RSS (residual sum of squares) is minimized, i.e., the points are, overall, as close to the line as possible. The RSS is computed as Sum((y_i - y_predicted_i)²) for i = 1 to n.
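For instance, a quick sketch with numpy and some made-up points (np.polyfit performs the least-squares fit):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # made-up data, roughly y = 2x

m, c = np.polyfit(x, y, deg=1)            # least-squares fit of a degree-1 polynomial
y_predicted = m * x + c
rss = np.sum((y - y_predicted) ** 2)      # residual sum of squares
print(f"m={m:.2f}, c={c:.2f}, RSS={rss:.4f}")
```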
Let’s map the above intuition to our matrix factorization problem.
(x, y) => (u, v, Rating[u, v]); here, u, v are the features and Rating[u, v] is the score.
(m, c) => (L, R); here, L & R are the parameters of the matrix factorization model, just like m, c are the parameters of the linear regression model.
So, a matrix factorization algorithm tries different values of L & R, estimates the Rating matrix, and checks whether the RSS over the observed ratings is minimized. Once we have the optimal L & R, we can predict the ratings of movies not watched by the users and recommend them accordingly.
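A toy sketch of learning L and R by gradient descent over the observed ratings; this only illustrates the idea, it is not a production implementation (libraries such as scikit-surprise or implicit do this properly), and the names and hyperparameters here are assumptions:

```python
import numpy as np

def factorize(ratings, k=10, steps=200, lr=0.01, reg=0.02):
    """ratings: list of observed (u, v, r) triples; returns L (m x k) and R (k x n)."""
    m = max(u for u, _, _ in ratings) + 1
    n = max(v for _, v, _ in ratings) + 1
    rng = np.random.default_rng(0)
    L = rng.normal(scale=0.1, size=(m, k))
    R = rng.normal(scale=0.1, size=(k, n))

    for _ in range(steps):
        for u, v, r in ratings:
            err = r - L[u] @ R[:, v]          # residual for this observed rating
            Lu = L[u].copy()                  # keep the old user vector for the R update
            L[u]    += lr * (err * R[:, v] - reg * L[u])
            R[:, v] += lr * (err * Lu      - reg * R[:, v])
    return L, R

# Predicted rating for an unwatched movie v by user u: L[u] @ R[:, v]
```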
The cold-start problem still remains: the model wouldn’t be able to handle a new user or a new movie entering the system.
So, in this post you learnt various approaches to building recommendation systems. In the next one, we’ll implement a recommendation system.
Thanks for reading. :)