Ensemble Modelling in Personalized Ranking
Ranking = Recommendation presented as a list.
In my article How do Recommendation Systems work? we learned different approaches to building recommendation systems. In this post, we’ll learn about stacking different models together to build an Ensemble Ranking Model.
What’s an Ensemble?
It’s a technique of combining multiple models in an intelligent way. We use this technique to improve model performance and/or reduce the chances of selecting a poor one.
Let’s say that you want to watch some movies over the weekend. You select a list of 10 movies and ask your friends to rank them. Your friends have their own ideas about your tastes/preferences. So, all of them gives you a ranked list of movies that you should watch next. You don’t have time to watch all of them. You want to watch only 3. So, how do you figure out the top 3 movies based on their lists?
Here’s a single algorithm. Pick a rank-list, give the top movie score of 1 and give the last movie a score of 0. Everything in between should be a linearly spaced number in decreasing order. Now, for every movie, find out its score from all of the rank-lists and average it. You get a combined score for every movie. Simply, order them in decreasing order of their score and pick top 3. I got this idea from Stack-Overflow.
In our simple algorithm, we provide equal importance to the ranking of all of our friends. This is not true in real life. Some of our friends might know about our preferences better than others. So, their ranking should be given more weight/importance.
Now, map this scenario to our ML world.
Friends = Individual Models
Ensemble = A technique to combine ranking of individual models
Weight = Importance to be given to each model in the ensemble.
Let’s say that we want to build a recommendation system for our streaming service, WatchFlix. We have a list of users U and a list of movies M. We want to recommend k movies to the user so that the likelihood of the user watching one of them is high.
What kind of data do we have? We have implicit feedback from the user. This means that they have not rated any movie explicitly. So, we only know about the movies that they’ve watched on our platform.
- Interaction: Which movies were watched by the user in the past.
- Item Features: Information about the movie like Genre, Directory, Year Released, Actors, etc.
- User Features: Age,Gender,Country,Medium of consumption (mobile/website), Plan subscribed (Basic, Standard, Premium), etc.
Given all of these pieces of information, we have a lot of different ways of building recommendation systems. We’ll quickly recap different approaches and then go on to building the ensemble.
First things first, let’s create different matrices to represent our features.
User Features Matrix
It contains a user’s features. We should one-hot-encode it while modeling.
User Preferences Matrix
It contains a user’s affinity for attributes of movies based on his watch history.
Item Features Matrix
It contains movie-related attributes.
User-Item interaction Matrix
Each cell contains a flag representing whether a user has watched a particular movie or not.
Let’s quickly recap several approaches that we could use to build recommendation systems. Our ensemble model could use some or all of these approaches.
Model 1: Popularity Based Recommendation System
We just create a list of most watched movies of all time and show it to every user.
Model 2: Classification Based Approach
We train a classifier which takes user & movie features as input and returns a score that tells us how likely is this user to watch this movie.
Model 3: Similarity-Based Recommendation System
We have a vector representation of each user and item as a vector. So we can simply compute the similarity between the vectors using something like cosine-similarity.
3.1 user-user similarity
For a given user, find other similar users. Recommend the movies that they watched, which haven’t been watched by the current user.
3.2 item-item similarity
For a given movie, find other similar movies. Recommend the movies which haven’t been watched by the current user.
3.3 user-item similarity
We compute the similarity between user-preferences and item-features. This means that we figure out movies similar to the user’s tastes.
Model 4: Matrix Factorisation
We figure out the latent embeddings for user and movies. Embeddings of a user represent latent user tastes in movies. Embeddings of a movie represent the attributes of a movie. Latent means “hidden”. The latent vectors/embeddings wouldn’t make much sense to us in isolation (just like the hidden layers of a neural-net doesn’t). However, the latent vectors are representations of what the model has learned about a particular user/item. We tune them by penalizing/rewarding the final outcome, ie, how good are the recommendations. These embeddings have been learned from the User-Item interaction matrix using matrix factorization algorithms.
While recommending new movies to the user, we multiply the latent vector of a user with all latent item vectors and display top-k items to the user based on the score.
Model 5: Hybrid Models
Along with the user-item interactions matrix, we could provide side-features (user & item features) available to us during training. This might help in better embeddings of user & items.
Model 6: Ensemble
This is what this article is about. We have different algorithms which model various aspects of user/item/interaction features while generating a rank-list. How do we combine the power of the individual models?
The different models described above return a score vector, using which we can rank the movies.
According to these models, the ranking of movies would be,
Popularity : [Titanic, Avengers, Iron-Man, Spider-Man ]
Similarity : [Iron-Man, Spider-Man, Avengers, Titanic]
Matrix-Factorisation: [Avengers, Iron-Man, Spider-Man, Titanic]
We want to combine these rank-lists into one.
Now, there can be different ways of Ensemble learning. In this article, we talk about Stacking. We formulate stacking as a supervised learning problem.
Let’s define a score function which we want to learn,
score(u,v) = W1*M1(u,v) +.....+ Mi(u,v) + ..... + Wn * Mn(u,v)where,score(u,v) = probability that the user u will watch movie v.wi = weight of model MiMi(u,v) = score given to a (user,movie) combination given by model Mi.
Great! So we want to learn this function score() using the predictions of the individual models. In order to do that, we need to create features and target from historical data. Then, we can train a machine learning model to optimize for a metric we care about.
Good, so how do we go about creating features? What are the targets that our model should predict? As you can see from the score() function, our features should be a vector [M1, M2, M3,…Mn] and our target should be 0/1 (0 = user didn’t watch the movie, 1 = user watched the movie). Now, we can treat this as a classification problem and train a model which will predict a probability that a user u will watch movie v.
Our features are predictions from individual models. For a (user, movie) combination we need to get predictions from each model in order to create the feature vector for the ensemble.
How do we choose the (user, movie) combination? We use sampling techniques. For every user, we create a list of positive and negative examples. Positive examples are the movies watched by the user. Negative examples are the ones that weren’t. The number of movies watched by any given user is negligible compared to the total number of movies available on our platform. Hence, we need to limit the number of negative examples that we sample.
Now, for every (user,movie_sample) we make predictions using individual models Mi. This creates the feature vector [M1, M2, M3,…, Mi,…Mn] for our ensemble. What’s the target? It’s 1 since it’s a positive sample and 0 for a negative sample.
This is how our data-frame should look like:
We can see that the scores for a (user, movie) combination by different models are not on the same scale. We need to normalize each column (using something like Min-Max scaler) to bring them between (0,1) range so that all columns are comparable. Another approach would be to normalize each score between 0 & 1 using the algorithm mentioned at the beginning of the article, (reference: SO ).
Now we can train a classification model that will tell us the likelihood that a user will watch a movie given the prediction of model Mi.
Think of each of these models Mi as experts. We wanted to learn the importance of each expert. Hence, we trained an ensemble to learn the weights that should be given to each expert. Let’s say that after training we learn the weights w1, w2, w3 to be 0.1, 0.5, 0.7. While making a prediction for a (user-i,movie-i) combination, we obtain model predictions m1, m2, m3 to be 0.5, 0.3, 0.1.
So, using our score() above we calculate
score(u,v) = w1*m1 + w2*m2 + w3*m3
= 0.1 * 0.5 + 0.5 * 0.3 + 0.7 * 0.1
If we have 10 movies available for prediction, we can calculate the score(u,v) for each user u & movie v and rank them based on the score (higher is better). We can then choose top k movies and present them to the user.
We don’t need to extract the weights and calculate the score ourselves. Since we have trained a model, we could just call ensemble_model.predict() with the features (M1, M2, M3,….Mn) and get the likelihood.
That’s it, folks. I’ve shared these ideas with you, now go ahead and implement it on your favorite dataset. All feedbacks are welcome. Thank you for reading. :-)