Time-Dependent Recommender Systems (Part 1) ⏱️

Quentin Bacuet · Published in Snipfeed · 9 min read · Jul 4, 2019

In my previous posts (here and here) I gave an overview of state-of-the-art general recommender systems. However, those models did not take time dependencies into consideration: that is, the order in which the user has interacted with the items. But don’t worry! In this series of posts we will dive deep into the world of time-dependent recommender systems.

Why take time into consideration for our recommender system?

Time influences user preferences in several ways. The main effects are:

  1. Time bias: the interests of a whole group of users may change over time. For instance, if you are recommending music and a new style is popping up (let’s say trap music), your model should be able to integrate that.
  2. User bias shifting: a user may change his/her rating habits over time.
  3. Item bias: the popularity of an item changes over time.
  4. User preference shifting: a user’s tastes evolve.

In this series of articles, I will:

➡️ Describe the dataset used, present the main offline metrics we will use to assess our models, and discuss the theory, implementation and results of baseline methods (part 1)
➡️ Present five state-of-the-art deep-learning methods, and discuss their intuition/theory and results compared to the baselines (part 2)
➡️ Compare the results, advantages and disadvantages of the latest collaborative filtering methods implemented here with the time-dependent methods developed in part 1 and part 2.

Each post can be read independently. Don’t hesitate to clap if you like it 👏!

Dataset: MovieLens 20M

Initial Dataset

For the analysis we will use the well-known dataset MovieLens 20M.


This dataset contains more than 20 million ratings from MovieLens, a movie recommendation service. Below is a sample of the dataframe:

Sample from the ratings file

The dataset lists 138K users and over 27K movies. We then binarize the ratings (we only keep the positive ones). After that, we reorganize the dataset such that each user is represented as the list of movieIds that they rated positively.

Filtering of the Dataset

We then do some filtering: we drop all the sessions that are too short or too long (keeping those with 5–300 movies) and all the items that have not been reviewed by enough users (at least 5 users). We are left with:

Number of users:  97096
Number of items: 13512
Sparsity: 0.5589%

Train-Validation-Test Dataset

To assess the quality of our models, we will split the dataset into 3 subsets, one for training, one for validation, one for testing. We’ll use the first subset to train the model, the second to select the best model during the training, and the last one to get the final metrics.

The three sets contain disjoint sets of users.

The Metrics: HR, NDCG, MRR and MAP

We will use 4 different metrics for completeness and to ease the comparison with other papers or articles. All the metrics measure the quality and utility of the order of the items in our recommendations.

For the metrics below we will use different notations: I is the indicator function, elemᵢ the ith item in the ordered recommendation, and test the set of items that we wish to obtain in our recommendations (our target, if you will).

Note that all the metrics presented below are computed for a single user and hence should be averaged over all the users in the test/validation set.

Hit Rate (HR)

The first metric will be the hit rate (HR). The hit rate is simply defined as:
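With the notation above, a standard formulation is:

```latex
\mathrm{HR}@k \;=\; \frac{\sum_{i=1}^{k} \mathbb{I}\left(\mathrm{elem}_i \in \mathrm{test}\right)}{\min\left(k,\; |\mathrm{test}|\right)}
```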

More concretely, the HR is the number of the k most relevant items we recommended that are part of the expected set, divided by k (to be precise, by the minimum between k and the number of relevant items).

Hence this metric ranges from 0 (no recommended element is relevant) to 1 (all the recommended elements are relevant).

Normalized Discounted Cumulative Gain (NDCG)

The second metric will be the NDCG. We first need to define the Discounted Cumulative Gain (DCG). The higher the DCG is, the better. The DCG@k is defined as:
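With binary relevance (an item is either in the target set or not), a standard formulation is:

```latex
\mathrm{DCG}@k \;=\; \sum_{i=1}^{k} \frac{\mathbb{I}\left(\mathrm{elem}_i \in \mathrm{test}\right)}{\log_2(i+1)}
```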

The NDCG is the normalized cousin of the DCG, which means the scores are projected between 0 and 1 so they are comparable across models:
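In symbols:

```latex
\mathrm{NDCG}@k \;=\; \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}
```

where IDCG@k is the DCG@k of an ideal ordering (all relevant items ranked first).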

Mean Reciprocal Rank (MRR)

The third metric will be the MRR. The MRR is simply defined as, with rank₁ the rank of the first relevant recommended item:
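In symbols:

```latex
\mathrm{MRR} \;=\; \frac{1}{\mathrm{rank}_1}
```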

The MRR is the inverse of the rank of the first relevant item that we recommended.

Hence it goes from:

  • 0: no recommended item is relevant
  • 1/k: the first relevant item is the last recommended item
  • 1: the first relevant item is also the first recommended item

Mean Average Precision (MAP)

The last metric will be the MAP. To define the MAP we first need to define the precision at cutoff k, or P(k). The precision is the number of our k most relevant recommendations that are part of the expected set, divided by k.

Then the MAP is defined as:
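With the notation above, a standard formulation of both quantities is:

```latex
P(k) \;=\; \frac{1}{k}\sum_{i=1}^{k}\mathbb{I}\left(\mathrm{elem}_i \in \mathrm{test}\right),
\qquad
\mathrm{MAP}@k \;=\; \frac{\sum_{i=1}^{k} P(i)\,\mathbb{I}\left(\mathrm{elem}_i \in \mathrm{test}\right)}{\min\left(k,\; |\mathrm{test}|\right)}
```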

As always, this metric ranges from 0 (no recommended element is relevant) to 1 (all the recommended elements are relevant).

Example

To illustrate those pretty abstract metrics, here’s a short example:

Note that the recommendations have an order.
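To make this concrete in code, here is a small, self-contained sketch of the four metrics for a single user (the function names are my own):

```python
import math

def hr(recs, test, k):
    # fraction of the top-k recommendations that are in the target set
    hits = sum(1 for item in recs[:k] if item in test)
    return hits / min(k, len(test))

def ndcg(recs, test, k):
    # DCG with binary relevance, normalized by the ideal DCG
    dcg = sum(1 / math.log2(i + 2) for i, item in enumerate(recs[:k]) if item in test)
    idcg = sum(1 / math.log2(i + 2) for i in range(min(k, len(test))))
    return dcg / idcg

def mrr(recs, test):
    # inverse rank of the first relevant recommended item
    for i, item in enumerate(recs):
        if item in test:
            return 1 / (i + 1)
    return 0.0

def ap(recs, test, k):
    # average of the precisions at each relevant position
    score, hits = 0.0, 0
    for i, item in enumerate(recs[:k]):
        if item in test:
            hits += 1
            score += hits / (i + 1)
    return score / min(k, len(test))

# The order matters: item 2 is ranked 2nd and item 4 is ranked 4th.
recs, test = [1, 2, 3, 4], {2, 4}
print(hr(recs, test, 4), mrr(recs, test), ap(recs, test, 4))  # 1.0 0.5 0.5
```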

Simple Methods

The first methods that we are going to describe are some very simple ones that make predictions using only the last item seen by the user.

These methods are trained by assigning a score s (between 0 and 1) to each pair of items a and b: s(a,b). For the prediction, we simply take the last item i in the sequence of items seen by the user and output the k items u with the highest score s(i,u). More precisely, if the user has seen the sequence of items [i₁,i₂,i₃,…,iₙ], we will output:

For implementation purposes, training such a method reduces to building a dictionary that maps each item to a list of other items sorted by their score.
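As a sketch, assuming the trained scores are stored in a dict keyed by item pairs, the prediction step then looks like this (the function name is mine):

```python
def recommend(scores, last_item, k):
    # collect every item b scored against the last item, then rank by s(last_item, b)
    candidates = {b: s for (a, b), s in scores.items() if a == last_item}
    return sorted(candidates, key=candidates.get, reverse=True)[:k]

scores = {(1, 2): 3, (1, 3): 1, (2, 3): 5}
print(recommend(scores, 1, 2))  # [2, 3]
```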

Simple Association

This method is the simplest one we will see in this entire series. Below I use a new notation, #(a,s): the number of times ‘a’ appears in the sequence s:
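With S denoting the set of training sequences (my notation), a formulation consistent with this description is:

```latex
s(a,b) \;=\; \sum_{s \in S} \#(a,s)\cdot\#(b,s)
```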

In a more concrete way, the score is the number of co-occurrences of the two items in a sequence.

Below is a simple implementation of the training of such a method, using the Counter data structure provided directly by Python.
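A minimal version (the function name is my own): every ordered pair of items that co-occur in a session increments that pair’s count.

```python
from collections import Counter
from itertools import permutations

def train_simple_association(sequences):
    # s(a, b) = number of co-occurrences of the ordered pair (a, b) in the sessions
    scores = Counter()
    for seq in sequences:
        for a, b in permutations(seq, 2):
            scores[(a, b)] += 1
    return scores

scores = train_simple_association([[1, 2, 3], [1, 2]])
print(scores[(1, 2)], scores[(2, 3)])  # 2 1
```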

We then trained the model on the training set and reported below the test set metrics:

  • MRR@100 : 0.035
  • HR@100 : 0.033
  • NDCG@100 : 0.025
  • MAP@100 : 0.001

Markov Chains

This method revolves around the concept of Markov-Chains. The score s(a,b) can be computed as:
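With S denoting the set of training sequences (my notation), a formulation consistent with this description is:

```latex
s(a,b) \;=\; \sum_{s \in S} \bigl|\{\, i \;:\; s_i = a \;\wedge\; s_{i+1} = b \,\}\bigr|
```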

In a more concrete way, the score is the number of co-occurrences of the two consecutive items in a sequence.

Below is a simple implementation of the training of such a method, using the Counter data structure provided directly by Python.
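A minimal version (the function name is my own): only consecutive pairs of items are counted.

```python
from collections import Counter

def train_markov(sequences):
    # s(a, b) = number of times b immediately follows a in a session
    scores = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            scores[(a, b)] += 1
    return scores

scores = train_markov([[1, 2, 3], [1, 2]])
print(scores[(1, 2)], scores[(2, 3)])  # 2 1
```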

We then trained the model on the training set and reported below the test set metrics:

  • MRR@100 : 0.076
  • HR@100 : 0.036
  • NDCG@100 : 0.030
  • MAP@100 : 0.001

Sequential Rules

Finally, the last method is very similar to the first one; the difference is that it puts more weight during training on items that are close to each other. The score s(a,b) can be computed as:
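With S the set of training sequences and L_max the maximum sequence length (my notation), a formulation consistent with this description is:

```latex
s(a,b) \;=\; \sum_{s \in S}\;\sum_{\substack{i<j \\ s_i = a,\; s_j = b}} \bigl(L_{\max} - (j-i)\bigr)
```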

More concretely, the score is the number of co-occurrences of the two items in a sequence, multiplied by a factor. This factor is the max length of all the sequences (in our case 300, as we filtered out longer sequences) minus the distance between the two items. Hence, the closer the items are, the bigger the factor, and thus the bigger their score.

Below is a simple implementation of the training of such a method, using the Counter data structure provided directly by Python.
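A minimal version (the function name is my own): each ordered pair (a before b) is weighted by the max length minus the distance between the two positions.

```python
from collections import Counter
from itertools import combinations

MAX_LEN = 300  # longest session kept after filtering

def train_sequential_rules(sequences):
    # s(a, b) = sum over ordered pairs of (MAX_LEN - distance between positions)
    scores = Counter()
    for seq in sequences:
        for i, j in combinations(range(len(seq)), 2):
            scores[(seq[i], seq[j])] += MAX_LEN - (j - i)
    return scores

scores = train_sequential_rules([[1, 2, 3]])
print(scores[(1, 2)], scores[(1, 3)])  # 299 298
```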

We then trained the model on the training set and reported below the test set metrics:

  • MRR@100 : 0.036
  • HR@100 : 0.033
  • NDCG@100 : 0.025
  • MAP@100 : 0.001

Neighborhood Methods🏠

Item-KNN (I-KNN)

Item-KNN uses the same prediction method as the simple models above: we predict the next item based only on the last item in the sequence of items seen by the user. To do so, we return the items closest to that last one.

To get the closest items, we train a KNN on the items. First, we transform each item into a vector whose size is the number of users, with a ‘1’ at position i if the ith user has seen the item and a ‘0’ otherwise. Secondly, using those representations, we compute the (cosine) distances between all the items and take the k smallest distances to get k recommendations. To speed up the computation, one can use tricks such as Locality Sensitive Hashing (LSH).
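A brute-force, numpy-only sketch of the idea (no LSH; the matrix layout and function name are my own assumptions):

```python
import numpy as np

def item_knn(item_user, last_item, k):
    # item_user: binary matrix (n_items, n_users); row i is the vector of item i
    norms = np.linalg.norm(item_user, axis=1)
    # cosine similarity of every item with the last item seen
    sims = item_user @ item_user[last_item] / (norms * norms[last_item] + 1e-12)
    sims[last_item] = -np.inf  # never recommend the query item itself
    return [int(i) for i in np.argsort(-sims)[:k]]

# items 0 and 1 were seen by the same users, item 2 by a different one
items = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]], dtype=float)
print(item_knn(items, 0, 1))  # [1]
```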

We then trained the model on the training set and reported below the test set metrics:

  • MRR@100 : 0.189
  • HR@100 : 0.310
  • NDCG@100: 0.167
  • MAP@100: 0.006

Sequence-KNN (S-KNN)

(EDIT: Note that this method is NOT a Time-dependent method. As it is another neighborhood method, I added it for completeness and comparison)

S-KNN is the first method presented in this series that actually uses the whole sequence of items seen by the user, and not only the last one, to make predictions (finally ✌️).

To get the next item from an input sequence s, we train a KNN on the sequences. First, we transform each sequence into a vector whose size is the number of items, with a ‘1’ at position i if the ith item is in the sequence and a ‘0’ otherwise. Secondly, using those representations, we get the n closest sequences (let’s name them S) by computing the cosine distances between all the sequences. Finally, we obtain a new vector by summing all the vectors in S, weighting each by its similarity with s. This gives a vector whose size is the number of items; to get the next item, we take its argmax.
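A brute-force, numpy-only sketch (function name is mine; masking out items already present in the query session is my own assumption, so that the argmax yields a genuinely new item):

```python
import numpy as np

def sknn_scores(train_vecs, query, n_neighbors):
    # train_vecs: binary matrix (n_sequences, n_items); query: vector (n_items,)
    sims = train_vecs @ query / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(query) + 1e-12)
    top = np.argsort(-sims)[:n_neighbors]                      # the n closest sequences
    scores = (sims[top, None] * train_vecs[top]).sum(axis=0)   # similarity-weighted sum
    scores[query > 0] = -np.inf                                # mask items already seen
    return scores

train = np.array([[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 0, 1]], dtype=float)
query = np.array([1, 1, 0, 0], dtype=float)
print(int(np.argmax(sknn_scores(train, query, 2))))  # 2
```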

We then trained the model on the training set and reported below the test set metrics:

  • MRR@100 : 0.088
  • HR@100 : 0.269
  • NDCG@100: 0.126
  • MAP@100: 0.003

Results

Here’s a summary of the different performances:

Summary of the different methods

We can see that I-KNN achieves the best results on the MRR, the HR, the NDCG and the MAP!

Conclusion

We can conclude from this article that neighborhood methods beat the “simple methods” I described on every metric! From a practical standpoint, however, I-KNN (the best method so far 😜) can be harder to scale and may become impossible to compute in a reasonable time without Locality Sensitive Hashing (LSH). For production purposes, one might therefore still prefer the simple methods, as they are easy to implement, easy to understand and can be much faster computationally!

In my next post, I will present other methods such as matrix factorization, CNN and much more, so stay tuned!

References:

  1. Shoujin Wang, Longbing Cao, Yan Wang, “A Survey on Session-based Recommender Systems”, Feb 2019
  2. Malte Ludewig, Dietmar Jannach, “Evaluation of Session-based Recommendation Algorithms”, Oct 2018
