Time dependent Recommender Systems (part 2) ⌛

Quentin Bacuet
Published in Snipfeed · Jul 24, 2019

Introduction

This article is the sequel of my last one, where I presented simple methods and some neighborhood models for time dependent recommendations. In this post, I will present two new classes of models: the matrix factorization-based methods (MF-BPR and MF-TOP) and the neural network methods (RNN and CNN).

The 4 metrics that I will use are the NDCG, the MRR, the MAP and the HR. All of them are clearly explained in my previous article.

The models are trained using a filtered version of the MovieLens 20M dataset.

Sample from the ratings file

Matrix Factorization-based Methods

In this part, I will define two matrix factorization-based models. They are entirely described with one equation:

r(s,l,i) is the score of the ith item (the bigger, the more relevant) with respect to the sequence s (i.e. a user) and the last item l of the sequence s. Once we have the score for each i in I (the set of items), we can apply a simple argsort (from biggest to smallest) on the scores and take the k first elements to get the k most relevant items for that user. We define U as the embedding size of the session and V as the embedding size of the item.
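For example, the ranking step can be written with NumPy as follows (a small illustrative helper, not code from the original post):

```python
import numpy as np

def top_k_items(scores, k=10):
    """scores: 1-D array with r(s, l, i) for every item i in the catalogue."""
    # argsort in descending order and keep the indices of the k best items
    return np.argsort(-scores)[:k]
```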

The equation above can be decomposed into two parts. The first one is:

It takes as input the sequence of items s (the ‘long-term’ preferences of the user). wᵢ is a scalar that embodies the weight we put on the long-term prediction, e_seq is an embedding matrix for the sequence of size U x |I|, qᵢ is a weight vector of size U and b₁,ᵢ is a bias. s is the sequence of items transformed into a vector of size |I|, with a ‘1’ at position i if the ith item is in the sequence and ‘0’ otherwise (same as S-KNN).

The second part is:

It takes as input the last item in the sequence l and hence takes into account the ‘short-term’ preferences of the user. For this second part, e_item is an embedding matrix for the items of size V x |I|, qᵢ is a weight vector of size V and b₂,ᵢ is a bias. l is the last item transformed into a one-hot vector of size |I| (zero everywhere except a ‘1’ at the position of the item).

Then both parts are multiplied by a scalar and added together. This equation defines a score for the ith item: the bigger the score, the more likely the ith item is to be relevant for the user. For implementation purposes, instead of computing the score per item, we output directly a vector with the size of the number of items:
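A reconstruction consistent with the definitions above (the exact parameterization in the original equation image may differ; w₁, w₂, Q₁, Q₂ are my labels for the combination scalars and the stacked qᵢ vectors):

$$ r(s,l,i) = w_{1,i}\,\big(q_{1,i}^{\top}\, e_{\mathrm{seq}}\, s + b_{1,i}\big) \;+\; w_{2,i}\,\big(q_{2,i}^{\top}\, e_{\mathrm{item}}\, l + b_{2,i}\big) $$

and, stacked over all items, with ⊙ the element-wise product:

$$ r(s,l) = w_{1} \odot \big(Q_{1}\, e_{\mathrm{seq}}\, s + b_{1}\big) \;+\; w_{2} \odot \big(Q_{2}\, e_{\mathrm{item}}\, l + b_{2}\big) $$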

Below is the general code for the MF model, implementing the scoring function from above.
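A minimal sketch of such a scoring model, assuming a PyTorch implementation; the class name, attribute names and default embedding sizes are illustrative, not the original code:

```python
import torch
import torch.nn as nn

class MFScorer(nn.Module):
    """Factorized scorer combining 'long-term' (sequence) and 'short-term' (last item) parts."""
    def __init__(self, n_items, seq_dim=700, item_dim=700):
        super().__init__()
        # long-term part: maps the multi-hot sequence vector s (size |I|) to U dimensions
        self.e_seq = nn.Linear(n_items, seq_dim, bias=False)  # plays the role of e_seq (U x |I|)
        self.q_seq = nn.Linear(seq_dim, n_items)              # rows are the q_i vectors, bias is b_1
        # short-term part: embeds the one-hot last item l into V dimensions
        self.e_item = nn.Embedding(n_items, item_dim)         # plays the role of e_item (V x |I|)
        self.q_item = nn.Linear(item_dim, n_items)            # rows are the q_i vectors, bias is b_2
        # per-item scalars weighting the two parts
        self.w_long = nn.Parameter(torch.ones(n_items))
        self.w_short = nn.Parameter(torch.ones(n_items))

    def forward(self, seq_multi_hot, last_item):
        # seq_multi_hot: (batch, |I|) float multi-hot, last_item: (batch,) item ids
        long_term = self.q_seq(self.e_seq(seq_multi_hot))   # (batch, |I|)
        short_term = self.q_item(self.e_item(last_item))    # (batch, |I|)
        return self.w_long * long_term + self.w_short * short_term  # one score per item
```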

We tested two different loss functions, described below. I use the following notation: S_N is a set of negative items (every item except the expected one), r(s,l,i) is the score predicted for the expected item i, r(s,l,j) are the scores predicted for the items that are not expected and σ is the sigmoid function.

MF-BPR

This loss first takes the difference between the score of the expected item and an unexpected item. Then we map the result to (0, 1) with a sigmoid, take the log, and average over all the items in the negative sample, multiplied by -1.
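In symbols, using the notation above (this is the standard BPR form, which matches the description):

$$ \mathcal{L}_{\mathrm{BPR}} = -\frac{1}{|S_N|} \sum_{j \in S_N} \log \sigma\big(r(s,l,i) - r(s,l,j)\big) $$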

Consequently, the greater the difference between the positive and negative item scores, the greater the sigmoid output will be (between 0 and 1), the greater the log output will be (between -inf and 0), and hence, when multiplied by -1, the smaller the loss will be. So, to minimize the loss, we would expect the difference between the score of the positive item and the negative item to be maximized!

Below I added the specific code for the BPR loss function:
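A minimal sketch of this loss, assuming the PyTorch scorer above; the function name and tensor shapes are illustrative:

```python
import torch.nn.functional as F

def bpr_loss(scores, pos_item, neg_items):
    """BPR: minimize -mean(log sigma(r_pos - r_neg)).
    scores: (batch, |I|), pos_item: (batch,), neg_items: (batch, n_neg)."""
    pos = scores.gather(1, pos_item.unsqueeze(1))   # score of the expected item, (batch, 1)
    neg = scores.gather(1, neg_items)               # scores of the negative sample, (batch, n_neg)
    return -F.logsigmoid(pos - neg).mean()
```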

MF-TOP

The TOP loss first takes the difference between the score of an unexpected item and the expected item, then we map the result between 0 and 1 with a sigmoid. The second term is a regularization term that makes sure that the value of the unexpected score is low. Finally we take the average on all the items in the negative sample.
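Written with the notation above, a form consistent with this description (essentially the TOP1 loss):

$$ \mathcal{L}_{\mathrm{TOP}} = \frac{1}{|S_N|} \sum_{j \in S_N} \Big( \sigma\big(r(s,l,j) - r(s,l,i)\big) + \sigma\big(r(s,l,j)^{2}\big) \Big) $$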

Consequently, if the difference between the positive and the negative score is big and the negative score is small then the overall loss will be small. The regularization term here is important to make sure that the score of the negative terms goes to 0.

Below I added the specific code for the TOP loss function:
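A minimal sketch of this loss, again assuming the PyTorch scorer above (names and shapes are illustrative):

```python
import torch

def top_loss(scores, pos_item, neg_items):
    """TOP1-style loss: sigma(r_neg - r_pos) plus a regularizer pushing r_neg towards 0."""
    pos = scores.gather(1, pos_item.unsqueeze(1))   # (batch, 1)
    neg = scores.gather(1, neg_items)               # (batch, n_neg)
    return (torch.sigmoid(neg - pos) + torch.sigmoid(neg ** 2)).mean()
```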

Results

I did a hyperparameter optimization over the embedding sizes of both the session and the item, and plotted the different metrics on the validation set for both the BPR and TOP losses.

HR:

MAP:

MRR:

NDCG:

Finally, we can see that the best models for both losses are the ones with the biggest embedding size (700). Hence, here are the results on the test set for the best model of each of the two losses:

BPR Loss
Top Loss

Neural Nets

In this part I will present the more advanced time-dependent methods: the neural networks that model time dependencies in sequences, such as the RNN and the CNN.

RNN

A recurrent neural network (RNN) is a class of NN where neurons inside the same cell are connected together. This allows the NN to model time dependencies inside a sequence input. They are broadly used for music and text datasets (for generation and classification), but can be used for any task involving sequences and time dependencies between the elements of the input, such as stock price forecasting or recommendations.

A RNN cell can be better understood if unrolled as in the representation below. It can be decomposed into t (the size of the sequence) “classic” neurons. Note that the parameters are shared between the neurons of the same cell, hence the size of the sequence doesn’t influence the number of parameters inside the cell.

Above we see that each neuron (except the first one) has two different inputs: the input from the sequence, and the output of the previous neuron in the sequence. The exact formula for the most basic RNN is:
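$$ h_t = \tanh\big(W_x x_t + W_h h_{t-1} + b\big) $$

where xₜ is the input at step t, hₜ is the hidden state passed on to the next step, and Wₓ, W_h and b are the weights and bias shared across all steps.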

Hence, a RNN can be understood as multiple copies of the same neuron, each passing a representation of the sequence (an embedding) seen so far to the next neuron.

In our case we didn’t use classic RNNs, as they suffer from vanishing and exploding gradient problems; we used LSTMs. The general principle stays the same: they just use a more complex internal structure to tackle the problem above, by introducing memory cells and forget gates.

LSTM cell

For the case of time-dependent recommendation, I used the architecture below:

RNN — Architecture, with the name of the layer and its output shape
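A minimal sketch of this kind of embedding + LSTM + softmax architecture, assuming a PyTorch implementation (layer sizes and names are illustrative):

```python
import torch.nn as nn

class LSTMRecommender(nn.Module):
    """Embedding -> stacked LSTM -> logits over the item catalogue."""
    def __init__(self, n_items, emb_dim=256, hidden_dim=256, n_layers=1):
        super().__init__()
        self.embed = nn.Embedding(n_items, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=n_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_items)

    def forward(self, item_seq):
        # item_seq: (batch, t) item ids; returns (batch, t, |I|) logits, one prediction per step
        h, _ = self.lstm(self.embed(item_seq))
        return self.out(h)
```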

I trained the model by taking each sequence of content seen by the user as input (without the last item), with the same sequence shifted by one to the left (without the first item) as target. I used the cross entropy loss with the Adam optimizer. Below, I plotted the training loss against the number of LSTM layers:

I did a small grid search on the number of LSTM layers and plotted the different metrics below.

HR:

MAP:

MRR:

NDCG:

We can see that the best number of LSTM layers in my case is 1. Hence, here are the results on the test set for the best model:

CNN

A convolutional neural network (CNN) is a class of feed-forward NN in which the pattern of connections between the neurons is inspired by the animal visual cortex. Initially, they were mostly used in fields involving images and videos.

Compared to fully connected NNs, CNNs are translation invariant, making them perfect for modeling data that intrinsically possesses this property, such as images, videos and recommendations (i.e. a 1-D image).

A CNN neuron can be better understood as a filter (in the sense of computer vision). The same filter slides from one side of the data sequence to the other (using weight sharing). In the case of recommendation systems, one could directly apply the convolution to the raw data, like below:

The 1-D convolution. The green part is the input and the blue part the output.

But in my case I first used an embedding layer, which allowed me to use two different CNNs, a horizontal one and a vertical one:

CNN — Architecture, with the name of the layer, its output shape, and an approximation of the number of parameters/weights.

We use E as the embedding size, F the number of filters, h the filter size and D the size of the fully connected layer.

The two CNNs take as input the embedded sequence, i.e. a t x E matrix where each line represents an embedded item:

  • The vertical CNN is made of F neurons, each of which is a filter of size h x E that slides vertically over the embedded matrix. It models the dynamics between the items in the sequence.
  • The horizontal CNN is made of F neurons, each of which is a filter of size t that slides horizontally over the embedded matrix. It is a weighted sum of the items along each embedding latent dimension.

I trained the model by taking the last item as the target and the n items before it as the input. I used the cross entropy loss with the Adam optimizer.
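A minimal sketch of this two-branch architecture, in the spirit of the convolutional sequence embedding model of reference 3, assuming a PyTorch implementation (filter counts, sizes and names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNRecommender(nn.Module):
    """Embedding layer followed by a 'vertical' and a 'horizontal' convolution over the t x E matrix."""
    def __init__(self, n_items, seq_len, emb_dim=500, n_filters=16, filter_size=3, dense_dim=512):
        super().__init__()
        self.embed = nn.Embedding(n_items, emb_dim)
        # vertical CNN: filters of size h x E sliding along the time axis
        self.conv_v = nn.Conv2d(1, n_filters, kernel_size=(filter_size, emb_dim))
        # horizontal CNN: filters of size t x 1 sliding along the embedding dimensions
        self.conv_h = nn.Conv2d(1, n_filters, kernel_size=(seq_len, 1))
        self.dense = nn.Linear(n_filters + n_filters * emb_dim, dense_dim)
        self.out = nn.Linear(dense_dim, n_items)

    def forward(self, item_seq):
        # item_seq: (batch, t) item ids -> (batch, 1, t, E)
        x = self.embed(item_seq).unsqueeze(1)
        v = F.relu(self.conv_v(x))                       # (batch, F, t-h+1, 1)
        v = F.max_pool2d(v, (v.size(2), 1)).flatten(1)   # (batch, F)
        h = F.relu(self.conv_h(x)).flatten(1)            # (batch, F * E)
        z = F.relu(self.dense(torch.cat([v, h], dim=1)))
        return self.out(z)                               # logits over the catalogue (cross entropy target)
```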

I did a grid search on the number of dense layers and the embedding size. I plotted the different metrics below.

Loss:

HR:

MAP:

MRR:

NDCG:

We can see that the best model in my case is the one with 1 dense layer and an embedding size of 500. Hence, here are the results on the test set for the best model:

Comparison

We can now compare all our models:

The best model when looking at all the metrics is the convolutional neural net (CNN).

Conclusion

In this article and my previous one, I presented 4 different classes of methods: the simple methods that apply simple rules to the last item of the sequence, the neighborhood methods that use distance metrics, the matrix factorization methods that learn a latent representation of the sequences and items, and finally the NN methods that learn complex non-linear relations. CNNs are well known for image related tasks, but here we’ve shown that they can yield the best performance for time dependent recommendations!

References

  1. Shoujin Wang, Longbing Cao and Yan Wang, A Survey on Session-based Recommender Systems, Feb 2019
  2. Malte Ludewig and Dietmar Jannach, Evaluation of Session-based Recommendation Algorithms, Oct 2018
  3. Jiaxi Tang and Ke Wang, Personalized Top-N Sequential Recommendation via Convolutional Sequence Embedding, Feb 2018
