Implementation of deep generative models for recommender systems in Tensorflow🔮

Implementation of VAEs and GANs

Quentin Bacuet
Published in Snipfeed
10 min read · Apr 29, 2019

This article is the sequel to my last one. I will show how to implement the VAE and the GAN for recommendation systems, with code examples. I will focus on the implementation rather than the theory behind it. Nevertheless, I will still give some insights into the loss and the main concept behind each model.

We will implement the models using eager execution, the new execution mode in TensorFlow.

The two metrics that we will use are the NDCG, which measures the quality and utility of the order of the items in our recommendations, and the personalization index, which measures how unique the recommendations are to each user.
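As a rough illustration of the first metric, here is a minimal NumPy sketch of NDCG@k with binary relevance. The function name `ndcg_at_k` and the toy arrays are assumptions made for this example; the numbers reported in this article rely on the definition from the previous article, which may differ in its details.

```python
import numpy as np

def ndcg_at_k(scores, relevant, k=100):
    """NDCG@k for a single user, with binary relevance.

    scores:   1-D array of predicted scores, one per item.
    relevant: 1-D 0/1 array marking the held-out items the user picked.
    """
    k = min(k, len(scores))
    # Indices of the k highest-scoring items, best first.
    top_k = np.argsort(-scores)[:k]
    # Position discounts 1 / log2(rank + 1) for ranks 1..k.
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = np.sum(relevant[top_k] * discounts)
    # Ideal DCG: every relevant item is ranked before any irrelevant one.
    n_rel = int(min(relevant.sum(), k))
    idcg = np.sum(discounts[:n_rel])
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: 4 items, the user actually picked items 0 and 3.
scores = np.array([0.9, 0.1, 0.8, 0.3])
relevant = np.array([1, 0, 0, 1])
print(ndcg_at_k(scores, relevant, k=2))
```

A ranking that places all picked items at the top scores 1.0; every relevant item pushed out of the top k lowers the value.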

The models are trained using the MovieLens 20M dataset.

Sample from the ratings file

VAE

A Variational Autoencoder (VAE) is composed of two parts: the encoder, which compresses the data through a bottleneck, and the decoder, which transforms the encoding back into its original form. Between these two neural networks there is a sampling layer: the output of the encoder is split into two vectors, the mean and the variance, which are then used for Gaussian sampling. This sample is used as the input of the decoder.

Unlike other models, a VAE gives a probability distribution over all items in one forward pass.

VAE diagram

In the context of recommendation systems, VAEs can be used to predict new recommendations. The input and the output are both the click vector (it is usual for an autoencoder that the input and the output are the same), and we use dropout after the input layer. This means that the model has to reconstruct the click vector while some elements of the input are missing, hence learning to predict the recommendations for a given click vector.

I described the architecture in the diagram below. As explained above, the input and output are the same.

In this part, I will define the loss used. The items of interest to a user are modeled with a multinomial distribution. To be more precise, they are modeled as multinomial cell probabilities (an extension of the binomial case): each item i has a probability pᵢ of being of interest to the user, and xᵢ is the number of times item i is picked. Therefore, the vector of picked items (x₁, x₂, . . . , xₙ) follows a multinomial distribution, with pmf:

Multinomial distribution

To obtain the loss, I used the maximum likelihood estimation (MLE) of the multinomial cell probabilities. This is equivalent to maximizing the pmf of the multinomial distribution. To make everything clearer I used the log-likelihood. It is defined as:

Log-likelihood of the multinomial distribution

We can now take the argmax of the log-likelihood. As the maximization is with respect to the probabilities pᵢ, the first two terms of the equation do not depend on pᵢ and can be dropped.

Argmax of the LL

If we consider f(x) as the output of the decoder and the last layer as a softmax layer (such that the output is a probability distribution), the loss can be defined as:

Loss function
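The four formulas captioned above can be written out as follows. This is a reconstruction from the definitions in the text, not a copy of the original figures; N = Σᵢ xᵢ (the total number of picks) is notation introduced here.

```latex
% Multinomial pmf, with N = \sum_{i=1}^{n} x_i
P(x_1, \dots, x_n) = \frac{N!}{\prod_{i=1}^{n} x_i!} \prod_{i=1}^{n} p_i^{x_i}

% Log-likelihood
\log L(p) = \log N! \;-\; \sum_{i=1}^{n} \log x_i! \;+\; \sum_{i=1}^{n} x_i \log p_i

% Argmax: the first two terms do not depend on p
\hat{p} = \arg\max_{p} \sum_{i=1}^{n} x_i \log p_i

% Reconstruction loss, with p_i = f(x)_i the softmax output of the decoder
\mathcal{L}_{\text{rec}}(x) = -\sum_{i=1}^{n} x_i \log f(x)_i
```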

This loss alone is not sufficient, as shown in other articles. We would like to have, to a certain extent, a smooth interpolation between the different points in the latent variable space (between the implicit clusters of users in our case). To get it, we have to add a regularization term.

Latent space for different losses— Source

To do so, we add the KL divergence between the distribution of the latent variables produced by the encoder and a prior distribution. In our case, the prior is a multivariate Gaussian with a diagonal covariance matrix, i.e. a collection of n independent Normal random variables. This regularization term penalizes the model if the encoder's distribution is far from the Gaussian prior. It forces the latent space to be distributed close to a Normal distribution.

KL-Divergence of the Gaussian distribution

To be able to control the strength of the regularization we multiply it by β.

Final Loss
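Written out for a standard Normal prior, the two formulas captioned above take the following form. This is a reconstruction, assuming the encoder outputs a mean μⱼ and a variance σⱼ² for each latent dimension; d, the size of the latent space, is notation introduced here to avoid a clash with the number of items n.

```latex
% KL divergence between the encoder distribution N(\mu, \mathrm{diag}(\sigma^2))
% and the prior N(0, I)
\mathrm{KL} = -\frac{1}{2} \sum_{j=1}^{d} \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right)

% Final loss: reconstruction term plus the \beta-weighted regularization
\mathcal{L}(x) = -\sum_{i=1}^{n} x_i \log f(x)_i \;+\; \beta \cdot \mathrm{KL}
```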

Finally, to use back-propagation and compute the gradient across the sampling layer, we need a trick, because the gradient cannot be propagated directly through a random sampling step. The reparameterization trick overcomes this problem by using the equation z = ε × σ + μ, with ε ~ N(0,1). The randomness now lives only in ε, so differentiating with respect to σ and μ becomes possible.

Vanilla VAE — Implementation

In this section I will describe step by step how to implement a VAE.

The encoder net is defined below, using tf.keras.Sequential. It is a simple NN with an input layer whose size is the number of items, a dense layer with a tanh activation, and an output layer with a linear activation.
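A minimal sketch of what such an encoder could look like in TensorFlow 2.x. The sizes `n_items`, `hidden_dim` and `latent_dim`, the dropout rate, and the choice of packing the mean and the log-variance into one output of size `2 * latent_dim` are assumptions for illustration, not the values behind the reported results.

```python
import tensorflow as tf

n_items = 1000     # assumed number of items in the catalogue
hidden_dim = 600   # assumed width of the hidden layer
latent_dim = 200   # assumed size of the latent space

encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(n_items,)),
    # Dropout on the click vector, so the model must reconstruct missing clicks.
    tf.keras.layers.Dropout(0.5),  # assumed rate
    tf.keras.layers.Dense(hidden_dim, activation="tanh"),
    # Linear output: first half is the mean, second half the log-variance.
    tf.keras.layers.Dense(2 * latent_dim),
])

clicks = tf.zeros((4, n_items))  # dummy batch of 4 click vectors
encoded = encoder(clicks)
print(encoded.shape)
```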

The decoder net is a simple NN with an input layer of the latent dimension, a dense layer with a tanh activation and an output layer of size number of items.
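A matching sketch of the decoder, under the same caveats: `n_items`, `hidden_dim` and `latent_dim` are assumed sizes. Here the last layer returns raw logits and the softmax is left to the loss for numerical stability, which is a design choice of this sketch.

```python
import tensorflow as tf

n_items = 1000     # assumed number of items in the catalogue
hidden_dim = 600   # assumed width of the hidden layer
latent_dim = 200   # assumed size of the latent space

decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(latent_dim,)),
    tf.keras.layers.Dense(hidden_dim, activation="tanh"),
    # One logit per item; a softmax over them gives the item probabilities.
    tf.keras.layers.Dense(n_items),
])

z = tf.random.normal((4, latent_dim))  # dummy latent samples
logits = decoder(z)
probs = tf.nn.softmax(logits, axis=1)
print(probs.shape)
```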

Below we define some helper functions. The encode function takes a click vector and outputs the mean and the log of the variance. The decode function takes a Gaussian sample (z) and outputs the reconstruction. The reparameterization function is the implementation of the trick described above (note that the log-variance is multiplied by 0.5 before taking the exponential, which yields the standard deviation σ).
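These helpers could be sketched as follows. The small encoder and decoder are redefined so that the block runs on its own; their sizes, and the convention that the encoder output holds the mean followed by the log-variance, are assumptions.

```python
import tensorflow as tf

n_items, hidden_dim, latent_dim = 1000, 600, 200  # assumed sizes

encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(n_items,)),
    tf.keras.layers.Dense(hidden_dim, activation="tanh"),
    tf.keras.layers.Dense(2 * latent_dim),
])
decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(latent_dim,)),
    tf.keras.layers.Dense(hidden_dim, activation="tanh"),
    tf.keras.layers.Dense(n_items),
])

def encode(x):
    """Click vectors -> (mean, log-variance) of the latent Gaussian."""
    mean, logvar = tf.split(encoder(x), num_or_size_splits=2, axis=1)
    return mean, logvar

def reparameterize(mean, logvar):
    """z = eps * sigma + mu, with sigma = exp(0.5 * log(sigma^2))."""
    eps = tf.random.normal(shape=tf.shape(mean))
    return eps * tf.exp(0.5 * logvar) + mean

def decode(z):
    """Latent sample -> one logit per item."""
    return decoder(z)

x = tf.zeros((4, n_items))  # dummy batch of click vectors
mean, logvar = encode(x)
z = reparameterize(mean, logvar)
logits = decode(z)
```

Because the noise `eps` is drawn outside the network, gradients flow through `mean` and `logvar` as through any other deterministic operation.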

The loss is defined below. We simply implement the loss described above.
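A sketch of this loss, assuming the decoder returns logits and the encoder returns the mean and the log-variance. The function name `vae_loss` and the default value of `beta` are assumptions for illustration.

```python
import tensorflow as tf

def vae_loss(x, logits, mean, logvar, beta=0.2):
    """Multinomial negative log-likelihood plus beta-weighted KL term."""
    # log f(x)_i, computed in a numerically stable way from the logits.
    log_probs = tf.nn.log_softmax(logits, axis=1)
    neg_ll = -tf.reduce_sum(x * log_probs, axis=1)
    # KL divergence between N(mean, exp(logvar)) and the prior N(0, I).
    kl = -0.5 * tf.reduce_sum(
        1.0 + logvar - tf.square(mean) - tf.exp(logvar), axis=1)
    # Average over the batch.
    return tf.reduce_mean(neg_ll + beta * kl)

# Toy check: two items, uniform prediction, latent code equal to the prior.
x = tf.constant([[1.0, 0.0]])
loss = vae_loss(x, tf.zeros((1, 2)), tf.zeros((1, 3)), tf.zeros((1, 3)))
print(float(loss))
```

In the toy check the KL term is zero, so the loss reduces to the negative log-probability of the single picked item.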

Using the implementation above we can train the model using the Adam Optimizer. Below we added the NDCG@100 (defined in my last article) of the VAE for the validation set.
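One training step with the Adam optimizer could look like the sketch below. The model sizes, the learning rate, `beta`, and the random batch standing in for MovieLens click vectors are all assumptions.

```python
import tensorflow as tf

n_items, hidden_dim, latent_dim = 200, 64, 16  # small assumed sizes

encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(n_items,)),
    tf.keras.layers.Dropout(0.5),  # assumed rate
    tf.keras.layers.Dense(hidden_dim, activation="tanh"),
    tf.keras.layers.Dense(2 * latent_dim),
])
decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(latent_dim,)),
    tf.keras.layers.Dense(hidden_dim, activation="tanh"),
    tf.keras.layers.Dense(n_items),
])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)  # assumed rate

def train_step(x, beta=0.2):
    with tf.GradientTape() as tape:
        mean, logvar = tf.split(encoder(x, training=True), 2, axis=1)
        # Reparameterization trick: z = eps * sigma + mu.
        z = tf.random.normal(tf.shape(mean)) * tf.exp(0.5 * logvar) + mean
        log_probs = tf.nn.log_softmax(decoder(z, training=True), axis=1)
        neg_ll = -tf.reduce_sum(x * log_probs, axis=1)
        kl = -0.5 * tf.reduce_sum(
            1.0 + logvar - tf.square(mean) - tf.exp(logvar), axis=1)
        loss = tf.reduce_mean(neg_ll + beta * kl)
    variables = encoder.trainable_variables + decoder.trainable_variables
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return loss

# Random sparse binary vectors standing in for a batch of click vectors.
batch = tf.cast(tf.random.uniform((32, n_items)) < 0.05, tf.float32)
for step in range(5):
    loss = train_step(batch)
print(float(loss))
```

In a full training run this step would be called on every batch of every epoch, with the NDCG@100 evaluated on the validation set between epochs.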

Learning curve of the NDCG@100 on the validation set

We calculated the NDCG and the personalization on the test set:

  • NDCG@100: 0.413
  • Personalization: 0.102

Deep VAE

Now, we will test a new VAE architecture with a deeper encoder and decoder to see if we can actually improve the metrics.

First, we define a deep encoder. In our case, the model has 3 layers with decreasing size until the latent space.

The deep decoder net has the same architecture as the encoder, just oriented in the other direction.
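A sketch of a deeper encoder and its mirrored decoder. The layer widths and the tanh activations are assumptions carried over from the shallow model; only the overall shape (three dense layers shrinking toward the latent space, then the reverse) comes from the description above.

```python
import tensorflow as tf

n_items, latent_dim = 1000, 200  # assumed sizes

deep_encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(n_items,)),
    tf.keras.layers.Dropout(0.5),                    # assumed rate
    tf.keras.layers.Dense(600, activation="tanh"),   # assumed width
    tf.keras.layers.Dense(400, activation="tanh"),   # assumed width
    # Linear output holding the mean and the log-variance.
    tf.keras.layers.Dense(2 * latent_dim),
])

deep_decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(latent_dim,)),
    tf.keras.layers.Dense(400, activation="tanh"),
    tf.keras.layers.Dense(600, activation="tanh"),
    # One logit per item.
    tf.keras.layers.Dense(n_items),
])

encoded = deep_encoder(tf.zeros((2, n_items)))
decoded = deep_decoder(tf.zeros((2, latent_dim)))
print(encoded.shape, decoded.shape)
```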

Using the implementation above we can train the model using the Adam Optimizer. Below we added the NDCG@100 of the deep VAE for the validation set.

Learning curve of the NDCG@100 on the validation set

We calculated the NDCG and the personalization on the test set:

  • NDCG@100: 0.380
  • Personalization: 0.099

Deep Decoder VAE

One could also ask if using only a deep decoder or encoder could get better results than the deep VAE in which both of its parts are deep. To answer this, I first tested the deep decoder and a shallow encoder.

Deep Decoder VAE

I combined the shallow encoder and the deep decoder implemented above. I trained the model using the Adam Optimizer. Below I added the NDCG@100 of the Deep Decoder VAE for the validation set.

Learning curve of the NDCG@100 on the validation set

We calculated the NDCG and the personalization on the test set:

  • NDCG@100: 0.375
  • Personalization: 0.095

We get results slightly worse than with the Deep VAE.

Deep Encoder VAE

As we tested with only a deep decoder, we can now try with only a deep encoder.

I combined the deep encoder and the shallow decoder implemented above.

We can train the model using the Adam Optimizer. Below I added the NDCG@100 of the Deep Encoder VAE for the validation set.

Learning curve of the NDCG@100 on the validation set

We calculated the NDCG and the personalization on the test set:

  • NDCG@100: 0.403
  • Personalization: 0.108

We get results slightly better than with the Deep VAE.

Deep VAE Split Train

Finally, the last model related to VAE is the Deep VAE split train. As seen above, using only a deep encoder gives better results than using both a deep decoder and a deep encoder. Therefore, I will try an alternative way of training. First, I will train a deep encoder with a shallow decoder. Then, after a few epochs, I will swap the shallow decoder for a deeper one.

Hence I decomposed the training into 3 parts:

1. Train the deep encoder and the shallow decoder

2. Train the deep decoder with the deep encoder fixed (only the decoder is trained)

3. Train both the deep decoder and the deep encoder.
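The three phases above can be sketched by choosing which variables receive gradients in each phase. The helper `mlp`, the layer sizes, the number of steps per phase, and the use of a fresh Adam optimizer for each phase are assumptions of this sketch.

```python
import tensorflow as tf

n_items, latent_dim = 200, 16  # small assumed sizes

def mlp(in_dim, hidden, out_dim):
    """Hypothetical helper: tanh hidden layers followed by a linear output."""
    layers = [tf.keras.Input(shape=(in_dim,))]
    layers += [tf.keras.layers.Dense(h, activation="tanh") for h in hidden]
    layers.append(tf.keras.layers.Dense(out_dim))
    return tf.keras.Sequential(layers)

deep_encoder = mlp(n_items, [64, 32], 2 * latent_dim)
shallow_decoder = mlp(latent_dim, [64], n_items)
deep_decoder = mlp(latent_dim, [32, 64], n_items)

def train_step(x, decoder, variables, optimizer, beta=0.2):
    """One VAE step that only updates the variables passed in."""
    with tf.GradientTape() as tape:
        mean, logvar = tf.split(deep_encoder(x, training=True), 2, axis=1)
        z = tf.random.normal(tf.shape(mean)) * tf.exp(0.5 * logvar) + mean
        log_probs = tf.nn.log_softmax(decoder(z, training=True), axis=1)
        neg_ll = -tf.reduce_sum(x * log_probs, axis=1)
        kl = -0.5 * tf.reduce_sum(
            1.0 + logvar - tf.square(mean) - tf.exp(logvar), axis=1)
        loss = tf.reduce_mean(neg_ll + beta * kl)
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return loss

enc_vars = deep_encoder.trainable_variables
phases = [
    # 1. Deep encoder with the shallow decoder.
    (shallow_decoder, enc_vars + shallow_decoder.trainable_variables),
    # 2. Deep decoder only; the encoder is kept fixed.
    (deep_decoder, deep_decoder.trainable_variables),
    # 3. Deep encoder and deep decoder together.
    (deep_decoder, enc_vars + deep_decoder.trainable_variables),
]

# Random sparse binary vectors standing in for a batch of click vectors.
batch = tf.cast(tf.random.uniform((32, n_items)) < 0.05, tf.float32)
for decoder, variables in phases:
    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
    for _ in range(3):  # stands in for a few epochs
        loss = train_step(batch, decoder, variables, optimizer)
```

Passing an explicit variable list keeps the encoder weights untouched in phase 2 without having to toggle the `trainable` flag of the layers.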

Finally, we can train the model using the Adam Optimizer. Below I added the NDCG@100 of the Deep VAE Split Train for the validation set.

Learning curve of the NDCG@100 on the validation set

We calculated the NDCG and the personalization on the test set:

  • NDCG@100: 0.397
  • Personalization: 0.121

We get results slightly better than with the Deep VAE.

GAN

GANs were first introduced in 2014 by Ian Goodfellow et al. These models are composed of two parts: the discriminator and the generator. In the case of recommendation systems, the generator takes a click vector and outputs a recommendation vector. The discriminator takes those generated vectors and the original click vectors, and outputs whether each one is likely to be real or not.

The generator and the discriminator are in competition against each other. The learning is modeled as a zero-sum game. The learning of those models is hard in practice, with a lot of convergence issues.

GAN diagram

Vanilla GAN

In this section I will describe step by step how to implement a GAN with respect to the recommendation systems.

The generator and the discriminator are defined below, using tf.keras.Sequential. The generator is a NN with an input layer whose size is the number of items, a dense layer, and an output layer with a softmax activation. The discriminator is a NN with an input layer whose size is the number of items, a dense layer, and an output layer with a sigmoid activation.
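A minimal sketch of the two networks. The sizes and the ReLU activation of the hidden layers are assumptions; only the softmax and sigmoid output activations come from the description above.

```python
import tensorflow as tf

n_items = 1000    # assumed number of items in the catalogue
hidden_dim = 600  # assumed width of the hidden layers

generator = tf.keras.Sequential([
    tf.keras.Input(shape=(n_items,)),
    tf.keras.layers.Dense(hidden_dim, activation="relu"),  # assumed activation
    # Softmax: the recommendation vector is a distribution over the items.
    tf.keras.layers.Dense(n_items, activation="softmax"),
])

discriminator = tf.keras.Sequential([
    tf.keras.Input(shape=(n_items,)),
    tf.keras.layers.Dense(hidden_dim, activation="relu"),  # assumed activation
    # Sigmoid: probability that the input vector is a real click vector.
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

clicks = tf.zeros((4, n_items))  # dummy batch of click vectors
recommendations = generator(clicks)
realness = discriminator(clicks)
print(recommendations.shape, realness.shape)
```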

Next, I defined some helper functions. The generated vectors are masked by the user's click vector (the recommendation vector is multiplied element-wise by the click vector). This was useful during training to reduce convergence problems.

The loss is the classic GAN loss. The discriminator loss is the binary cross-entropy, where the positive labels are associated with the original click vectors and the negative labels with the generated recommendations.

Discriminator loss

The generator loss is also a binary cross-entropy, where the positive labels are associated with the output of the discriminator on the generated recommendations: the generator is rewarded when the discriminator takes its output for real.

Generator loss

Below, I implement those losses.
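A sketch of both losses, written directly on the discriminator's sigmoid outputs. The function names and the small constant `eps`, added to keep the logarithms finite, are assumptions of this sketch.

```python
import tensorflow as tf

def discriminator_loss(d_real, d_fake, eps=1e-7):
    """Binary cross-entropy: label 1 for real vectors, 0 for generated ones.

    d_real, d_fake: discriminator outputs in (0, 1) on real click vectors
    and on generated recommendations.
    """
    real_term = -tf.math.log(d_real + eps)
    fake_term = -tf.math.log(1.0 - d_fake + eps)
    return tf.reduce_mean(real_term + fake_term)

def generator_loss(d_fake, eps=1e-7):
    """Binary cross-entropy with label 1 on the generated recommendations."""
    return tf.reduce_mean(-tf.math.log(d_fake + eps))

# Toy check: an undecided discriminator that outputs 0.5 everywhere.
half = tf.fill((4, 1), 0.5)
d_loss = discriminator_loss(half, half)
g_loss = generator_loss(half)
print(float(d_loss), float(g_loss))
```

The generator loss shrinks as the discriminator's output on generated vectors approaches 1, while the discriminator loss shrinks as it separates the two kinds of vectors.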

The training is done in an alternating fashion using the Adam Optimizer. For each batch, we train the generator with the discriminator fixed (only the generator parameters are updated), and then train the discriminator with the generator fixed.
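One alternating step could be sketched as below, including the masking of the generated vectors by the click vectors. The model sizes, hidden activations, learning rates, and the random batch are assumptions.

```python
import tensorflow as tf

n_items, hidden_dim = 200, 64  # small assumed sizes
EPS = 1e-7                     # keeps the logarithms finite

generator = tf.keras.Sequential([
    tf.keras.Input(shape=(n_items,)),
    tf.keras.layers.Dense(hidden_dim, activation="relu"),
    tf.keras.layers.Dense(n_items, activation="softmax"),
])
discriminator = tf.keras.Sequential([
    tf.keras.Input(shape=(n_items,)),
    tf.keras.layers.Dense(hidden_dim, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
g_opt = tf.keras.optimizers.Adam(learning_rate=1e-4)  # assumed rate
d_opt = tf.keras.optimizers.Adam(learning_rate=1e-4)  # assumed rate

def train_step(clicks):
    # 1) Generator update: only the generator variables are modified.
    with tf.GradientTape() as tape:
        fake = generator(clicks, training=True) * clicks  # masking
        d_fake = discriminator(fake, training=False)
        g_loss = -tf.reduce_mean(tf.math.log(d_fake + EPS))
    g_vars = generator.trainable_variables
    g_opt.apply_gradients(zip(tape.gradient(g_loss, g_vars), g_vars))

    # 2) Discriminator update: the generated vectors are computed outside
    #    the tape, so the generator is treated as fixed.
    fake = generator(clicks, training=False) * clicks
    with tf.GradientTape() as tape:
        d_real = discriminator(clicks, training=True)
        d_fake = discriminator(fake, training=True)
        d_loss = -tf.reduce_mean(
            tf.math.log(d_real + EPS) + tf.math.log(1.0 - d_fake + EPS))
    d_vars = discriminator.trainable_variables
    d_opt.apply_gradients(zip(tape.gradient(d_loss, d_vars), d_vars))
    return g_loss, d_loss

# Random sparse binary vectors standing in for a batch of click vectors.
batch = tf.cast(tf.random.uniform((32, n_items)) < 0.05, tf.float32)
for step in range(3):
    g_loss, d_loss = train_step(batch)
print(float(g_loss), float(d_loss))
```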

Learning curve of the losses

Below, I plotted the NDCG@100 on the validation set.

Learning curve of the NDCG@100 on the validation set

We calculated the NDCG and the personalization on the test set:

  • NDCG@100: 0.260
  • Personalization: 0.443

WGAN-GP

GANs are known for their instability. To overcome this problem, a lot of research has gone into new loss functions and models. The Wasserstein GAN with Gradient Penalty (WGAN-GP) is one of them. The main differences from the vanilla GAN are the output of the discriminator and its loss. The output activation is no longer a sigmoid, so the output is unbounded. As for the loss, this model revolves around a new constraint on the discriminator D(x): it has to be a 1-Lipschitz function. A differentiable function is 1-Lipschitz if and only if the norm of its gradient is at most 1 everywhere. To enforce this constraint, we add a regularization term that penalizes the model if the norm of the gradient of the discriminator is far from 1 (for each sample, the gradient is only evaluated at one point, a random interpolation between a real and a generated vector).

The discriminator loss is now, with ε ~ U(0,1):

WGAN-GP discriminator loss
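A sketch of this loss, following the formulation of Gulrajani et al.: the difference between the mean scores on generated and real vectors, plus a penalty on the gradient norm at a random interpolation x̂ = εx + (1 − ε)x̃. The function name `critic_loss`, the small linear critic, and the toy data are assumptions; the penalty weight of 10 is the default value from the paper.

```python
import tensorflow as tf

def critic_loss(critic, real, fake, gp_weight=10.0):
    """WGAN-GP discriminator (critic) loss for one batch."""
    # One epsilon ~ U(0, 1) per sample, broadcast over the items.
    eps = tf.random.uniform((tf.shape(real)[0], 1), 0.0, 1.0)
    interpolated = eps * real + (1.0 - eps) * fake
    # Gradient of the critic output with respect to its input.
    with tf.GradientTape() as tape:
        tape.watch(interpolated)
        scores = critic(interpolated, training=True)
    grads = tape.gradient(scores, interpolated)
    grad_norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=1) + 1e-12)
    # Penalize gradient norms that are far from 1.
    penalty = tf.reduce_mean(tf.square(grad_norm - 1.0))
    wasserstein = (tf.reduce_mean(critic(fake, training=True))
                   - tf.reduce_mean(critic(real, training=True)))
    return wasserstein + gp_weight * penalty

n_items = 8  # small assumed size
critic = tf.keras.Sequential([
    tf.keras.Input(shape=(n_items,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),  # no sigmoid: the output is unbounded
])
real = tf.cast(tf.random.uniform((4, n_items)) < 0.3, tf.float32)
fake = tf.random.uniform((4, n_items))
loss = critic_loss(critic, real, fake)
print(float(loss))
```

To train the critic, this function would be called inside an outer `tf.GradientTape` so that the loss, penalty included, can be differentiated with respect to the critic's weights.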

The loss of the generator for the WGAN-GP is very similar to the GAN.

The training is similar to the GAN.

Learning curve of the losses

Below, I plotted the NDCG@100 on the validation set.

Learning curve of the NDCG@100 on the validation set

We calculated the NDCG and the personalization on the test set:

  • NDCG@100: 0.279
  • Personalization: 0.429

We see that we get a slightly better NDCG than with the GAN, with a slightly lower personalization.

Conclusion

I have shown how to implement various models with TensorFlow's eager mode. On MovieLens 20M, a simple shallow VAE gives a better NDCG than the deeper VAEs. The GAN models reach a lower NDCG than the VAEs and can easily collapse, in which case they need to be retrained from scratch. When thinking beyond research and putting those models in production, GANs seem impractical for recommendations that are automated and computed on a daily basis.

References

  1. Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, Aaron Courville, Improved Training of Wasserstein GANs, 2017.
  2. Dawen Liang, Rahul G. Krishnan, Matthew D. Hoffman, Tony Jebara, Variational Autoencoders for Collaborative Filtering, 2018.
  3. http://statweb.stanford.edu/~susan/courses/s200/lectures/lect11.pdf
  4. https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf
