Neural Collaborative Filtering

Supercharging collaborative filtering with neural networks

Abhishek Sharma
Towards Data Science


In the era of information explosion, recommender systems play a pivotal role in alleviating information overload, and they have been widely adopted by many online services, including e-commerce, streaming services, and social media sites. The main component of these recommender systems is collaborative filtering (CF) with implicit feedback. Matrix factorization is the most popular variant of collaborative filtering. It models user-item interactions with a fixed inner product of the user and item latent vectors. Neural Collaborative Filtering (NCF) replaces this inner product with a neural architecture. By doing so, NCF tries to achieve the following:

  1. NCF tries to express and generalize MF under its framework.
  2. NCF tries to learn user-item interactions through a multi-layer perceptron.

Let's start with the basics of recommendation systems.

  1. Collaborative filtering models user preference on items based on their past interactions.
  2. Matrix Factorisation (MF) represents users and items as vectors of latent features which are projected into a shared feature space. In this feature space, the user-item interactions can be modeled using the inner product of the user and item latent vectors. This vanilla implementation of MF can be enhanced by integrating it with neighbor-based models, combining it with topic models of item content, and extending it to factorization machines for general modeling of features.
  3. Implicit feedback indirectly reflects the user's preference through behaviors such as watching videos, purchasing products, and clicking items. Its advantage is that it is easier to collect and available in surplus. Its disadvantage is that there is no readily available negative feedback.
  4. Explicit feedback consists of ratings and reviews. It is direct feedback, and negative feedback (how much the user likes or dislikes a product) is readily available in the form of ratings.

Despite the effectiveness of matrix factorization for collaborative filtering, its performance is hindered by the simple choice of interaction function: the inner product. Its performance can be improved by incorporating user-item bias terms into the interaction function. This suggests that the simple multiplication of latent features (the inner product) may not be sufficient to capture the complex structure of user interaction data.

This calls for designing a better, dedicated interaction function for modeling the latent feature interactions between users and items. Neural Collaborative Filtering (NCF) aims to solve this by:

  1. Modeling user-item feature interactions through a neural network architecture. It utilizes a Multi-Layer Perceptron (MLP) to learn the user-item interaction function. This is an upgrade over MF, as an MLP can (theoretically) learn any continuous function and offers a high level of non-linearity (due to its multiple layers), making it well suited to learning user-item interactions.
  2. Generalizing and expressing MF as a special case of NCF. As MF is highly successful in the recommendation domain, doing this lends more credence to NCF.

In the next section, we will formally define the recommendation problem and create a basic template to solve it.

Problem Statement

Given a set of users U = {1, …, U}, a set of items I = {1, …, I}, and a log of the users' past preferences over items O = {(u, i, y)}, our goal is to recommend to each user u a ranked list of items that will maximize her/his satisfaction. Here y can be either 1 (Case 1) or 0 (Case 2).

Case 1: Observed entries:
It means that user u has interacted with item i, but it does not mean that u likes i.

Case 2: Unobserved entries:
It does not mean that user u dislikes item i. Unobserved entries could be just missing data.

One drawback of using implicit feedback is this natural scarcity of negative feedback.

Intuitively speaking, a recommendation algorithm estimates the scores of unobserved entries in Y (the user-item interaction matrix), which are then used to rank the items.

Formally speaking, they calculate

Equation 1: \hat{y}_{u,i} = f(u, i \mid \Theta)

where

\hat{y}_{u,i}: predicted score for the interaction between user u and item i
\Theta: model parameters
f (the interaction function): maps model parameters to the predicted score

In order to calculate \Theta, an objective function needs to be optimized. The two most popular loss functions for recommender systems are pointwise loss and pairwise loss.

  1. Pointwise loss follows a regression framework by minimizing the squared loss between the predicted score ŷ(u,i) and the target value y(u,i). In order to account for negative feedback, either all unobserved entries are considered negative examples or some unobserved entries are sampled as negative instances.
  2. Pairwise loss aims to rank observed entries higher than the unobserved ones. It achieves this by maximizing the margin between observed entries and unobserved entries.

To summarise, pairwise loss maximizes the margin between observed and unobserved entries in contrast to pointwise loss which aims to minimize the loss between predicted and target score.

The NCF framework parameterizes the interaction function f using neural networks to estimate ŷ(u,i). It supports both pointwise and pairwise learning.
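
To make the contrast concrete, here is a minimal sketch of the two loss styles in PyTorch. The pairwise variant follows the common BPR formulation, which is one way (not the only way) to realize a margin-maximizing pairwise ranking loss:

```python
import torch

def pointwise_squared_loss(y_hat, y):
    # Regression-style loss: pull predicted scores toward the 0/1 targets.
    return ((y - y_hat) ** 2).mean()

def pairwise_bpr_loss(pos_scores, neg_scores):
    # Ranking-style loss: push scores of observed (positive) entries
    # above those of unobserved (negative) entries.
    return -torch.log(torch.sigmoid(pos_scores - neg_scores)).mean()

# Toy usage with random scores for 4 training instances
y_hat = torch.rand(4)                   # predicted scores in [0, 1]
y = torch.tensor([1.0, 0.0, 1.0, 0.0])  # observed targets
print(pointwise_squared_loss(y_hat, y))
print(pairwise_bpr_loss(torch.rand(4), torch.rand(4)))
```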

Matrix Factorisation (MF)

MF models the user-item interactions through a scalar product of user-item latent vectors. In mathematical terms, it is represented as follows

Equation 2: \hat{y}_{u,i} = p_u^T q_i = \sum_{k=1}^{K} p_{u,k} \, q_{i,k}

where

\hat{y}_{u,i}: prediction score (see Equation 1)
p_u: latent vector for user u
q_i: latent vector for item i
K: the dimension of the latent space
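
As a quick illustration of Equation 2, the MF score is just a dot product of two K-dimensional vectors (random numbers below stand in for learned latent factors):

```python
import numpy as np

K = 8                           # dimension of the latent space
rng = np.random.default_rng(42)
p_u = rng.random(K)             # latent vector for user u
q_i = rng.random(K)             # latent vector for item i

y_hat = p_u @ q_i               # Equation 2: inner product
print(y_hat)
```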

The example in Figure 1 illustrates the possible limitation of MF caused by the use of a simple and fixed inner product to estimate complex user-item interactions in the low-dimensional latent space. Keep in mind the following 2 conditions while going through Figure 1.

  1. As MF maps users and items to the same latent space, the similarity between users can be measured via an inner product or the cosine of the angle between their latent vectors.
  2. The Jaccard coefficient is the ground truth (similarity of two users) that MF needs to recover:

s_{uv} = |R_u \cap R_v| / |R_u \cup R_v|, where R_u is the set of items user u has interacted with.
Figure 1: Let us first focus on the first three rows (users) in Figure 1a. It is easy to have s23(0.66) > s12(0.5) > s13(0.4). As such, the geometric relations of p1, p2, and p3 in the latent space can be plotted as in Figure 1b. Now, let us consider a new user u4, whose input is given as the dashed line in Figure 1a. We can have s41(0.6) > s43(0.4) > s42(0.2), meaning that u4 is most similar to u1, followed by u3, and lastly u2. However, if an MF model places p4 closest to p1 (the two options are shown in Figure 1b with dashed lines), it will result in p4 closer to p2 than p3, which unfortunately will incur a large ranking loss.
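
For reference, the Jaccard coefficient is easy to compute from raw interaction sets. The item sets below are made up for illustration (they are not the exact data behind Figure 1a), though they happen to reproduce a similarity of 0.5, like s12:

```python
def jaccard(r_u, r_v):
    # Similarity of two users = |items in common| / |items either has touched|
    r_u, r_v = set(r_u), set(r_v)
    return len(r_u & r_v) / len(r_u | r_v)

u1_items = {1, 2, 5}   # hypothetical interaction sets, for illustration only
u2_items = {1, 4, 5}
print(jaccard(u1_items, u2_items))  # 0.5
```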

One way to resolve the issue is to use a large number of latent factors K. But increasing K may adversely hurt the generalization of the model (e.g., overfitting), especially in sparse settings.

NCF overcomes this limitation by using a deep neural network (DNN) to learn the interaction function from data.

NCF Architecture

Figure 2: Neural Collaborative Filtering framework
  1. Input layer: binarised, sparse (one-hot) vectors identifying the user and the item, where:
    Item (i): the one-hot vector identifying item i
    User (u): the one-hot vector identifying user u
  2. Embedding layer: a fully connected layer that projects each sparse representation to a dense vector. The obtained user/item embeddings are the latent user/item vectors.
  3. Neural CF layers: a multi-layered neural architecture that maps the latent vectors to a prediction score.
  4. Output layer: returns the predicted score ŷ(u,i); training minimizes the pointwise or pairwise loss between ŷ(u,i) and its target value y(u,i).

NCF modifies Equation 1 in the following way:

Equation 3 (NCF's construct to calculate user-item interaction): \hat{y}_{u,i} = f(P^T v_u^U, Q^T v_i^I \mid P, Q, \Theta_f)

where

v_u^U, v_i^I: one-hot input vectors identifying user u and item i
P: latent factor matrix for users (size M × K)
Q: latent factor matrix for items (size N × K)
\Theta_f: model parameters of the interaction function f

Since f is formulated as an MLP, it can be expanded as

Equation 4 (modeling user-item interaction): f(P^T v_u^U, Q^T v_i^I) = \phi_{out}(\phi_X(\dots \phi_2(\phi_1(P^T v_u^U, Q^T v_i^I)) \dots))

where

\phi_{out}: mapping function of the output layer
\phi_x: mapping function of the x-th neural collaborative filtering layer

Equation 4 acts as the scoring function for NCF.

NCF’s loss function

The pointwise squared loss is represented as

L_{sqr} = \sum_{(u,i) \in \mathcal{Y} \cup \mathcal{Y}^-} w_{u,i} \, (y_{u,i} - \hat{y}_{u,i})^2

where

\mathcal{Y}: the set of observed interactions
\mathcal{Y}^-: all, or a sample of, unobserved interactions (the negative instances)
w_{u,i}: the weight of the training instance (a hyperparameter)

The squared loss can be justified by assuming that the observations are generated from a Gaussian distribution, which in our case does not hold. Moreover, the prediction score ŷ(u,i) should lie in [0,1] to represent the likelihood of the given user-item interaction. In short, we need a probabilistic approach for learning the pointwise NCF that pays special attention to the binary property of implicit data.

NCF uses a logistic/probit function at the output layer to address both issues.

With the above settings, the likelihood function is defined as:

p(\mathcal{Y}, \mathcal{Y}^- \mid P, Q, \Theta_f) = \prod_{(u,i) \in \mathcal{Y}} \hat{y}_{u,i} \prod_{(u,j) \in \mathcal{Y}^-} (1 - \hat{y}_{u,j})

Taking the negative log of the likelihood function gives

L = -\sum_{(u,i) \in \mathcal{Y}} \log \hat{y}_{u,i} - \sum_{(u,j) \in \mathcal{Y}^-} \log(1 - \hat{y}_{u,j}) = -\sum_{(u,i) \in \mathcal{Y} \cup \mathcal{Y}^-} \left[ y_{u,i} \log \hat{y}_{u,i} + (1 - y_{u,i}) \log(1 - \hat{y}_{u,i}) \right]

which is nothing but the cross-entropy loss (log loss).

By employing a probabilistic treatment, NCF transforms the recommendation problem to a binary classification problem

To account for negative instances, \mathcal{Y}^- is uniformly sampled from the unobserved interactions.
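
A minimal sketch of this training recipe: uniformly sample unobserved items as negatives, then score the batch with the binary cross-entropy above. The function and variable names here are illustrative, not taken from the paper's code:

```python
import numpy as np

def sample_negatives(observed_items, n_items, num_neg, rng):
    # Uniformly sample unobserved items to serve as negative instances.
    negatives = []
    while len(negatives) < num_neg:
        j = int(rng.integers(n_items))
        if j not in observed_items:
            negatives.append(j)
    return negatives

def log_loss(y_hat, y, eps=1e-8):
    # Binary cross-entropy between predicted probabilities and 0/1 labels.
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)).mean()

rng = np.random.default_rng(0)
print(sample_negatives({3, 7}, n_items=10, num_neg=4, rng=rng))
print(log_loss(np.array([0.9, 0.2, 0.6]), np.array([1.0, 0.0, 1.0])))
```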

In the next segment, we will see how GMF (a component of NCF) generalizes the MF framework.

Generalized Matrix Factorisation (GMF)

Under GMF, the predicted output of NCF can be expressed as

Equation 5: \hat{y}_{u,i} = a_{out}(h^T (p_u \odot q_i))

where

a_{out}: the activation function of the output layer
h: the edge weights of the output layer
\odot: the element-wise product of vectors

We can play with a_{out} and h to create multiple variations of GMF:

  1. a_{out} = identity and h = a uniform vector of 1s: this reduces exactly to vanilla MF.
  2. a_{out} = identity and h learned from data: a variant of MF that allows latent dimensions to carry varying importance.
  3. a_{out} = sigmoid and h learned from data: a nonlinear generalization of MF; this is the variation used in NCF.

As you can see from the variations above, GMF with an identity activation function and edge weights fixed to 1 is indeed MF. The other two variations are expansions on generic MF, and the last one, with sigmoid as activation, is the one used in NCF.

NCF uses GMF with sigmoid as the activation function and learns h (the edge weights) from the data with log loss.
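
Here is a minimal PyTorch sketch of GMF as just described. The embedding sizes are illustrative, and this is a sketch rather than the authors' reference implementation:

```python
import torch
import torch.nn as nn

class GMF(nn.Module):
    def __init__(self, n_users, n_items, k=8):
        super().__init__()
        self.P = nn.Embedding(n_users, k)     # user latent vectors
        self.Q = nn.Embedding(n_items, k)     # item latent vectors
        self.h = nn.Linear(k, 1, bias=False)  # learned edge weights h

    def forward(self, u, i):
        x = self.P(u) * self.Q(i)             # element-wise product p_u ⊙ q_i
        return torch.sigmoid(self.h(x))       # a_out = sigmoid (Equation 5)

model = GMF(n_users=100, n_items=500)
score = model(torch.tensor([0]), torch.tensor([42]))  # ŷ for user 0, item 42
```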

In the next segment, we will explain how NCF models the user-item interaction using an MLP.

Multi-Layer Perceptron (MLP)

NCF is an example of multimodal deep learning, as it contains data from two pathways: user and item. The most intuitive way to combine them is by concatenation. But a simple vector concatenation does not account for user-item interactions and is insufficient to model the collaborative filtering effect. To address this, NCF adds hidden layers on top of the concatenated user-item vectors (the MLP framework) to learn user-item interactions. This endows the model with a great deal of flexibility and non-linearity, and it is an upgrade over MF, which applies a fixed element-wise product to the latent vectors. More precisely, the MLP alters Equation 1 as follows.

Equation 6:

z_1 = \phi_1(p_u, q_i) = [p_u; q_i]
\phi_2(z_1) = a_2(W_2^T z_1 + b_2)
\dots
\phi_L(z_{L-1}) = a_L(W_L^T z_{L-1} + b_L)
\hat{y}_{u,i} = \sigma(h^T \phi_L(z_{L-1}))

where:

W_x: the weight matrix for the x-th layer
b_x: the bias vector for the x-th layer
a_x: the activation function for the x-th layer's perceptrons
p_u: the latent vector for user u
q_i: the latent vector for item i

NCF uses ReLU as an activation function for its MLP part.

Due to multiple hidden layers, the model has sufficient complexity to learn user-item interactions as compared to the fixed element-wise product of their latent vectors(MF way).
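
A matching PyTorch sketch of the MLP branch follows; the tower sizes are illustrative:

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, n_users, n_items, k=8, layers=(32, 16, 8)):
        super().__init__()
        self.P = nn.Embedding(n_users, k)
        self.Q = nn.Embedding(n_items, k)
        dims = [2 * k, *layers]
        blocks = []
        for d_in, d_out in zip(dims, dims[1:]):
            blocks += [nn.Linear(d_in, d_out), nn.ReLU()]  # a_x = ReLU
        self.tower = nn.Sequential(*blocks)
        self.h = nn.Linear(layers[-1], 1, bias=False)

    def forward(self, u, i):
        z1 = torch.cat([self.P(u), self.Q(i)], dim=-1)     # z1 = [p_u; q_i]
        return torch.sigmoid(self.h(self.tower(z1)))       # Equation 6

model = MLP(n_users=100, n_items=500)
score = model(torch.tensor([0]), torch.tensor([42]))
```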

NeuMF: A fusion of GMF and MLP

NCF has two components, GMF and MLP, with the following benefits:

  1. GMF applies a linear kernel to model user-item interactions, like vanilla MF.
  2. MLP uses multiple neural layers to learn nonlinear user-item interactions.

NCF combines these models to superimpose their desirable characteristics: it concatenates the outputs of GMF and MLP before feeding them into the NeuMF layer.

Important points to notice

  1. GMF/MLP have separate user and item embeddings. This is to make sure that both of them learn optimal embeddings independently.
  2. GMF replicates vanilla MF by taking the element-wise product of the user and item latent vectors.
  3. MLP takes the concatenation of user-item latent vectors as input.
  4. The outputs of GMF and MLP are concatenated in the final NeuMF(Neural Matrix Factorisation) layer.

The score function of Equation 1 is modeled as

\phi^{GMF} = p_u^G \odot q_i^G
\phi^{MLP} = a_L(W_L^T(\dots a_2(W_2^T [p_u^M; q_i^M] + b_2) \dots) + b_L)
\hat{y}_{u,i} = \sigma(h^T [\phi^{GMF}; \phi^{MLP}])

where

G: superscript marking the GMF embeddings
M: superscript marking the MLP embeddings
p: user embedding
q: item embedding

This model combines the linearity of MF and non-linearity of DNNs for modeling user-item latent structures through the NeuMF (Neural Matrix Factorisation) layer.
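
Putting the two branches together, here is an illustrative sketch of NeuMF, with separate embeddings per branch as noted above:

```python
import torch
import torch.nn as nn

class NeuMF(nn.Module):
    def __init__(self, n_users, n_items, k=8, layers=(32, 16, 8)):
        super().__init__()
        self.P_gmf = nn.Embedding(n_users, k)   # GMF embeddings
        self.Q_gmf = nn.Embedding(n_items, k)
        self.P_mlp = nn.Embedding(n_users, k)   # independent MLP embeddings
        self.Q_mlp = nn.Embedding(n_items, k)
        dims = [2 * k, *layers]
        blocks = []
        for d_in, d_out in zip(dims, dims[1:]):
            blocks += [nn.Linear(d_in, d_out), nn.ReLU()]
        self.tower = nn.Sequential(*blocks)
        # NeuMF output layer acts on the concatenated branch outputs
        self.h = nn.Linear(k + layers[-1], 1, bias=False)

    def forward(self, u, i):
        phi_gmf = self.P_gmf(u) * self.Q_gmf(i)
        phi_mlp = self.tower(torch.cat([self.P_mlp(u), self.Q_mlp(i)], dim=-1))
        return torch.sigmoid(self.h(torch.cat([phi_gmf, phi_mlp], dim=-1)))

model = NeuMF(n_users=100, n_items=500)
score = model(torch.tensor([0]), torch.tensor([42]))
```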

Due to the non-convex objective function of NeuMF, gradient-based optimization methods can only find locally-optimal solutions, which makes good weight initialization important. NCF therefore initializes GMF and MLP with pre-trained models, in two steps:

  1. Train GMF and MLP from scratch, with random initializations, until convergence.
  2. Use their model parameters to initialize the corresponding parts of NeuMF, concatenating the weights of the two models' output layers as

h \leftarrow [\alpha h^{GMF}; (1 - \alpha) h^{MLP}]

where

h^{GMF}: the h vector of the pre-trained GMF
h^{MLP}: the h vector of the pre-trained MLP
\alpha: a hyper-parameter determining the trade-off between the two pre-trained models

Two optimization details are worth noting:

  1. Adaptive Moment Estimation (Adam) adapts the learning rate for each parameter, performing smaller updates for frequent parameters and larger updates for infrequent ones. Adam yields faster convergence for both pre-trained models than vanilla SGD and relieves the pain of tuning the learning rate.
  2. After feeding the pre-trained parameters into NeuMF, the fused model is optimized with vanilla SGD rather than Adam. This is because Adam needs to save momentum information for updating parameters, and the initialization with pre-trained networks carries no such momentum information.
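
As a minimal illustration of the weight concatenation above (α and the h vectors here are random stand-ins, not pre-trained values):

```python
import torch

alpha = 0.5              # trade-off hyper-parameter (assumed value)
h_gmf = torch.rand(8)    # stand-in for the pre-trained GMF output weights
h_mlp = torch.rand(8)    # stand-in for the pre-trained MLP output weights

# h ← [α·h_GMF; (1 − α)·h_MLP]
h_neumf = torch.cat([alpha * h_gmf, (1 - alpha) * h_mlp])
```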

This completes the theory of NCF.

Python example

This segment uses the NCF implementation from Microsoft's recommenders library (see the references below). First, install the library by following the setup steps in its repository.

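Below is a condensed sketch of the notebook's flow using the recommenders package on MovieLens-100k. Module paths and the Dataset signature have varied across library versions, and the hyperparameter values are illustrative, so treat this as a guide to the steps rather than copy-paste code:

```python
import pandas as pd
from recommenders.datasets import movielens
from recommenders.datasets.python_splitters import python_chrono_split
from recommenders.models.ncf.dataset import Dataset as NCFDataset
from recommenders.models.ncf.ncf_singlenode import NCF

# Load MovieLens-100k ratings and peek at the raw data
df = movielens.load_pandas_df(size="100k",
                              header=["userID", "itemID", "rating", "timestamp"])
print(df.head())

# Chronological 75/25 split per user, then wrap in NCF's dataset object
train, test = python_chrono_split(df, 0.75)
data = NCFDataset(train=train, test=test, seed=42)

# NeuMF = GMF + MLP, as described above
model = NCF(n_users=data.n_users, n_items=data.n_items, model_type="NeuMF",
            n_factors=4, layer_sizes=[16, 8, 4], n_epochs=20,
            batch_size=256, learning_rate=1e-3, seed=42)
model.fit(data)

# Score every (user, item) pair in the test set
predictions = pd.DataFrame(
    [(r.userID, r.itemID, model.predict(r.userID, r.itemID))
     for r in test.itertuples()],
    columns=["userID", "itemID", "prediction"])
print(predictions.head())
```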

Conclusion

NCF learns a probabilistic model that emphasizes the binary property of implicit data. We discussed how MF can be expressed and generalized under NCF (using Generalized Matrix Factorisation, GMF). NCF explores the use of DNNs for collaborative filtering by using a multi-layer perceptron (MLP) to learn the user-item interaction function. Lastly, we discussed a new neural matrix factorization model called NeuMF, which ensembles MF and MLP under the NCF framework; it unifies the linearity of MF and the non-linearity of MLP for modeling the user-item latent structures. The last segment contains a working example of NCF, with code taken from the ncf_deep_dive notebook on GitHub. That repository contains many examples and is really useful; I highly recommend going through them.

References

  1. He et al., "Neural Collaborative Filtering", WWW 2017: https://arxiv.org/abs/1708.05031
  2. Microsoft recommenders library: https://github.com/microsoft/recommenders
  3. Code by the original authors: https://github.com/hexiangnan/neural_collaborative_filtering
