Clustering and Collaborative Filtering — Implementing Neural Networks

Part 1, where I explore the dataset and visualize it using the t-SNE algorithm:

Having explored the data, I now aim to implement a neural network to predict how users in the MovieLens dataset will rate movies. As a benchmark, I will use’s neural network, which just consists of two dense layers (and an embedding layer which I will discuss).


  1. Understanding’s neural network
  2. Improving it
  3. Visualizing results

1. Understanding’s neural network

The benchmark neural network consisted of 4 layers: two embedding layers, one for the users and one for the movies, followed by two dense layers where the computations took place.

The dense layers are fairly straightforward; ‘dense’ means that every node in the layer is connected to every node in the previous layer.

A dense layer; note how every node in the first layer provides input to every node in the second (dense) layer.
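To make the "every node connects to every node" idea concrete, here is a minimal NumPy sketch of a dense layer's forward pass. The layer sizes, the ReLU activation and the name `dense_forward` are my own assumptions for illustration; they are not taken from the benchmark network.

```python
import numpy as np

def dense_forward(x, weights, bias):
    """Forward pass of a dense layer with a ReLU activation (assumed here)."""
    # The matrix product mixes every input node into every output node.
    return np.maximum(0.0, x @ weights + bias)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))        # batch of 4 samples, 3 input nodes
weights = rng.normal(size=(3, 2))  # one weight per (input node, output node) pair
bias = np.zeros(2)                 # one bias per output node

out = dense_forward(x, weights, bias)
print(out.shape)  # (4, 2)
```

The weight matrix has 3 × 2 = 6 entries, one for each connection between the two layers, which is exactly what the diagram above shows.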

What’s more interesting are the embedding layers. These layers are crucial to the neural network’s performance, improving the MSE from 1.4 to 0.79. They allow a bias to be added to the neural network, which accounts for differences across users and across movies, on top of how users are rating movies.

For instance, in terms of movies, The Shawshank Redemption is universally acclaimed as a good movie; whether or not someone enjoys crime movies, they are more likely to enjoy this movie. Similarly, in terms of users, if Frank dislikes movies, he is going to rate movies lower even if they line up with his preferences. (This is also why a bias is better than just calculating the mean rating for a movie to gauge whether it’s good or bad.) Adding a bias term allows the network to consider these factors.

How does an embedding layer do this? An embedding layer maps each input (for the movies, this would be each movie’s ID) to a vector of learned weights. Because that vector is updated by every rating the movie receives, all users’ ratings are taken into account when learning it, so it captures the ‘goodness’ or ‘badness’ of the movie, as acknowledged by all users.

Note that the embedding layer needs to be trained; initially, this vector has random elements, but these are refined as the neural net sees more and more of the users’ ratings for each movie.
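As a rough illustration of what an embedding layer does mechanically, here is a NumPy sketch of the lookup step. The table sizes, the variable names and the way the bias is returned alongside the vectors are all assumptions of mine; the benchmark network may combine them differently.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, n_factors = 5, 7, 3  # hypothetical toy sizes

# Embedding tables start out random and are refined during training.
user_emb = rng.normal(scale=0.1, size=(n_users, n_factors))
movie_emb = rng.normal(scale=0.1, size=(n_movies, n_factors))

# One learnable bias per user and per movie (e.g. "Frank rates low",
# "this movie is liked by almost everyone").
user_bias = np.zeros(n_users)
movie_bias = np.zeros(n_movies)

def embed(user_ids, movie_ids):
    """Look up the vector and bias for each (user, movie) pair."""
    u = user_emb[user_ids]    # row lookup by integer ID
    m = movie_emb[movie_ids]
    b = (user_bias[user_ids] + movie_bias[movie_ids])[:, None]
    return np.concatenate([u, m], axis=1), b

features, bias = embed(np.array([0, 3]), np.array([2, 6]))
print(features.shape, bias.shape)  # (2, 6) (2, 1)
```

The concatenated features would then be fed to the dense layers, while the bias can be added to their output.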

So, now that I understand this neural network, how does it perform?

My metric for success here is the mean squared error (MSE); it tells me how far off the predicted rating is from the actual rating (squared). This neural net has an MSE of ~0.79, which corresponds to a root mean squared error of about 0.89, so a typical prediction is off by roughly 0.9 from the true rating.
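For reference, this is how the metric and the conversion back to rating units can be computed. The toy ratings below are made up purely to show the calculation; only the 0.79 figure comes from the results above.

```python
import math
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between true and predicted ratings."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(diff ** 2))

# Made-up ratings, just to show the calculation.
toy_mse = mse([4, 3, 5], [3.5, 3.5, 4])
print(toy_mse)  # 0.5

# Taking the square root puts the error back in rating units.
reported_mse = 0.79
rmse = math.sqrt(reported_mse)
print(round(rmse, 2))  # 0.89
```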

How does this compare to other recommender systems? Well, it’s not bad (these 4-year-old benchmarks have a best MSE of 0.89), but the network starts overfitting after the 6th epoch, which means there’s also room for improvement.

2. Improving the neural network

Link to code, where I also experiment with bidirectional RNNs

Given that the 4-layer neural network is already overfitting the data, I definitely don’t want to increase the complexity of the neural network. I’m also going to keep the embedding layers, because they dramatically improve the neural network’s performance.

The first change I’m going to introduce is to change the first dense layer to a recurrent layer.

2.1. Recurrent Neural Networks

The thing which makes recurrent neural networks unique is that they allow information to persist between samples, instead of just considering each sample independently. This is super intuitive when handling time series data (what happened at t-1 will be important in determining what is happening at t), but RNNs are very powerful with non-sequential data as well.

A node in a recurrent layer. Note how the node’s output is also used with the node’s input; this creates the ‘memory’ which makes RNNs so powerful.
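The feedback loop in the diagram can be sketched in a few lines of NumPy. Note that this is a plain (vanilla) recurrent step with a tanh activation, which I am using as a simplified stand-in; it is not the LSTM or GRU layer I actually tested, and the sizes and names are assumptions.

```python
import numpy as np

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One step of a vanilla recurrent layer: the new state depends on
    the current input and on the layer's own previous output."""
    return np.tanh(x_t @ w_x + h_prev @ w_h + b)

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4  # hypothetical sizes
w_x = rng.normal(size=(n_in, n_hidden))      # input -> hidden weights
w_h = rng.normal(size=(n_hidden, n_hidden))  # hidden -> hidden ("memory") weights
b = np.zeros(n_hidden)

h = np.zeros(n_hidden)                   # empty memory before the first input
for x_t in rng.normal(size=(5, n_in)):   # five inputs, seen one after another
    h = rnn_step(x_t, h, w_x, w_h, b)    # h carries information forward

print(h.shape)  # (4,)
```

LSTM and GRU layers keep this same loop but add gates that decide how much of `h` is kept or overwritten at each step.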

There are two types of recurrent layers which I tested: an LSTM layer, which uses gates to define whether or not information is remembered, and a GRU layer, which essentially does the same thing but with fewer gates (this fantastic blog post explains RNNs in more depth).

Implementing this change significantly increased the performance of my neural network, reducing validation loss from 0.7873 to 0.7699.

A comparison of mean squared validation error between the baseline neural network and a neural network with a dense layer replaced by a recurrent LSTM layer. In addition to replacing the dense layer with an LSTM layer, I also reduced the number of nodes in the network (to 50 nodes), introduced batch normalization and increased the dropout rate (to 0.8), all to prevent overfitting, which still occurs.

3. Looking at interim layers

Link to code, where clusters_axis1 is scenario 1 below (nodes which most activate each movie) and clusters_axis0 is scenario 2.

I was curious to see exactly what the neural network was uncovering in its interim layers. To do this, I created a ‘sliced’ neural network, which stopped at the layer I was interested in. I then added the pre-trained weights, and predicted the outputs.

Since my LSTM interim layer had 50 nodes, the output of this interim layer was an array of shape (number of validation ratings) × 50. I then organized the interim outputs in two ways:

  1. By the node which most activated each movie
  2. By the movies most activated by each node
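The two groupings amount to reducing the activation matrix along different axes. Here is a toy NumPy sketch, assuming for simplicity one row of activations per movie and only 3 nodes; the values and variable names are made up for illustration and are not the ones in my linked code.

```python
import numpy as np

# Toy activations: one row per movie, one column per node (made-up values).
activations = np.array([
    [0.90, 0.10, 0.30],  # movie 0
    [0.20, 0.80, 0.10],  # movie 1
    [0.70, 0.30, 0.60],  # movie 2
    [0.10, 0.40, 0.95],  # movie 3
])

# Scenario 1: for each movie, the node it activates most (reduce over axis 1).
top_node_per_movie = activations.argmax(axis=1)
print(top_node_per_movie)  # [0 1 0 2]

# Scenario 2: for each node, the k movies that activate it most (sort along axis 0).
k = 2
top_movies_per_node = np.argsort(-activations, axis=0)[:k].T
print(top_movies_per_node.tolist())  # [[0, 2], [1, 3], [3, 2]]
```

In scenario 1 every movie is assigned to exactly one node, whereas scenario 2 keeps only the strongest movies for each node and can list the same movie under several nodes.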

This should yield very similar movies, but the second method allows me to only see the maximally activated movies. Considering the top 20 movies most activated by each node (removing duplicates), let’s look at the first two nodes:

Movies: ['Silence of the Lambs, The (1991)', 'Lord of the Rings: The Return of the King, The (2003)', 'Dark Knight, The (2008)', 'Star Wars: Episode IV - A New Hope (1977)', 'Godfather, The (1972)', 'Back to the Future (1985)', 'Casino (1995)', '12 Angry Men (1957)', 'On the Waterfront (1954)', 'Naked Gun: From the Files of Police Squad!, The (1988)', 'Red (2010)', "Howl's Moving Castle (Hauru no ugoku shiro) (2004)", 'Repo Man (1984)', 'Visitor, The (2007)', 'Ghost in the Shell (Kôkaku kidôtai) (1995)', 'Porco Rosso (Crimson Pig) (Kurenai no buta) (1992)', 'For the Birds (2000)', 'Paperman (2012)']
Average Release Year: 1989.53333333
Average Rating: 4.17958836802
Movies: ['House on Haunted Hill (1999)', 'Speed 2: Cruise Control (1997)', 'Batman & Robin (1997)', 'Godzilla (1998)', 'Cats & Dogs (2001)', 'Battlefield Earth (2000)', 'Stop! Or My Mom Will Shoot (1992)', 'Richie Rich (1994)', 'Spy Kids 3-D: Game Over (2003)', '10,000 BC (2008)', 'Police Academy 5: Assignment: Miami Beach (1988)', 'Superman IV: The Quest for Peace (1987)', "Big Momma's House (2000)", 'Thirteen Ghosts (a.k.a. Thir13en Ghosts) (2001)', 'Soul Man (1986)', 'Message in a Bottle (1999)', "Eight Crazy Nights (Adam Sandler's Eight Crazy Nights) (2002)"]
Average Release Year: 1996.6
Average Rating: 1.72140425633

It’s clear that there’s a very strong divide along the rating of the movie (this continues for all the nodes), as well as release year. As when I was clustering before, it’s less clear that there is a divide along genres. This indicates that genres are less significant than how good a movie is in determining whether a user will rate it highly.

Plotting the data onto the t-SNE distribution didn’t yield anything meaningful, suggesting the two algorithms had different approaches to grouping the data.

4. (Bonus!) A deep and wide model

Link to code.

After creating the RNN, I came across this paper, which suggests a ‘deep and wide’ model will deliver the best suggestions for users.

Their reasoning is that the deep model I created will only suggest popular movies, for which the model has lots of data. This limits the diversity of recommendations (a problem which Spotify has also tackled); adding a ‘wide’ component to the model can help reduce this effect.

My shot at a wide and deep model. It’s not perfect, because feature engineering should be used on the wide layer to increase the number of inputs; I simply didn’t have the data to do this.
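To show the general idea of combining the two components, here is a minimal NumPy sketch of a wide and deep forward pass: a linear model over sparse one-hot inputs (wide) summed with a small network over embeddings (deep). This is a simplified illustration under my own assumptions (sizes, names, random untrained weights, no cross-product features), not a reproduction of the model in the diagram or of the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, n_factors, n_hidden = 5, 7, 3, 4  # hypothetical sizes

# Deep component: embeddings followed by one hidden layer.
user_emb = rng.normal(scale=0.1, size=(n_users, n_factors))
movie_emb = rng.normal(scale=0.1, size=(n_movies, n_factors))
w_hidden = rng.normal(scale=0.1, size=(2 * n_factors, n_hidden))
w_deep = rng.normal(scale=0.1, size=n_hidden)

# Wide component: one linear weight per sparse (one-hot) input feature.
w_wide = rng.normal(scale=0.1, size=n_users + n_movies)

def one_hot(ids, size):
    """Encode integer IDs as one-hot rows."""
    out = np.zeros((len(ids), size))
    out[np.arange(len(ids)), ids] = 1.0
    return out

def wide_and_deep(user_ids, movie_ids):
    """Sum the outputs of the deep and wide components."""
    deep_in = np.concatenate([user_emb[user_ids], movie_emb[movie_ids]], axis=1)
    deep_out = np.maximum(0.0, deep_in @ w_hidden) @ w_deep
    wide_in = np.concatenate(
        [one_hot(user_ids, n_users), one_hot(movie_ids, n_movies)], axis=1
    )
    wide_out = wide_in @ w_wide
    return deep_out + wide_out

preds = wide_and_deep(np.array([0, 3]), np.array([2, 6]))
print(preds.shape)  # (2,)
```

With proper feature engineering, the wide input would also include hand-crafted combinations of features, which is the part I could not build with the data I had.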

Surprisingly, this model had much less overfitting, but ultimately was slightly worse than the LSTM model I trained above:

Training the deepwide model. The model took longer to train because I had to use a reduced learning rate, as it had a tendency to blow up. This model had a validation loss of 0.7890.

This is similar to the result that the researchers from Google got in the deep and wide paper: the AUC score of their deep and wide model was slightly worse than that of their deep model, but the deep and wide model had a slightly better ‘online acquisition gain’. So this model should be better at suggesting interesting movies to you!