Clustering and Collaborative Filtering — Visualizing clusters using t-SNE

Gabriel Tseng
Jun 12, 2017 · 5 min read


I’m going to explore clustering and collaborative filtering using the MovieLens dataset.

To begin with, I’m going to get a feel for the data by visualizing it and understanding where the clusters are. The full code for this can be found here.

Table of contents:
1. Setup

2. Dimensionality reduction

3. Investigating the clusters

1. Setup

First, let’s take a look at what the MovieLens dataset contains. Downloading the data from MovieLens gets me four CSV files:

links.csv, which contains the IMDb ID numbers of all the movies in the dataset.

tags.csv, which contains tags given by users to movies (e.g. ‘sufficiently explodey to be good’).

movies.csv, which contains the names, years and genres of all the movies in the dataset:

A peek at movies.csv

ratings.csv, which contains a list of ratings: the userId of the rater, the movieId of the movie they rated, the rating they gave, and when they gave it:

A peek at ratings.csv

In order to cluster the data using sklearn, I wanted to make a 2-dimensional array, with all of the movies as rows and the users as columns. However, this created a problem of missing data, since not every user will have rated every movie. To solve this, I made a 3D array, where the length of the third dimension is equal to the number of possible ratings. Then, for each slice i along the third dimension, I filled the value with 1 if the user gave the movie the i-th possible rating (i.e. a rating of i × 0.5), and 0 if not. Then, I reshaped my data into one 2D array with users × ratings columns.

A graph of the 3D array I created. I then concatenated the arrays horizontally, so that there were users × ratings columns; these constituted the features for each movie.
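Here’s a minimal sketch of that encoding, assuming the ratings have been loaded from ratings.csv with pandas (the variable names are illustrative, not taken from the original code):

import numpy as np
import pandas as pd

ratings = pd.read_csv('ratings.csv')

# Map the (non-contiguous) MovieLens IDs to contiguous array indices.
movie_index = {m: i for i, m in enumerate(ratings['movieId'].unique())}
user_index = {u: i for i, u in enumerate(ratings['userId'].unique())}
n_movies, n_users = len(movie_index), len(user_index)  # 9066, 671

# One slice per possible rating: 0.5, 1.0, ..., 5.0 -> 10 slices.
one_hot = np.zeros((n_movies, n_users, 10), dtype=np.int8)
for row in ratings.itertuples():
    rating_slice = int(row.rating * 2) - 1  # 0.5 -> 0, 5.0 -> 9
    one_hot[movie_index[row.movieId], user_index[row.userId], rating_slice] = 1

# Flatten to one row per movie, with users * ratings columns.
movie_features = one_hot.reshape(n_movies, n_users * 10)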

Since there are 10 possible ratings (increments of 0.5, from 0.5 to 5), for 9066 movies and 671 users, I had a 9066 by 6710 matrix. To visualize this, I needed to reduce it to a 9066 by 2 matrix. I used two methods to do this.

2. Dimensionality Reduction

I’ll be using two methods to reduce the dimensionality of the data: Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE).

PCA (essentially) linearly projects the data onto a plane of a lower dimension, chosen so that as much of the variance of the original data as possible is preserved. The orientation of the lower-dimensional plane therefore reflects the structure of the higher-dimensional data.

t-SNE is a non-linear algorithm which considers the similarity of different objects, and uses this to decide on their distance in the 2D (or 3D) plane. The pairwise similarities are expressed as a probability distribution (where similar objects have a high chance of being picked as neighbours), and the points are then placed in the lower dimension so that this distribution matches the higher-dimensional one as closely as possible.

A comparison of PCA dimensionality reduction (on the right) and t-SNE (on the left). Clusters are much clearer when the dimensionality is reduced using t-SNE.

In terms of implementation, I used PCA to reduce the dimensionality of the dataset to 50 (as is commonly recommended) before using t-SNE to reduce it to 2. There are some hyperparameters to tune when implementing t-SNE, most significantly perplexity. Perplexity is (loosely) a measure of the effective number of neighbours each point has, and it significantly affects the algorithm’s output:

The two dimensional outputs of t-SNE with different values of perplexity, between 5 and 100.

The recommended values of perplexity are between 5 and 50; since a perplexity of 50 yielded well-defined clusters, I ran the t-SNE algorithm at a perplexity of 50 for 5000 iterations to ensure it converged. This created the two-dimensional dataset I then analysed.
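In code, the two-step reduction looks roughly like this (a sketch using scikit-learn, with movie_features being the 9066 × 6710 matrix built earlier):

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# First reduce to 50 dimensions with PCA, as is commonly recommended
# before running t-SNE on high-dimensional data.
features_50d = PCA(n_components=50).fit_transform(movie_features)

# Then reduce to 2 dimensions with t-SNE; perplexity=50 gave the
# clearest clusters, and 5000 iterations helps ensure convergence.
# (In recent scikit-learn versions, n_iter has been renamed max_iter.)
tsne = TSNE(n_components=2, perplexity=50, n_iter=5000, random_state=0)
features_2d = tsne.fit_transform(features_50d)  # shape: (9066, 2)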

3. Investigating the clusters

Initially, zooming into the clusters didn’t seem too promising.

This seems suspicious. Why is Helter Skelter, a 1976 documentary about Charles Manson, clustered with Sing Street, a 2016 Drama/Romance about a Dublin teenager starting a band to win over an aspiring model?

However, diving further in did reveal some method to the madness. To get a baseline, I began by characterizing the entire dataset, finding the distribution of genres, ratings and release years across all the movies. This yielded the following:

Average_release_Year = 1991.873738
Average_rating = 3.29205425251
Average_genre_distribution = [('Drama', 4328), ('Comedy', 3307), ('Thriller', 1717), ('Action', 1543), ('Romance', 1541), ('Adventure', 1116), ('Crime', 1092), ('Horror', 872), ('Sci-Fi', 791), ('Fantasy', 653), ('Children', 582), ('Mystery', 537), ('Documentary', 487), ('Animation', 447), ('Musical', 394), ('War', 366), ('Western', 168), ('IMAX', 153), ('Film-Noir', 121), ('(no genres listed)', 17)]

where the genre distribution shows the genres (in order of popularity) and the number of movies assigned that genre.
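These summary statistics only take a few lines of pandas; here is a sketch, relying on the fact that MovieLens stores genres as pipe-separated strings and embeds the release year in the title (how exactly the original code computed them is an assumption on my part):

from collections import Counter
import pandas as pd

movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')

# Release year is embedded in the title, e.g. 'Toy Story (1995)'.
years = movies['title'].str.extract(r'\((\d{4})\)')[0].astype(float)
print(years.mean())              # ~1991.87

print(ratings['rating'].mean())  # ~3.29

# Genres are pipe-separated, e.g. 'Adventure|Animation|Children'.
genre_counts = Counter(g for gs in movies['genres'] for g in gs.split('|'))
print(genre_counts.most_common())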

Since I care about the difference from the mean, I then found the genres in each cluster which differed from this order. This means that if a cluster has the following genre distribution:

Cluster_genre_distribution = [('Drama', 117), ('Comedy', 115), ('Action', 76), ('Thriller', 47), ('Romance', 41), ('Crime', 40), ('Adventure', 39), ('Sci-Fi', 36), ('Children', 16), ('Fantasy', 13), ('IMAX', 13), ('Mystery', 12), ('Horror', 11), ('Documentary', 11), ('Musical', 9), ('Animation', 9), ('War', 3), ('Western', 2), ('Film-Noir', 1)]

Then, my ‘difference array’ would be

difference_array = [(2, 3, 'Action'), (3, 2, 'Thriller'), (5, 6, 'Crime'), (6, 5, 'Adventure'), (7, 8, 'Sci-Fi'), (8, 10, 'Children'), (10, 17, 'IMAX'), (12, 7, 'Horror'), (13, 12, 'Documentary'), (15, 13, 'Animation'), (16, 15, 'War'), (17, 16, 'Western')]

where each element is

(genre_index_in_cluster, genre_index_in_total, genre)

(Indices are zero-based: (2, 3, 'Action') means ‘Action’ is at index 2 — the third most common genre — in the cluster, but at index 3 among all the movies.)
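A minimal sketch of how such a difference array can be built, assuming cluster_genres and total_genres are (genre, count) lists sorted by count, like the ones above (the function name is my own):

def difference_array(cluster_genres, total_genres):
    # Map each genre to its zero-based rank across the whole dataset.
    total_rank = {genre: i for i, (genre, _) in enumerate(total_genres)}
    diffs = []
    for cluster_rank, (genre, _) in enumerate(cluster_genres):
        rank_in_total = total_rank.get(genre)
        if rank_in_total != cluster_rank:
            diffs.append((cluster_rank, rank_in_total, genre))
    return diffs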

I identified the clusters manually, since rather than clustering the entire dataset I only wanted to investigate the areas which clearly stood out. Taking the top 3 elements from each cluster’s difference array yielded the following pretty picture:

Each annotation has the average year of release, the average rating, and the first 3 genres in the ‘difference array’. Across the entire dataset, the average release year of a movie is 1992, and the average rating is 3.29.

Only taking the first 3 genres which differ is clearly imperfect, as it favours the most common genres even though many of the differences are likely to be among the less common ones; still, it provides some clarity as to how the movies are distributed.

There isn’t a tremendous deviation in terms of ratings (except for a particularly bad batch of movies in the centre which have an average rating of 2.94, more than 2.5 standard deviations from the mean), but the year of release does seem to influence the clusters.

Part 2, where I use neural networks to implement a recommender system:

https://medium.com/@gabrieltseng/clustering-and-collaborative-filtering-implementing-neural-networks-bccf2f9ff988
