Recommending Movies through Collaborative Recommendation

Taiho Higaki
4 min readNov 21, 2022

--

Recommendation algorithms are everywhere in our lives — from social media views and online shopping. It allows platforms to suggest content/items to users based on their preferences. One type of recommendation engine is a collaborative system — it takes in user ratings for products and recommends similar items that a user may like. Today, I replicated this article by Dr. Vaibhav Kumar, which recommends similar movies to the user. (This is my notebook on Google Colab)

Image from The Motley Fool

The way code calculates similarity is by using cosine similarity between two sequences. (also known as the Orchini similarity or the Tucker coefficient of congruence). For defining cosine similarity, the sequences are viewed as vectors in an inner product space, and the cosine similarity is defined as the cosine of the angle between them. It is originally defined by the Euclidean dot product formula, A・B = ||A|| ||B|| cos θ. Derived from this, cosine similarity is defined to be (A・B)/||A|| ||B||.

import numpy as npimport pandas as pddata = pd.read_csv('ratings.dat', names=['user_id', 'movie_id', 'rating', 'time'], engine='python', delimiter='::')movie_data = pd.io.parsers.read_csv('movies.dat', names=['movie_id', 'title', 'genre'], engine='python', delimiter='::')

The Python libraries I used are NumPy and pandas. The dataset comes from the MovieLens 1M Dataset.

Note: Because I got an error on line 3114 of the dataset, I have manually modified the movie name in the dataset.

ratings_mat = np.ndarray(shape=(np.max(data.movie_id.values), np.max(data.user_id.values)), dtype=np.uint8)ratings_mat[data.movie_id.values-1, data.user_id.values-1] = data.rating.valuesnormalised_mat = ratings_mat - np.asarray([(np.mean(ratings_mat, 1))]).T

Here, I make the rating matrix with rows as movies and columns as users, then normalize the matrix.

A = normalised_mat.T / np.sqrt(ratings_mat.shape[0] - 1)U, S, V = np.linalg.svd(A)

Here, Singular Value Decomposition (SVD) is used as a dimensionality reduction technique. SVD is a matrix factorization technique, which reduces the number of features in a dataset by reducing the dimension from M-dimension to N-dimension, where N < M.

In a recommendation algorithm, the SVD is the highlight of collaborative filtering. It uses a matrix structure where the rows represent a user, and each column represents an item. The elements in this matrix are the ratings that users give to items. The factorization of this matrix is done by SVD. The formula is as follows:

A is an m × n utility matrix, U is an m × r orthogonal left singular matrix, which represents the relationship between users and latent factors. S is an r × r diagonal matrix, which describes the strength of each latent factor. Finally, V is an r × n diagonal right singular matrix, which indicates the similarity between items and latent factors. The latent factors here are the characteristics of the items. The SVD method decreases the dimension of A (the utility matrix) by extracting its latent factors.

def top_cosine_similarity(data, movie_id, top_n=10):
index = movie_id - 1 # Movie id starts from 1 in the dataset
movie_row = data[index, :]
magnitude = np.sqrt(np.einsum('ij, ij -> i', data, data))
similarity = np.dot(movie_row, data.T) / (magnitude[index] * magnitude)
sort_indexes = np.argsort(-similarity)
return sort_indexes[:top_n]

This function calculates the cosine similarity, sorting by the most similar then returning the top N movies. It returns the dot product between movie_row and the transpose of the data array, divided by the magnitudes of the two arrays.

def print_similar_movies(movie_data, movie_id, top_indexes):
print('Recommendations for {0}: \n'.format(
movie_data[movie_data.movie_id == movie_id].title.values[0]))
for id in top_indexes + 1:
print(movie_data[movie_data.movie_id == id].title.values[0])
k = 50
movie_id = 10
top_n = 10
sliced = V.T[:, :k]
indexes = top_cosine_similarity(sliced, movie_id, top_n)
print_similar_movies(movie_data, movie_id, indexes)

print_similar_movies() print the top N movies that is the most similar. finally, we use k-principal components to represent movies, then find the most similar movies using cosine similarity.

This is the result that I got for GoldenEye, a spy film, directed by Martin Campbell in 1995. Now, compare this to Rush Hour: a comedy film centring around kidnapping and the police department. They are pretty similar, so we can confirm that our model works!

Key Takeaways:

  • How SVD and Cosine Similarity works — I will be able to use these for future projects
  • Reviewing linear algebra on matrix multiplication and factorization

--

--