Design a Movie Recommendation System Using a Graph Database (neo4j): Part 1

Oct 28, 2018

If you want to design a recommendation system based not only on the similarity of movies but also on user behavior, I think a graph database is a very good fit. In this article, Python is used to build a data pipeline that prepares the data for ingestion into a graph database (neo4j).

We use cosine similarity to calculate how similar one movie is to another. First, we will create a movies matrix and define a recommendation function that uses it. After checking that the recommendations look reasonable, we will prepare the other datasets. In total, we will create 7 datasets to build the recommendation database in the graph db.

I work with the movielens data. It can be downloaded from the grouplens website (https://grouplens.org/datasets/movielens/). I try to explain every step in detail. Here is the roadmap:

1- Introduction
2- Load movielens data
3- Prepare data for movies matrix
4- Check recommendations
5- Prepare other datasets
6- Import data to neo4j
7- Query for recommendation

1- Introduction

The basic structure of the movie recommendation database is something like the one in the picture. We find the movies which the user has already watched and then find similar movies which the user has not watched yet. Before we recommend a movie, we also check whether its genres include the user’s favorite genre. We keep the ones that do and sort them by their ratings. Finally, we take the 5 best results.
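The lookup described above can be sketched in plain Python before any graph database is involved. This is only an illustration on made-up data: the function name `recommend`, its parameters and the toy dictionaries are assumptions for this sketch, and the real lookup is done with a graph query in part 2.

```python
def recommend(user_id, watched, similar, genres, favorite_genre, rating_mean, top_n=5):
    """Return up to top_n unseen movies that are similar to watched ones
    and contain the user's favorite genre, best rated first."""
    seen = watched[user_id]
    # gather every movie that is similar to something the user watched
    candidates = set()
    for movie_id in seen:
        candidates.update(similar.get(movie_id, []))
    # keep only movies the user has not watched yet
    candidates -= seen
    # keep only movies that include the user's favorite genre
    fav = favorite_genre[user_id]
    matching = [m for m in candidates if fav in genres.get(m, ())]
    # sort by mean rating, highest first
    matching.sort(key=lambda m: rating_mean.get(m, 0.0), reverse=True)
    return matching[:top_n]

# toy data (hypothetical ids and values)
watched = {'u1': {1, 2}}
similar = {1: [3, 4, 2], 2: [5, 3]}
genres = {3: {'Comedy'}, 4: {'Drama'}, 5: {'Comedy', 'Romance'}}
favorite_genre = {'u1': 'Comedy'}
rating_mean = {3: 3.4, 4: 4.5, 5: 4.1}

toy_recommendations = recommend('u1', watched, similar, genres, favorite_genre, rating_mean)
print(toy_recommendations)
```

In this toy run movie 4 has the highest rating but is dropped because it does not contain the favorite genre, which is the effect of the genre check described above.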

We create 7 datasets to build this structure: 3 of them for nodes (Users, Movies, Genres) and the other 4 for relationships. You can find the details of the tables below:

Nodes
1- users (userId): It contains the users’ ids and has only one column. It is created from the “ratings.csv” data. We create the “Users” node, which will have relationships with the “Movies” and “Genres” nodes.
2- movies (movieId, title, rating_mean): It contains the movies’ id, title and rating_mean fields. It is created from the “movies.csv” data. The “Movies” node will have relationships with the “Users” and “Genres” nodes, and it also has a relationship to itself based on similarity.
3- genres (genres): It is a small dataset with 19 rows that holds the genres.

Relationships
1- users_movies (userId, movieId, rating): It is used to create the relationship between the “Users” and “Movies” nodes.
2- movies_genres (movieId, genres): It is used to create the relationship between the “Movies” and “Genres” nodes.
3- users_genres (userId, genres): It is used to create the relationship between the “Users” and “Genres” nodes. “genres” is a calculated field that holds each user’s favorite genre. To calculate the favorite genre, I count the genres of the movies the user has already watched. I considered using the movies’ ratings, but after some checks I decided to use counts.
4- movies_similarity (movieId, sim_movieId, relevance): It is the most critical dataset in this pipeline. It contains 5 rows for each movie, holding the ids of its most similar movies. I calculate similarity between movies using 3 groups of similarity: tag similarity, genre similarity and rating (rating, year, rating count) similarity. Then I mix them. This is explained in detail below.
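The count-based favorite genre used for users_genres could look roughly like the sketch below. It is not the code from this series: the function name `favorite_genre_by_count` and the toy input are assumptions, and only the idea (count the genres of watched movies and take the most frequent one) comes from the description above.

```python
from collections import Counter

def favorite_genre_by_count(watched_genre_strings):
    """Return the most frequent genre among the watched movies, or None if there are none.

    Each input item is a pipe-separated genres string as found in movies.csv.
    """
    counts = Counter()
    for genre_string in watched_genre_strings:
        counts.update(genre_string.split('|'))
    if not counts:
        return None
    # most_common(1) returns a list with one (genre, count) pair
    return counts.most_common(1)[0][0]

# genres of four movies watched by one hypothetical user
toy_watched = ['Action|Thriller', 'Comedy', 'Action|Comedy', 'Action']
print(favorite_genre_by_count(toy_watched))
```

When two genres have the same count, this sketch simply returns the one that was seen first; a real pipeline may want an explicit tie-break rule.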

2- Load movielens data

Import modules

import pandas as pd
import numpy as np
import datetime
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity

We use 3 datasets: ‘genome-scores.csv’, ‘movies.csv’ and ‘ratings.csv’

genome_scores_data = pd.read_csv('genome-scores.csv')
movies_data = pd.read_csv('movies.csv')
ratings_data = pd.read_csv('ratings.csv')

Review data

genome_scores_data.head()

movies_data.head()

ratings_data.head()

3- Prepare data for movies matrix

We create 3 dfs (mov_tag_df, mov_genres_df, mov_rating_df) and calculate a cosine similarity matrix for each of them. After that, we mix the three matrices to obtain the movies matrix, using a weighted sum (similarity of mov_tag_df*0.5 + similarity of mov_genres_df*0.25 + similarity of mov_rating_df*0.25). In this case, tags are the most important data for calculating similarity, so they should affect the similarity calculation more than the others. Ok, let’s start creating the dfs:

mov_tag_df

mov_tag_df is created from “genome-scores.csv”. We pivot the data so that we can compare movies through their tags.

scores_pivot = genome_scores_data.pivot_table(index = ["movieId"],columns = ["tagId"],values = "relevance").reset_index()

scores_pivot.head()
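To see what the pivot does, here is a small sketch on made-up rows in the same long format as genome-scores.csv. The frame `toy_scores` and its values are assumptions for illustration only.

```python
import pandas as pd

# made-up rows: one line per (movie, tag) pair, like genome-scores.csv
toy_scores = pd.DataFrame({
    'movieId':   [1, 1, 1, 2, 2, 2],
    'tagId':     [1, 2, 3, 1, 2, 3],
    'relevance': [0.10, 0.80, 0.30, 0.70, 0.05, 0.20],
})

# after the pivot there is one row per movie and one column per tag
toy_pivot = toy_scores.pivot_table(index=['movieId'], columns=['tagId'], values='relevance').reset_index()
print(toy_pivot)
```

Each movie becomes a single row vector of tag relevances, which is the shape cosine similarity needs later on.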

We need to join “scores_pivot” df with “movies_data” df to get all movieIds. Then we fill null values and drop columns which are not used

#join
mov_tag_df = movies_data.merge(scores_pivot, left_on='movieId', right_on='movieId', how='left')

#fill null values and drop unused columns
mov_tag_df = mov_tag_df.fillna(0)
mov_tag_df = mov_tag_df.drop(['title','genres'], axis = 1)

mov_tag_df.head()
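Because the join is a left join, movies that have no tag scores end up as rows of zeros after fillna(0). A quick way to see which movies are affected is the `indicator` option of `merge`, sketched here on toy frames (`toy_movies` and `toy_pivot` are made-up stand-ins for movies_data and scores_pivot).

```python
import pandas as pd

toy_movies = pd.DataFrame({'movieId': [1, 2, 3], 'title': ['A', 'B', 'C']})
# movie 2 has no tag scores in this toy example
toy_pivot = pd.DataFrame({'movieId': [1, 3], 1: [0.1, 0.7], 2: [0.8, 0.2]})

# indicator=True adds a "_merge" column that tells where each row came from
merged = toy_movies.merge(toy_pivot, on='movieId', how='left', indicator=True)
missing_tags = merged.loc[merged['_merge'] == 'left_only', 'movieId'].tolist()

# same clean-up as above: drop helper columns, then replace missing scores with 0
toy_tag_df = merged.drop(columns=['title', '_merge']).fillna(0)
print(missing_tags)
print(toy_tag_df)
```

It is worth knowing how many movies fall into this group, since an all-zero tag vector carries no tag information into the similarity step.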

mov_genres_df

mov_genres_df is created from “movies.csv”. We split the genres field of each movie and then create one column per genre. We define a function that splits the genres field and checks whether a given genre is in it

def set_genres(genres,col):
    if genres in col.split('|'): return 1
    else: return 0

Now we are ready to create genres columns

#start from a copy of movies_data so the original df stays untouched
mov_genres_df = movies_data.copy()

#the 19 genre values used in this article
genre_list = ["Action", "Adventure", "Animation", "Children", "Comedy", "Crime", "Documentary",
              "Drama", "Fantasy", "Film-Noir", "Horror", "Musical", "Mystery", "Romance",
              "Sci-Fi", "Thriller", "War", "Western", "(no genres listed)"]

#create one 0/1 column per genre
for genre in genre_list:
    mov_genres_df[genre] = mov_genres_df.apply(lambda x: set_genres(genre, x['genres']), axis=1)
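As an alternative to one apply call per genre, pandas can build all indicator columns in one step with `Series.str.get_dummies`. The sketch below uses a made-up `toy_movies` frame; it is a different route to the same kind of 0/1 columns, not the code used in this article.

```python
import pandas as pd

# made-up rows in the same format as movies.csv
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Heat (1995)', 'Untitled (2001)'],
    'genres': ['Adventure|Animation|Children', 'Action|Crime|Thriller', '(no genres listed)'],
})

# split on "|" and create one 0/1 column per genre value found in the data
genre_flags = toy_movies['genres'].str.get_dummies(sep='|')
toy_genres_df = pd.concat([toy_movies[['movieId']], genre_flags], axis=1)
print(toy_genres_df)
```

One difference to keep in mind: this creates a column for every genre value that appears in the file, so the column set can differ from a hand-written genre list.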

Drop the columns which are not needed anymore

mov_genres_df.drop(['title','genres'], axis = 1, inplace=True)

mov_genres_df.head()

mov_rating_df

mov_rating_df is created from “movies.csv” and “ratings.csv”. It contains year, rating and rating count information. We extract the year from the “title” field and then bucket the years into groups 0 to 5. We calculate the mean and the count of ratings for each movie and then bucket the rating counts into groups 0 to 5 as well. We group years and rating counts to reduce their scale, which helps to calculate a better similarity. First, we define a function to extract the year

def set_year(title):
    year = title.strip()[-5:-1]
    #str.isdigit() replaces the Python 2 unicode(...).isnumeric() check
    if year.isdigit(): return int(year)
    else: return 1800

#add year field
movies = movies_data.drop('genres', axis = 1)
movies['year'] = movies.apply(lambda x: set_year(x['title']), axis=1)

movies.head()
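The year extraction relies on the title ending with the year in parentheses. A slightly stricter variant is sketched below with a regular expression; the names `YEAR_PATTERN` and `extract_year` are assumptions for this sketch, and the fallback value 1800 is taken from the function above.

```python
import re

# a 4-digit number in parentheses at the very end of the title
YEAR_PATTERN = re.compile(r'\((\d{4})\)\s*$')

def extract_year(title, default=1800):
    """Return the release year from a title like 'Toy Story (1995)', or default if absent."""
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else default

for toy_title in ['Toy Story (1995)', 'Toy Story (1995) ', 'Babylon 5', 'Code 12345']:
    print(toy_title, '->', extract_year(toy_title))
```

The last title is a made-up example of where plain slicing could misfire: a title ending in digits without parentheses would be read as a year by slicing, while the pattern falls back to the default.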

Now we are ready to group years. For this we define a function

#define function to group years
def set_year_group(year):
    if (year < 1900): return 0
    elif (1900 <= year <= 1975): return 1
    elif (1976 <= year <= 1995): return 2
    elif (1996 <= year <= 2003): return 3
    elif (2004 <= year <= 2009): return 4
    elif (2010 <= year): return 5
    else: return 0

movies['year_group'] = movies.apply(lambda x: set_year_group(x['year']), axis=1)

#no need title and year fields
movies.drop(['title','year'], axis = 1, inplace=True)

We need to calculate the mean and the count of ratings for each movie. After that we merge the result with the movies df

#'size' and 'mean' are passed as strings, which works the same as np.size and np.mean here
agg_movies_rat = ratings_data.groupby(['movieId']).agg({'rating': ['size', 'mean']}).reset_index()
agg_movies_rat.columns = ['movieId','rating_counts', 'rating_mean']

agg_movies_rat.head()

#define function to group rating counts
def set_rating_group(rating_counts):
    if (rating_counts <= 1): return 0
    elif (2 <= rating_counts <= 10): return 1
    elif (11 <= rating_counts <= 100): return 2
    elif (101 <= rating_counts <= 1000): return 3
    elif (1001 <= rating_counts <= 5000): return 4
    elif (5001 <= rating_counts): return 5
    else: return 0

agg_movies_rat['rating_group'] = agg_movies_rat.apply(lambda x: set_rating_group(x['rating_counts']), axis=1)

#no need rating_counts field
agg_movies_rat.drop('rating_counts', axis = 1, inplace=True)

mov_rating_df = movies.merge(agg_movies_rat, left_on='movieId', right_on='movieId', how='left')
mov_rating_df = mov_rating_df.fillna(0)

mov_rating_df.head()
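Both grouping functions are fixed-edge binning, so they can also be expressed with `pd.cut`. The sketch below is an alternative on toy values, not the code used above; `YEAR_EDGES`, `COUNT_EDGES` and the `toy` frame are assumptions, with the edges chosen to reproduce the same groups for whole-number inputs.

```python
import numpy as np
import pandas as pd

# bin edges chosen to match set_year_group and set_rating_group
YEAR_EDGES = [-np.inf, 1899, 1975, 1995, 2003, 2009, np.inf]
COUNT_EDGES = [-np.inf, 1, 10, 100, 1000, 5000, np.inf]

toy = pd.DataFrame({
    'year': [1800, 1900, 1975, 1976, 1996, 2004, 2010],
    'rating_counts': [1, 2, 10, 11, 101, 1001, 5001],
})

# labels=False returns the position of the bin (0 to 5);
# intervals are closed on the right by default, e.g. (1899, 1975]
toy['year_group'] = pd.cut(toy['year'], bins=YEAR_EDGES, labels=False)
toy['rating_group'] = pd.cut(toy['rating_counts'], bins=COUNT_EDGES, labels=False)
print(toy)
```

Keeping the edges in one list makes it easier to experiment with different group boundaries and see how the similarity results change.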

movies matrix

We have created 3 different datasets to calculate similarity, and we use cosine similarity for all of them. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space: the cosine of the angle between them. I use cosine similarity because I think it often gives more meaningful results on high-dimensional data. Before we calculate cosine similarity, we should set the “movieId” field as the index of the dfs

mov_tag_df = mov_tag_df.set_index('movieId')
mov_genres_df = mov_genres_df.set_index('movieId')
mov_rating_df = mov_rating_df.set_index('movieId')

I think we are ready. Let’s calculate cosine similarities!

#cosine similarity for mov_tag_df
cos_tag = cosine_similarity(mov_tag_df.values)*0.5

#cosine similarity for mov_genres_df
cos_genres = cosine_similarity(mov_genres_df.values)*0.25

#cosine similarity for mov_rating_df
cos_rating = cosine_similarity(mov_rating_df.values)*0.25

#mix
cos = cos_tag+cos_genres+cos_rating
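To make the weighted mix concrete, here is a small NumPy-only sketch on three hypothetical movies. The helper `cosine_matrix` and the toy arrays are assumptions for illustration; the helper gives all-zero rows a similarity of 0, which is also how I understand scikit-learn's `cosine_similarity` to treat them.

```python
import numpy as np

def cosine_matrix(features):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows get similarity 0 instead of NaN
    unit = features / norms
    return unit @ unit.T

# toy feature blocks for 3 hypothetical movies (values are made up)
toy_tags = np.array([[0.9, 0.1], [0.8, 0.2], [0.0, 0.0]])      # third movie has no tags
toy_genres = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 0]])
toy_rating = np.array([[3, 4.0, 5], [3, 3.5, 4], [1, 2.0, 1]])

# same weights as above: 0.5 for tags, 0.25 for genres, 0.25 for ratings
blended = (0.5 * cosine_matrix(toy_tags)
           + 0.25 * cosine_matrix(toy_genres)
           + 0.25 * cosine_matrix(toy_rating))
print(blended.round(3))
```

In this toy case the third movie has no tag data, so even its similarity to itself is only 0.5. Under the same assumption, movies without tag scores in the real data can never reach a blended score above 0.5.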

Now we can create a df from “cos”.

cols = mov_tag_df.index.values
inx = mov_tag_df.index

movies_sim = pd.DataFrame(cos, columns=cols, index=inx)

movies_sim.head()

Finally we create the “movies_similarity” df. It is one of the datasets which we will use in the graph db. We calculate the 5 most similar movies for each movie. To do this we need to define a function

def get_similar(movieId):
    df = movies_sim.loc[movies_sim.index == movieId].reset_index(). \
            melt(id_vars='movieId', var_name='sim_moveId', value_name='relevance'). \
            sort_values('relevance', axis=0, ascending=False)[1:6]
    return df

#create empty df
movies_similarity = pd.DataFrame(columns=['movieId','sim_moveId','relevance'])

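To fill the “movies_similarity” df we run a for loop. It runs for each movie and finds its 5 most similar movies

#DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate them once
sim_frames = []
for x in movies_sim.index.tolist():
    sim_frames.append(get_similar(x))
movies_similarity = pd.concat(sim_frames)

movies_similarity.head()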
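Running one melt and sort per movie can be slow on the full matrix. A possible alternative is to rank every row at once with NumPy, sketched below on a toy similarity matrix. The function `top_k_similar` and the toy data are assumptions; the output column names follow the ones used above. Unlike the positional `[1:6]` slice, this version excludes the movie itself explicitly.

```python
import numpy as np
import pandas as pd

def top_k_similar(sim_df, k=5):
    """Return a long table with the k most similar items for every row of sim_df."""
    sim = sim_df.to_numpy(dtype=float).copy()
    np.fill_diagonal(sim, -np.inf)              # never pair a movie with itself
    k = min(k, sim.shape[1] - 1)
    top_idx = np.argsort(-sim, axis=1)[:, :k]   # column positions of the k largest values per row
    ids = sim_df.index.to_numpy()
    return pd.DataFrame({
        'movieId': np.repeat(ids, k),
        'sim_moveId': sim_df.columns.to_numpy()[top_idx].ravel(),
        'relevance': np.take_along_axis(sim, top_idx, axis=1).ravel(),
    })

# toy similarity matrix for 4 hypothetical movies
toy_ids = [10, 20, 30, 40]
toy_sim = pd.DataFrame(
    [[1.0, 0.9, 0.2, 0.4],
     [0.9, 1.0, 0.3, 0.1],
     [0.2, 0.3, 1.0, 0.8],
     [0.4, 0.1, 0.8, 1.0]],
    index=pd.Index(toy_ids, name='movieId'), columns=toy_ids)

toy_top2 = top_k_similar(toy_sim, k=2)
print(toy_top2)
```

Note that the sketch copies the matrix before masking the diagonal, so for a large similarity matrix it needs roughly twice the memory.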

The most important dataset for our recommendation database is ready. We will create the other 6 datasets later. But before that, let’s check the recommendations :)

4- Check recommendations

The similarity data is ready to be imported into the graph db. We have already calculated the similarity of the movies, so we can define a function to get the 5 most similar movies for a given movie.

def movie_recommender(movieId):
    df = movies_sim.loc[movies_sim.index == movieId].reset_index(). \
            melt(id_vars='movieId', var_name='sim_moveId', value_name='relevance'). \
            sort_values('relevance', axis=0, ascending=False)[1:6]
    df['sim_moveId'] = df['sim_moveId'].astype(int)
    sim_df = movies_data.merge(df, left_on='movieId', right_on='sim_moveId', how='inner'). \
                sort_values('relevance', axis=0, ascending=False). \
                loc[: , ['movieId_y','title','genres']]. \
                rename(columns={ 'movieId_y': "movieId" })
    return sim_df

#get recommendation for Toy Story
movie_recommender(1)

#get recommendation for Inception
movie_recommender(79132)

#get recommendation for X-Men
movie_recommender(3793)

#get recommendation for Lock, Stock & Two Smoking Barrels
movie_recommender(2542)

#get recommendation for Casino Royale
movie_recommender(49272)

#get recommendation for Hangover Part II
movie_recommender(86911)

#get recommendation for Eternal Sunshine of the Spotless Mind
movie_recommender(7361)

#get recommendation for Scream 4
movie_recommender(86295)

I think we are good for now. The recommendation system seems to work well, and I believe the results are meaningful. We will continue in the next part: we will prepare the other datasets for the recommendation database and import the data into the graph db. Finally, we will write a query to get the best recommended movies for users in part-2!

You can find part-2 here (https://medium.com/@yesilliali/design-a-movie-recommendation-system-with-using-graph-database-neo4j-part-2-911becda9027)