Design a Movie Recommendation System with using Graph Database (neo4j) — part 2

Ali Yesilli
8 min readOct 29, 2018

--

In fist part we created movies matrix. It includes similarity rates 0–1 through movies. We created also “movies_similarity” dataset and it includes 5 rows for each movies. 5 movies of the most similar ones. We checked recommendation for some popular movies and I think result was good.

Now it is time to create other 6 datasets and work on graph database side. You can find part-1 here (https://medium.com/@yesilliali/design-a-movie-recommendation-system-with-using-graph-database-neo4j-part-1-2c4933f4da0a).

5- Prepare other datasets

We create 6 datasets, you can find details of datasets in first part. We create “users”, “movies”, “genres”, “users_movies”, “movies_genres”, “users_genres” datasets. We use “users”, “movies” and “genres” datasets for nodes and others for relationships. Let’s start!

users

We use ‘ratings.csv’ data and extract unique userIds.

users_df = pd.DataFrame(ratings_data['userId'].unique(), columns=['userId'])users_df.head()

movies

We use ‘movies.csv’ data for “movies” dataset. We need mean of ratings so first we calculate it and then merge it ‘movies.csv’ data

#create movies_df
movies_df = movies_data.drop('genres', axis = 1)
#calculate mean of ratings for each movies
agg_rating_avg = ratings_data.groupby(['movieId']).agg({'rating': np.mean}).reset_index()
agg_rating_avg.columns = ['movieId', 'rating_mean']#merge
movies_df = movies_df.merge(agg_rating_avg, left_on='movieId', right_on='movieId', how='left')
movies_df.head()

genres

genres = [
"Action",
"Adventure",
"Animation",
"Children",
"Comedy",
"Crime",
"Documentary",
"Drama",
"Fantasy",
"Film-Noir",
"Horror",
"Musical",
"Mystery",
"Romance",
"Sci-Fi",
"Thriller",
"War",
"Western",
"(no genres listed)"]
genres_df = pd.DataFrame(genres, columns=['genres'])genres_df.head()

users_movies

We use ‘ratings.csv’ data and we extract userId, movieId and rating fields

users_movies_df = ratings_data.drop('timestamp', axis = 1)users_movies_df.head()

movies_genres

We have genres information in ‘movies.csv’ data. But we should split it and then it should be unpivoted. So we will obtain multiple rows for each movies based on genres.

movies_genres_df = movies_data.drop('title', axis = 1)

To split genres field and unpivot it, we should define a function

#define a function to split genres field
def get_movie_genres(movieId):
movie = movies_genres_df[movies_genres_df['movieId']==movieId]
genres = movie['genres'].tolist()
df = pd.DataFrame([b for a in [i.split('|') for i in genres] for b in a], columns=['genres'])
df.insert(loc=0, column='movieId', value=movieId)
return df

Now we are ready, we use a for loop to extract genres for each movies

#create empty df
movies_genres=pd.DataFrame(columns=['movieId','genres'])
for x in movies_genres_df['movieId'].tolist():
movies_genres=movies_genres.append(get_movie_genres(x))
movies_genres.head()

users_genres

We use “users_genres” to get a relation between “Users” and “Genres” nodes. We need to find favorite movie genre for each users. To do this we need to analyze all movies which are watched by users and get genres of the movies then we can extract the favorite one which is using counts of the movie genres.

#join to movies data to get genre information
user_genres_df = ratings_data.merge(movies_data, left_on='movieId', right_on='movieId', how='left')
#drop columns that will not be used
user_genres_df.drop(['movieId','rating','timestamp','title'], axis = 1, inplace=True)
user_genres_df.head()

We need to define a function to obtain favorite movie genre for each users. It is based on counts of movie genres.

def get_favorite_genre(userId):
user = user_genres_df[user_genres_df['userId']==userId]
genres = user['genres'].tolist()
movie_list = [b for a in [i.split('|') for i in genres] for b in a]
counter = Counter(movie_list)
return counter.most_common(1)[0][0]

We need to run a for loop to extract favorite movie genre for each users

#create empty df
users_genres = pd.DataFrame(columns=['userId','genre'])
for x in user_df['userId'].tolist():
users_genres = users_genres.append(pd.DataFrame([[x,get_favorite_genre(x)]], columns=['userId','genre']))
users_genres.head()

6- Import data to neo4j

We create 7 datasets so we are ready to create recommendation database in neo4j. First let’s export our datasets as csv files.

users_df.to_csv('users.csv', sep='|', header=True, index=False)movies_df.to_csv('movies.csv', sep='|', header=True, index=False)genres_df.to_csv('genres.csv', sep='|', header=True, index=False)users_movies_df.to_csv('users_movies.csv', sep='|', header=True, index=False)movies_genres.to_csv('movies_genres.csv', sep='|', header=True, index=False)users_genres.to_csv('users_genres.csv', sep='|', header=True, index=False)movies_similarity.to_csv('movies_similarity.csv', sep='|', header=True, index=False)

We use neo4j as a graph database. I would like to give some information about neo4j.

Neo4j is a graph database management system developed by Neo4j, Inc. Described by its developers as an ACID-compliant transactional database with native graph storage and processing, Neo4j is the most popular graph database according to DB-Engines ranking, and the 22ⁿᵈ most popular database overall.(wikipedia)

You can download neo4j from here (https://neo4j.com/download/). It is easy to setup and start to use it. Its UI is very user friendly and there are a lot of documentations.

To create a new database, clik to “Add Graph” field. We create recommendation database. After create new database we need to change a setting to import files from all folders. To do this click to “Manage” field

Then go to “Settings” field and comment out to “dbms.directories.import=import”.

It looks like we are ready. Start the database and you will see it in the “Active database” field. Click to “Open Browser” to work on the database.

Let’s start to load nodes first

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM “file:///users.csv” AS row
FIELDTERMINATOR ‘|’
CREATE (:Users {userId: row.userId});
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM “file:///movies.csv” AS row
FIELDTERMINATOR ‘|’
CREATE (:Movies {movieId: row.movieId, title: row.title, rating_mean: row.rating_mean});
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM “file:///genres.csv” AS row
FIELDTERMINATOR ‘|’
CREATE (:Genres {genres: row.genres});

It is good to create indexes before create relationships

CREATE INDEX ON :Users(userId);CREATE INDEX ON :Movies(movieId);

Nodes are ready. It is time create relationships though nodes. We have 4 relationships.

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM “file:///users_movies.csv” AS row
FIELDTERMINATOR ‘|’
MATCH (user:Users {userId: row.userId})
MATCH (movie:Movies {movieId: row.movieId})
MERGE (user)-[:WATCHED {rating: row.rating}]->(movie);
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM “file:///users_genres.csv” AS row
FIELDTERMINATOR ‘|’
MATCH (user:Users {userId: row.userId})
MATCH (genres:Genres {genres: row.genre})
MERGE (user)-[:FAVORITE]->(genres);
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM “file:///movies_genres.csv” AS row
FIELDTERMINATOR ‘|’
MATCH (movie:Movies {movieId: row.movieId})
MATCH (genres:Genres {genres: row.genres})
MERGE (movie)-[:GENRES]->(genres);
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM “file:///movies_similarity.csv” AS row
FIELDTERMINATOR ‘|’
MATCH (movie1:Movies {movieId: row.movieId})
MATCH (movie2:Movies {movieId: row.sim_moveId})
MERGE (movie1)-[:SIMILAR {relevance: row.relevance}]->(movie2);

It is done! We create recommendation database so we can start to work on it. We have 165790 nodes and 20329354 relationships

7- Query for recommendation

We use Cpyher Query Language to extract data from neo4j database.

Cypher is a declarative graph query language that allows for expressive and efficient querying and updating of a property graph. Cypher is a relatively simple but still very powerful language. Very complicated database queries can easily be expressed through Cypher.(wikipedia)

Let’s review a user and check movies which are watched by him

MATCH path = (u:Users)-[:WATCHED]->(m1:Movies)
WHERE u.userId =~’4'
RETURN u.userId, m1.title, m1.rating_mean
MATCH path = (u:Users)-[:FAVORITE]->(g:Genres)
WHERE u.userId =~’4'
RETURN u.userId, g.genres

He watched 28 movies and his favorite movie genre is ‘Action’. Let’s find similar movies based on the movies which are watched by him

MATCH path = (u:Users)-[:WATCHED]->(m1:Movies)-[s:SIMILAR]->(m2:Movies)
WHERE u.userId =~’4'
RETURN u.userId, m1.title, m2.title, m2.rating_mean

There are 140 movies (28*5). But we are interesting in only ‘Action’ movies

MATCH path = (u:Users)-[:WATCHED]->(m1:Movies)-[s:SIMILAR]->(m2:Movies),
(m2)-[:GENRES]->(g:Genres),
(u)-[:FAVORITE]->(g)
WHERE u.userId =~’4'
RETURN u.userId, g.genres, m1.title, m2.title, m2.rating_mean

We are almost there we need to filter movies which are already watched then sort them based on ratings and get 5 of them

MATCH (u1:Users)-[:WATCHED]->(m3:Movies)
WHERE u1.userId =~’4'
WITH [i in m3.movieId | i] as movies
MATCH path = (u:Users)-[:WATCHED]->(m1:Movies)-[s:SIMILAR]->(m2:Movies),
(m2)-[:GENRES]->(g:Genres),
(u)-[:FAVORITE]->(g)
WHERE u.userId =~’4' and not m2.movieId in movies
RETURN distinct u.userId as userId, g.genres as genres,
m2.title as title, m2.rating_mean as rating
ORDER BY m2.rating_mean descending
LIMIT 5

That’s it. We can use this query to recommend a movie to user. Let’s get recommendations for another user

MATCH path = (u:Users)-[:WATCHED]->(m1:Movies)
WHERE u.userId =~’1905'
RETURN u.userId, m1.title, m1.rating_mean
MATCH (u1:Users)-[:WATCHED]->(m3:Movies)
WHERE u1.userId =~’1905'
WITH [i in m3.movieId | i] as movies
MATCH path = (u:Users)-[:WATCHED]->(m1:Movies)-[s:SIMILAR]->(m2:Movies),
(m2)-[:GENRES]->(g:Genres),
(u)-[:FAVORITE]->(g)
WHERE u.userId =~’1905' and not m2.movieId in movies
RETURN distinct u.userId as userId, g.genres as genres,
m2.title as title, m2.rating_mean as rating
ORDER BY m2.rating_mean descending
LIMIT 5

I think it is done! We are ready to recommend a movie based on user’s last behaviors

--

--