Design a Movie Recommendation System with using Graph Database (neo4j) — part 2
In fist part we created movies matrix. It includes similarity rates 0–1 through movies. We created also “movies_similarity” dataset and it includes 5 rows for each movies. 5 movies of the most similar ones. We checked recommendation for some popular movies and I think result was good.
Now it is time to create other 6 datasets and work on graph database side. You can find part-1 here (https://medium.com/@yesilliali/design-a-movie-recommendation-system-with-using-graph-database-neo4j-part-1-2c4933f4da0a).
5- Prepare other datasets
We create 6 datasets, you can find details of datasets in first part. We create “users”, “movies”, “genres”, “users_movies”, “movies_genres”, “users_genres” datasets. We use “users”, “movies” and “genres” datasets for nodes and others for relationships. Let’s start!
users
We use ‘ratings.csv’ data and extract unique userIds.
users_df = pd.DataFrame(ratings_data['userId'].unique(), columns=['userId'])users_df.head()
movies
We use ‘movies.csv’ data for “movies” dataset. We need mean of ratings so first we calculate it and then merge it ‘movies.csv’ data
#create movies_df
movies_df = movies_data.drop('genres', axis = 1)#calculate mean of ratings for each movies
agg_rating_avg = ratings_data.groupby(['movieId']).agg({'rating': np.mean}).reset_index()agg_rating_avg.columns = ['movieId', 'rating_mean']#merge
movies_df = movies_df.merge(agg_rating_avg, left_on='movieId', right_on='movieId', how='left')movies_df.head()
genres
genres = [
"Action",
"Adventure",
"Animation",
"Children",
"Comedy",
"Crime",
"Documentary",
"Drama",
"Fantasy",
"Film-Noir",
"Horror",
"Musical",
"Mystery",
"Romance",
"Sci-Fi",
"Thriller",
"War",
"Western",
"(no genres listed)"]genres_df = pd.DataFrame(genres, columns=['genres'])genres_df.head()
users_movies
We use ‘ratings.csv’ data and we extract userId, movieId and rating fields
users_movies_df = ratings_data.drop('timestamp', axis = 1)users_movies_df.head()
movies_genres
We have genres information in ‘movies.csv’ data. But we should split it and then it should be unpivoted. So we will obtain multiple rows for each movies based on genres.
movies_genres_df = movies_data.drop('title', axis = 1)
To split genres field and unpivot it, we should define a function
#define a function to split genres field
def get_movie_genres(movieId):
movie = movies_genres_df[movies_genres_df['movieId']==movieId]
genres = movie['genres'].tolist()
df = pd.DataFrame([b for a in [i.split('|') for i in genres] for b in a], columns=['genres'])
df.insert(loc=0, column='movieId', value=movieId)
return df
Now we are ready, we use a for loop to extract genres for each movies
#create empty df
movies_genres=pd.DataFrame(columns=['movieId','genres'])for x in movies_genres_df['movieId'].tolist():
movies_genres=movies_genres.append(get_movie_genres(x))movies_genres.head()
users_genres
We use “users_genres” to get a relation between “Users” and “Genres” nodes. We need to find favorite movie genre for each users. To do this we need to analyze all movies which are watched by users and get genres of the movies then we can extract the favorite one which is using counts of the movie genres.
#join to movies data to get genre information
user_genres_df = ratings_data.merge(movies_data, left_on='movieId', right_on='movieId', how='left')#drop columns that will not be used
user_genres_df.drop(['movieId','rating','timestamp','title'], axis = 1, inplace=True)user_genres_df.head()
We need to define a function to obtain favorite movie genre for each users. It is based on counts of movie genres.
def get_favorite_genre(userId):
user = user_genres_df[user_genres_df['userId']==userId]
genres = user['genres'].tolist()
movie_list = [b for a in [i.split('|') for i in genres] for b in a]
counter = Counter(movie_list)
return counter.most_common(1)[0][0]
We need to run a for loop to extract favorite movie genre for each users
#create empty df
users_genres = pd.DataFrame(columns=['userId','genre'])for x in user_df['userId'].tolist():
users_genres = users_genres.append(pd.DataFrame([[x,get_favorite_genre(x)]], columns=['userId','genre']))users_genres.head()
6- Import data to neo4j
We create 7 datasets so we are ready to create recommendation database in neo4j. First let’s export our datasets as csv files.
users_df.to_csv('users.csv', sep='|', header=True, index=False)movies_df.to_csv('movies.csv', sep='|', header=True, index=False)genres_df.to_csv('genres.csv', sep='|', header=True, index=False)users_movies_df.to_csv('users_movies.csv', sep='|', header=True, index=False)movies_genres.to_csv('movies_genres.csv', sep='|', header=True, index=False)users_genres.to_csv('users_genres.csv', sep='|', header=True, index=False)movies_similarity.to_csv('movies_similarity.csv', sep='|', header=True, index=False)
We use neo4j as a graph database. I would like to give some information about neo4j.
Neo4j is a graph database management system developed by Neo4j, Inc. Described by its developers as an ACID-compliant transactional database with native graph storage and processing, Neo4j is the most popular graph database according to DB-Engines ranking, and the 22ⁿᵈ most popular database overall.(wikipedia)
You can download neo4j from here (https://neo4j.com/download/). It is easy to setup and start to use it. Its UI is very user friendly and there are a lot of documentations.
To create a new database, clik to “Add Graph” field. We create recommendation database. After create new database we need to change a setting to import files from all folders. To do this click to “Manage” field
Then go to “Settings” field and comment out to “dbms.directories.import=import”.
It looks like we are ready. Start the database and you will see it in the “Active database” field. Click to “Open Browser” to work on the database.
Let’s start to load nodes first
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM “file:///users.csv” AS row
FIELDTERMINATOR ‘|’
CREATE (:Users {userId: row.userId});
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM “file:///movies.csv” AS row
FIELDTERMINATOR ‘|’
CREATE (:Movies {movieId: row.movieId, title: row.title, rating_mean: row.rating_mean});
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM “file:///genres.csv” AS row
FIELDTERMINATOR ‘|’
CREATE (:Genres {genres: row.genres});
It is good to create indexes before create relationships
CREATE INDEX ON :Users(userId);CREATE INDEX ON :Movies(movieId);
Nodes are ready. It is time create relationships though nodes. We have 4 relationships.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM “file:///users_movies.csv” AS row
FIELDTERMINATOR ‘|’
MATCH (user:Users {userId: row.userId})
MATCH (movie:Movies {movieId: row.movieId})
MERGE (user)-[:WATCHED {rating: row.rating}]->(movie);
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM “file:///users_genres.csv” AS row
FIELDTERMINATOR ‘|’
MATCH (user:Users {userId: row.userId})
MATCH (genres:Genres {genres: row.genre})
MERGE (user)-[:FAVORITE]->(genres);
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM “file:///movies_genres.csv” AS row
FIELDTERMINATOR ‘|’
MATCH (movie:Movies {movieId: row.movieId})
MATCH (genres:Genres {genres: row.genres})
MERGE (movie)-[:GENRES]->(genres);
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM “file:///movies_similarity.csv” AS row
FIELDTERMINATOR ‘|’
MATCH (movie1:Movies {movieId: row.movieId})
MATCH (movie2:Movies {movieId: row.sim_moveId})
MERGE (movie1)-[:SIMILAR {relevance: row.relevance}]->(movie2);
It is done! We create recommendation database so we can start to work on it. We have 165790 nodes and 20329354 relationships
7- Query for recommendation
We use Cpyher Query Language to extract data from neo4j database.
Cypher is a declarative graph query language that allows for expressive and efficient querying and updating of a property graph. Cypher is a relatively simple but still very powerful language. Very complicated database queries can easily be expressed through Cypher.(wikipedia)
Let’s review a user and check movies which are watched by him
MATCH path = (u:Users)-[:WATCHED]->(m1:Movies)
WHERE u.userId =~’4'
RETURN u.userId, m1.title, m1.rating_mean
MATCH path = (u:Users)-[:FAVORITE]->(g:Genres)
WHERE u.userId =~’4'
RETURN u.userId, g.genres
He watched 28 movies and his favorite movie genre is ‘Action’. Let’s find similar movies based on the movies which are watched by him
MATCH path = (u:Users)-[:WATCHED]->(m1:Movies)-[s:SIMILAR]->(m2:Movies)
WHERE u.userId =~’4'
RETURN u.userId, m1.title, m2.title, m2.rating_mean
There are 140 movies (28*5). But we are interesting in only ‘Action’ movies
MATCH path = (u:Users)-[:WATCHED]->(m1:Movies)-[s:SIMILAR]->(m2:Movies),
(m2)-[:GENRES]->(g:Genres),
(u)-[:FAVORITE]->(g)
WHERE u.userId =~’4'
RETURN u.userId, g.genres, m1.title, m2.title, m2.rating_mean
We are almost there we need to filter movies which are already watched then sort them based on ratings and get 5 of them
MATCH (u1:Users)-[:WATCHED]->(m3:Movies)
WHERE u1.userId =~’4'
WITH [i in m3.movieId | i] as movies
MATCH path = (u:Users)-[:WATCHED]->(m1:Movies)-[s:SIMILAR]->(m2:Movies),
(m2)-[:GENRES]->(g:Genres),
(u)-[:FAVORITE]->(g)
WHERE u.userId =~’4' and not m2.movieId in movies
RETURN distinct u.userId as userId, g.genres as genres,
m2.title as title, m2.rating_mean as rating
ORDER BY m2.rating_mean descending
LIMIT 5
That’s it. We can use this query to recommend a movie to user. Let’s get recommendations for another user
MATCH path = (u:Users)-[:WATCHED]->(m1:Movies)
WHERE u.userId =~’1905'
RETURN u.userId, m1.title, m1.rating_mean
MATCH (u1:Users)-[:WATCHED]->(m3:Movies)
WHERE u1.userId =~’1905'
WITH [i in m3.movieId | i] as movies
MATCH path = (u:Users)-[:WATCHED]->(m1:Movies)-[s:SIMILAR]->(m2:Movies),
(m2)-[:GENRES]->(g:Genres),
(u)-[:FAVORITE]->(g)
WHERE u.userId =~’1905' and not m2.movieId in movies
RETURN distinct u.userId as userId, g.genres as genres,
m2.title as title, m2.rating_mean as rating
ORDER BY m2.rating_mean descending
LIMIT 5
I think it is done! We are ready to recommend a movie based on user’s last behaviors