Leveraging Movie Ratings for Recommendation Using Neo4j’s Graph Database

Published in Data Growing · 7 min read · May 6, 2019

From importing text data into Neo4j’s graph database to applying some graph theory that uncovers hidden value in the MovieLens ratings dataset.

Introduction

This article is part of an assignment for DPU’s BD 517 Social and Information Network Analysis class. One of the course’s topics is network analysis. Since generating personalized recommendations is one of the most common use cases for a graph database, I decided to write this article to explore the topic further.

It would have been impossible to complete this without our instructor, Eakasit Pacharawongsakda, who always encourages his students to perform their best. Thanks for being a good mentor and guiding me in the right direction.

What You will Learn

How to import text data into Neo4j’s graph database
How to run a simple network analysis using collaborative filtering
How to calculate cosine similarity and add it to the existing graph as new relationships

The Property Graph Model

The data model of graph databases is called the labeled property graph model.

Nodes: The entities in the data.
Labels: Each node can have one or more labels that specify the type of the node.
Relationships: Connect two nodes. They have a single direction and type.
Properties: Key-value pair properties can be stored on both nodes and relationships.
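To make the four building blocks above concrete, here is a minimal Python sketch of a property graph held in memory. The class names `Node` and `Relationship` are illustrative assumptions and have nothing to do with Neo4j's internal storage; the property names mirror the ones used later in this article.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    labels: set                                   # e.g. {"User"} or {"Movie"}
    properties: dict = field(default_factory=dict)

@dataclass
class Relationship:
    start: Node                                   # a relationship has a single direction...
    end: Node
    type: str                                     # ...and a single type
    properties: dict = field(default_factory=dict)

# One user who rated one movie, using the first row of the dataset.
user = Node({"User"}, {"userId": 1})
movie = Node({"Movie"}, {"movieId": 1})
rated = Relationship(user, movie, "RATED", {"rating": 5.0, "timeStamp": 838985046})

print(rated.type, rated.properties["rating"])
```

Both nodes and relationships carry their own key-value properties, which is what later lets us store a rating on each RATED relationship.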

Benefits of using graphs to generate recommendations include:

Performance. Index-free adjacency allows for calculating recommendations in real time, ensuring the recommendation is always relevant and reflecting up-to-date information.
Data model. The labeled property graph model allows for easily combining datasets from multiple sources, allowing enterprises to unlock value from previously separated data silos.

Network Repository

Network Repository is an interactive scientific network data repository. It offers a large collection of graph data ready to be used in network exploration and research. I chose MovieLens’s 10M rating data from the miscellaneous network category for this assignment.

Understanding MovieLens’s Data

MovieLens’s 10M rating network data consists of 10 million records, as its name suggests. Although the file’s extension is .edges, it is actually a plain-text edge list. The network stores 4 columns: user ID, movie ID, rating, and timestamp. The first 5 rows of the data are shown below:

1,1,5,838985046
1,2,5,838983525
1,3,5,838983392
1,4,5,838983421
1,5,5,838983392
…

Note that both user and movie IDs are integers, while the rating is a floating-point number from 1.0 to 5.0 (yes, values such as 1.5, 2.5, 3.5, and 4.5 do appear in this column). The timestamp can be useful in other applications, such as analyzing a specific period of time, but it is not the main feature of interest here.
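As a rough illustration of these column types, the sketch below parses edge-list lines into typed fields. The `Rating` tuple and `parse_rating` function are hypothetical helper names, not part of any library, and the sample lines are the first two rows shown above.

```python
from typing import NamedTuple

class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: float
    timestamp: int

def parse_rating(line: str) -> Rating:
    """Split one comma-separated edge-list line into typed fields."""
    user_id, movie_id, rating, timestamp = line.strip().split(",")
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))

sample = ["1,1,5,838985046", "1,2,5,838983525"]
ratings = [parse_rating(line) for line in sample]
print(ratings[0])
```

The rating is converted to a float even when the file shows a whole number such as 5, so half-star values fit the same column.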

Nodes: User and Movie are the node labels used in this dataset.
Relationship: RATED connects a user to a movie that the user rated.
Properties: userId is stored on User nodes, movieId on Movie nodes, and rating and timeStamp on RATED relationships.

For the sake of simplicity, I will only use the first 20,000 rows of the data for the network analysis in this article.
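One possible way to cut such a subset is sketched below. The article does not say how the 20,000-row file was produced, so this is only an assumption, and `first_rows` is a hypothetical helper name.

```python
from itertools import islice
from typing import Iterable, Iterator

def first_rows(lines: Iterable[str], n: int) -> Iterator[str]:
    """Yield at most the first n lines without reading the rest."""
    return islice(lines, n)

# With real files (the source path is a placeholder, not the actual file name):
# with open("full_ratings.edges") as src, open("movielens-20k_rating.edges", "w") as dst:
#     dst.writelines(first_rows(src, 20_000))

# Small in-memory demonstration with made-up lines.
demo = [f"1,{i},5,838985046\n" for i in range(1, 8)]
subset = list(first_rows(demo, 5))
print(len(subset))
```

Because `islice` stops after n lines, the full 10-million-row file never has to be loaded into memory.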

Intro to Cypher

To work with our labeled property graph, we use Cypher, Neo4j’s query language for graphs. You can use the Cypher Refcard for easy reference when dealing with Cypher syntax.

Graph Patterns

Cypher is the query language for graphs and is centered around graph patterns. Graph patterns are expressed in Cypher using an ASCII-art-like syntax.

Nodes: Nodes are defined within parentheses (). Optionally, we can specify node label(s): (:Movie)
Relationships: Relationships are defined within square brackets []. Optionally we can specify type and direction: (:User)-[:RATED]->(:Movie)
Aliases: Graph elements can be bound to aliases that can be referred to later in the query: (u:User)-[r:RATED]->(m:Movie)

Importing Data into Neo4j Desktop

Open the Neo4j Desktop application. Select an existing project or create a new one to start. Click ‘Add Graph’ to create a new graph database and select ‘Create a Local Graph’. Enter a graph name and password, then press the ‘Create’ button.

Figure 1: Creating and starting a Neo4j’s database instance.

Press the ‘Manage’ button to configure the database instance. Then choose ‘Import’ from the menu of the ‘Open Folder’ button under the instance’s name. A Finder (macOS) or Windows Explorer (Windows) window will appear; this folder is where we store our edge list file (movielens-20k_rating.edges). Move the file into the opened import folder so we can refer to it later when importing into Neo4j’s graph database.

Figure 2: Opening the import folder, which stores the file to be imported with Cypher.

Press the ‘Open Browser’ button next to the ‘Open Folder’ button to open the Neo4j Browser window. Import the ratings file using the Cypher below. Make sure the returned result shows a count(rel) of 20,000 relationships.

LOAD CSV FROM 'file:///movielens-20k_rating.edges' AS row
WITH toInteger(row[0]) AS userId, toInteger(row[1]) AS movieId, toFloat(row[2]) AS rating, toInteger(row[3]) AS timeStamp
MERGE (u:User {userId: userId})
MERGE (m:Movie {movieId: movieId})
MERGE (u)-[rel:RATED {rating: rating, timeStamp: timeStamp}]->(m)
RETURN count(rel)

Preview the imported graph using the Cypher below. The result should be similar to Figure 3. You can adjust node and edge colors by clicking the Movie or User labels. Edges (or relationships) can be configured to display the relationship’s type (RATED in this sample data) or each relationship’s rating score, as shown in the figure below. In Neo4j, these values that describe nodes or edges are known as properties.

MATCH (m:Movie)<-[r:RATED]-(u:User) RETURN m,r,u LIMIT 100

Figure 3: Sample graph showing User nodes (blue) and Movie nodes (purple) with their relationships (rating scores).

Collaborative Filtering

Given a movieId, we can find a set of movies to recommend to other users using the Cypher below. The query simply means: “users who watched (rated) this movie (the movie with movieId 1) also watched these other movies.”

MATCH (m:Movie {movieId: 1})<-[:RATED]-(u:User)-[:RATED]->(rec:Movie)
RETURN rec.movieId AS recommendation, COUNT(*) AS usersWhoAlsoWatched
ORDER BY usersWhoAlsoWatched DESC LIMIT 25

Figure 4: Result of movie recommendation using collaborative filtering.
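The same "also watched" counting can be sketched outside the database. The Python below is only an illustration of the idea behind the query. It does not reproduce the result in Figure 4, because the (user, movie) pairs are invented, and `also_watched` is a hypothetical function name.

```python
from collections import Counter, defaultdict

# Hypothetical (user_id, movie_id) pairs, invented for illustration only.
rated = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 1), (3, 4), (4, 2)]

def also_watched(pairs, movie_id, limit=25):
    """Count how many raters of movie_id also rated each other movie."""
    movies_by_user = defaultdict(set)
    for user, movie in pairs:
        movies_by_user[user].add(movie)
    counts = Counter()
    for movies in movies_by_user.values():
        if movie_id in movies:                    # this user rated the target movie
            counts.update(movies - {movie_id})    # count every other movie they rated
    return counts.most_common(limit)              # highest counts first

recommendations = also_watched(rated, movie_id=1)
print(recommendations)
```

In this toy data, movie 2 ranks first because two of the three users who rated movie 1 also rated it. Note that, like the Cypher query, this counts co-occurrences only and ignores the rating values themselves.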

Graphically, we can display the result as nodes and relationships using this Cypher:

MATCH (m:Movie {movieId: 1})<-[:RATED]-(u:User)-[:RATED]->(rec:Movie)
RETURN m,u,rec

Figure 5: Nodes and relationships showing movie recommendation using preferences from other users. Red arrows point to the movie node with a movieId of 1, which is the node we are considering in this query.

Cosine Similarity

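Cosine similarity is the cosine of the angle between two n-dimensional vectors in an n-dimensional space. It is the dot product of the two vectors divided by the product of the two vectors’ lengths (or magnitudes). For two vectors A and B in an n-dimensional space:

$$\text{similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$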

Cosine similarity ranges between -1 and 1, where -1 is perfectly dissimilar and 1 is perfectly similar. Because ratings are never negative, the values in this dataset fall between 0 and 1.

With the movie ratings dataset, each relationship has a weight (rating) that we can also take into account. The cosine similarity of two users indicates how similar these two users’ preferences for movies are. Users with high cosine similarity will likely have similar movies in their libraries or purchase histories.

From the result of collaborative filtering above (see Figure 5), I noticed that user 1 and user 137 have watched some of the same movies. Let’s consider how similar they are using their movie ratings. To find the ratings they gave to the movies they both rated, enter the Cypher below:

MATCH (p1:User {userId: 1})-[r1:RATED]->(m:Movie)<-[r2:RATED]-(p2:User {userId: 137})
RETURN m.movieId AS Movie, r1.rating AS `u1's Rating`, r2.rating AS `u137's Rating`

We can represent their ratings as vectors whose coordinates are the ratings each user gave to the movies they both rated:

A = user1’s rating vector = ⟨5.0, 5.0, 5.0, 5.0, 5.0, 5.0⟩
B = user137’s rating vector = ⟨2.5, 1.5, 4.0, 2.5, 3.0, 1.5⟩

Similarity(A,B) = 0.94491118252
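The figure above can be checked by hand: the dot product is 75, the lengths are √150 and √42, and 75 / √6300 ≈ 0.9449. A short Python sketch of the same arithmetic follows; `cosine_similarity` is a helper name chosen here for illustration.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)

A = [5.0, 5.0, 5.0, 5.0, 5.0, 5.0]   # user 1's ratings
B = [2.5, 1.5, 4.0, 2.5, 3.0, 1.5]   # user 137's ratings

similarity = cosine_similarity(A, B)
print(round(similarity, 5))  # 0.94491
```

Note that the score is high even though user 137 rated every movie lower than user 1 did: cosine similarity compares the direction of the two vectors, not their magnitudes.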

Adding Cosine Similarity to the Graph

To create a [:SIMILARITY] relationship between each pair of users who rated at least one movie in common, with their cosine similarity stored as a property of the relationship, enter this Cypher:

MATCH (u1:User)-[x:RATED]->(m:Movie)<-[y:RATED]-(u2:User)
WITH  SUM(x.rating * y.rating) AS xyDotProduct,
      SQRT(REDUCE(xDot = 0.0, a IN COLLECT(x.rating) | xDot + a^2)) AS xLength,
      SQRT(REDUCE(yDot = 0.0, b IN COLLECT(y.rating) | yDot + b^2)) AS yLength,
      u1, u2
MERGE (u1)-[s:SIMILARITY]-(u2)
SET   s.similarity = xyDotProduct / (xLength * yLength)

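The query should finish with a summary like the one below. Each pair of users is matched in both orders, so the similarity property is set twice for every relationship created: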

Set 21370 properties, created 10685 relationships, completed after 3856 ms.
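To show what the query computes for each pair, here is a small Python sketch of the same logic: for every two users, only the movies both have rated enter the dot product and the two lengths. The rating triples are invented for illustration, and `pairwise_similarity` is a hypothetical name. Unlike the Cypher, the sketch visits each pair once.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical (user_id, movie_id, rating) triples, invented for illustration.
ratings = [
    (1, 10, 5.0), (1, 11, 5.0), (1, 12, 5.0),
    (2, 10, 2.5), (2, 11, 1.5),
    (3, 12, 4.0),
]

def pairwise_similarity(triples):
    """Cosine similarity per user pair, computed over co-rated movies only."""
    by_user = defaultdict(dict)
    for user, movie, rating in triples:
        by_user[user][movie] = rating
    result = {}
    for u1, u2 in combinations(sorted(by_user), 2):
        shared = by_user[u1].keys() & by_user[u2].keys()
        if not shared:
            continue  # no co-rated movie, so no SIMILARITY relationship
        dot = sum(by_user[u1][m] * by_user[u2][m] for m in shared)
        len1 = math.sqrt(sum(by_user[u1][m] ** 2 for m in shared))
        len2 = math.sqrt(sum(by_user[u2][m] ** 2 for m in shared))
        result[(u1, u2)] = dot / (len1 * len2)
    return result

similarities = pairwise_similarity(ratings)
print(similarities)
```

One caveat this makes visible: users 1 and 3 share a single movie, so their similarity is exactly 1.0 regardless of the ratings. Pairs with very few co-rated movies can therefore look more similar than they really are.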

Let’s check the cosine similarity between user 1 and user 137 that we just created and compare it with our earlier calculation, using this Cypher:

MATCH (u1:User {userId: 1})-[s:SIMILARITY]-(u2:User {userId: 137})
RETURN s.similarity AS `Cosine Similarity`

To see samples of how users are connected to each other through their similarities, use this Cypher to query users adjacent to user 1 with a cosine similarity greater than 0.8.

Note that, without the LIMIT clause, it would take a significant amount of time to display every single node connected to the node we are considering.

MATCH (u1:User {userId: 1})-[s:SIMILARITY]-(u2:User)
WHERE s.similarity > 0.8
RETURN u1,s,u2 LIMIT 5