How to create a recommendation engine in Neo4j

Leonardo Marrancone
LARUS
6 min read · Jul 28, 2020

In this article, I’m going to illustrate my experience building recommendation algorithms with Neo4j.

What is a recommendation engine?

Recommendation algorithms are widely used, especially in social networks and e-commerce.
For instance, in social networks, there is always a dedicated section where we receive suggestions regarding pages to follow, people to connect with, and so on. The same applies to e-commerce. Often, while we are looking for a product, we find a section containing items similar to what we are viewing at that moment.
Behind all that, there are the recommendation algorithms.

From zero to recommendation engine in Neo4j

Neo4j also provides the infrastructure to build recommendation algorithms. In particular, my experience focuses on two main groups: Content-Based Filtering & Collaborative-Filtering.

Content-Based Filtering: Recommend items similar to those a user is currently viewing, has rated highly, or has purchased previously.

Collaborative-Filtering: Use the preferences, ratings, and actions of other users in the network to find items to recommend.

Both approaches use similarity measures from Neo4j’s Graph Data Science Library to decide what to recommend and how.

The Neo4j Graph Data Science Library helps make valuable predictions, such as pinpointing interactions that indicate fraud, identifying similar entities or individuals, finding the most influential elements in patient or customer journeys, and limiting the spread of IT or phone outages.
Data scientists benefit from a customized, flexible data structure for global computations and a repository of powerful, robust algorithms to quickly compute results over tens of billions of nodes.

If you want to dig deeper into the library, here is the link to its GitHub repository: https://github.com/neo4j/graph-data-science.

We are going to explore both use cases: first Content-Based Filtering, then Collaborative Filtering.

Data structure

Before starting, let’s briefly describe the data structure on which we will build the algorithms.
We have three types of nodes:
- Users (red nodes);
- TV Shows (grey nodes);
- Categories (yellow nodes).
In detail:
- a Show can have one or more Categories;
- a User can watch one or more Shows.
So the relationships between the nodes we’re going to create are:
- WATCH, between a User and a Show;
- HAS_CATEGORY, between a Show and a Category.

You can find the dataset that I used for this example at the following link:
https://github.com/LeonardoMarrancone/recommendation-engine-neo4j
To import the dataset:
- click the link above;
- on the GitHub page, click the green “Code” button;
- click “Download ZIP”;
- open the folder and copy the dataset into your Neo4j folder, inside the import directory (as in the following image);

- restart Neo4j;
- run the following queries in the order in which they appear:

https://gist.github.com/LeonardoMarrancone/964e3cb77c726d763a9f34e5615dd339

https://gist.github.com/LeonardoMarrancone/73f53b44f351379f59a0f78a044cd858

https://gist.github.com/LeonardoMarrancone/efae364283e29013a6e09d7f0894f60b

https://gist.github.com/LeonardoMarrancone/1da927aa9e40a3affa87d31faa0ecbb9

https://gist.github.com/LeonardoMarrancone/034f890727a38ae9f43a56a442ff0712

Content-Based Filtering

This type of algorithm searches for and suggests content very similar to what the user already likes (hence “Content-Based”).
In this specific case, we are going to suggest a new TV show that the user has never watched.
To achieve this goal, we will use a similarity measure from Neo4j’s Graph Data Science Library (from now on, GDSL) called the Jaccard index.

The Jaccard index is a number between 0 and 1 that indicates how similar two sets are. The Jaccard index of two identical sets is 1. If two sets have no element in common, the Jaccard index is 0. The Jaccard index is calculated by dividing the size of the intersection of the two sets by the size of their union.
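As a quick illustration (plain Python, not the GDSL call itself; the category names are made up), the Jaccard index of two category sets can be computed like this:

```python
def jaccard(a, b):
    """Jaccard index: |intersection| / |union| of two sets."""
    a, b = set(a), set(b)
    if not a | b:
        return 0.0  # convention: two empty sets give 0
    return len(a & b) / len(a | b)

# Categories of the shows the user already watches vs. a candidate show
watched = {"Crime", "Drama", "Thriller"}
candidate = {"Crime", "Drama", "Comedy"}

print(jaccard(watched, candidate))  # 2 shared / 4 total = 0.5
```

The closer the result is to 1, the more the candidate show’s categories overlap with what the user already watches.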

First of all, we find the shows the user has seen and extract their categories. These categories are used to find shows the user has never seen. After that, we create two sets based on the categories linked to the previously selected shows.
At this point, we can compare the two sets using the Jaccard similarity index.
The result is ordered by the Jaccard score.
Below is the query implementing all the Jaccard steps.

https://gist.github.com/LeonardoMarrancone/c38eaa0fdbc4d5e409386dd4305af1d9

And we get the following result:

The same result can be obtained simply by using the Jaccard function included in Neo4j’s GDSL.

https://gist.github.com/LeonardoMarrancone/7ea5620397c907beda95846b688bc47b

As expected, the algorithm’s first pick is a show with the same category as the shows the user watches. This means the user is very likely to enjoy the recommended shows, because they belong to topics the user has already watched.

This was an example of a Content-Based Filtering algorithm.

Collaborative-Filtering

As its definition at the beginning of this article says, this type of algorithm is not based on content: it analyzes how similar two users are.
For this example, we use another similarity index from the GDSL, called Cosine Similarity. Its value ranges between -1 and 1: a value of -1 means the two users are totally different, while a value of 1 means they are very close to each other.
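For reference, here is the cosine similarity formula in plain Python (a sketch, not the GDSL function): the dot product of the two vectors divided by the product of their magnitudes.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine similarity: dot(v1, v2) / (|v1| * |v2|), in [-1, 1]."""
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(x * x for x in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # a zero vector carries no preference information
    return dot / (norm1 * norm2)

# Identical preference vectors score 1, opposite vectors score -1
print(cosine_similarity([1, 0, 1], [1, 0, 1]))  # 1.0
print(cosine_similarity([1, -1], [-1, 1]))      # -1.0
```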
To compare two users with Cosine Similarity, we build two vectors that “represent” the users.
One possible implementation is to create an embedding for each user (for the meaning of the word “embedding”, see https://en.wikipedia.org/wiki/Embedding).
In this example, we create an item-based algorithm.
First of all, we collect all the shows in an array (shows). Then we look for the shows watched by the selected user and save them in another array (myshows). After that, we search for the shows watched by all the other users, splitting them into one array per user (othershows). Then we apply a GDSL function called oneHotEncoding (which builds an array containing 1 where the content appears in both arrays and 0 otherwise) to shows and myshows, saving the result in embedding. The same function is applied to shows and othershows, storing the results in embedding2. Eventually, we compare embedding and embedding2 with Cosine Similarity.
We use oneHotEncoding because both arrays must be the same size for Cosine Similarity to work.
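To make the one-hot step concrete, here is a plain-Python sketch of the same idea (the show titles are illustrative, not taken from the dataset): every user is encoded as a vector over the full show list, with 1 where the user watched the show and 0 elsewhere, so all embeddings have the same length.

```python
import math

def one_hot_encoding(all_items, selected_items):
    """1 where the item was selected, 0 otherwise (same length as all_items)."""
    selected = set(selected_items)
    return [1 if item in selected else 0 for item in all_items]

# Illustrative data: the full catalogue and two users' watch lists
shows = ["Breaking Bad", "Narcos", "Dark", "Friends"]
my_shows = ["Breaking Bad", "Narcos"]
other_shows = ["Breaking Bad", "Dark"]

embedding = one_hot_encoding(shows, my_shows)      # [1, 1, 0, 0]
embedding2 = one_hot_encoding(shows, other_shows)  # [1, 0, 1, 0]

# Compare the two same-length embeddings with cosine similarity
dot = sum(a * b for a, b in zip(embedding, embedding2))
norms = math.sqrt(sum(a * a for a in embedding)) * math.sqrt(sum(b * b for b in embedding2))
print(dot / norms)  # 1 shared show / (sqrt(2) * sqrt(2)) = 0.5
```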

Below is the query implementing Cosine Similarity.

https://gist.github.com/LeonardoMarrancone/39e78f82adad5bd94eab7108f6bc8f8c

And the results are:

These results were obtained by giving every relationship the same weight, so this example does not represent a real use case.
To make it more realistic, we added a property called “rating” to the WATCH relationship.
This property is a value between 0 and 5 and represents how much the user liked the show.
We use this new property in the following query, where we compare the selected user and the other users according to the ratings given to the shows.
In this query, we implement all the steps of the cosine similarity.
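The idea behind the rating-weighted comparison can be sketched in plain Python (illustrative names and data, not the actual query): instead of 1s and 0s, each user’s vector holds the rating they gave each show, with 0 for unwatched shows, and cosine similarity is computed on those vectors.

```python
import math

def rating_vector(all_shows, ratings):
    """Vector over the full show list: the user's rating, or 0 if unwatched."""
    return [ratings.get(show, 0) for show in all_shows]

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

shows = ["Breaking Bad", "Narcos", "Dark"]
alice = {"Breaking Bad": 5, "Narcos": 3}  # ratings out of 5
bob = {"Breaking Bad": 4, "Dark": 2}

similarity = cosine(rating_vector(shows, alice), rating_vector(shows, bob))
print(round(similarity, 3))
```

A shared show now contributes more to the similarity when both users rated it highly, rather than counting the same as any other watch.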

https://gist.github.com/LeonardoMarrancone/ceffa166a0480ea85f6fe85519fd5015

And the results are:

The same result can be obtained simply by using the cosine function included in Neo4j’s GDSL.

https://gist.github.com/LeonardoMarrancone/3ad97000a2e7a32c67cb152064f2d856

Thanks to this new property, we can use another similarity index, called Pearson. The Pearson index takes into account each user’s deviation from their own average rating when calculating the recommendation.
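As a plain-Python sketch of the idea (not the GDSL function), Pearson centers each user’s ratings on their own average before correlating them, so a harsh rater and a generous rater with the same relative tastes still come out similar:

```python
import math

def pearson(ratings1, ratings2):
    """Pearson correlation of two equally sized rating vectors, in [-1, 1]."""
    mean1 = sum(ratings1) / len(ratings1)
    mean2 = sum(ratings2) / len(ratings2)
    # Center each vector on its own mean (the "delta from the average rating")
    d1 = [r - mean1 for r in ratings1]
    d2 = [r - mean2 for r in ratings2]
    num = sum(a * b for a, b in zip(d1, d2))
    den = math.sqrt(sum(a * a for a in d1)) * math.sqrt(sum(b * b for b in d2))
    return num / den if den else 0.0

# A harsh rater and a generous rater with the same relative preferences
harsh = [1, 2, 3]
generous = [3, 4, 5]
print(pearson(harsh, generous))  # 1.0: identical tastes despite different scales
```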
As in the other cases, we have two methods to calculate the Pearson index.

https://gist.github.com/LeonardoMarrancone/433e6868c750d80afbb35a8c39c31596

https://gist.github.com/LeonardoMarrancone/002d757d0e825231f1ef0f1e8c5735c7

And the results are:

Moreover, we can apply KNN (k-nearest neighbors) to the results to discover new shows the user might like.
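The KNN step can be sketched like this (plain Python with illustrative data, not the GDSL procedure): take the k users most similar to the target user, and recommend the shows they watched that the target has not.

```python
def knn_recommend(target, watched, similarities, k=2):
    """Recommend shows watched by the k most similar users but not by target."""
    # Sort the other users by similarity to the target, descending
    neighbors = sorted(similarities, key=similarities.get, reverse=True)[:k]
    recommendations = set()
    for user in neighbors:
        recommendations |= watched[user] - watched[target]
    return recommendations

# Illustrative data: watch lists and precomputed similarity scores vs. alice
watched = {
    "alice": {"Breaking Bad", "Narcos"},
    "bob": {"Breaking Bad", "Dark"},
    "carol": {"Narcos", "Friends"},
    "dave": {"Peppa Pig"},
}
similarities = {"bob": 0.9, "carol": 0.7, "dave": 0.1}

print(knn_recommend("alice", watched, similarities, k=2))  # recommends Dark and Friends
```

Any of the similarity scores above (Jaccard, cosine, Pearson) can feed the `similarities` dictionary.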

https://gist.github.com/LeonardoMarrancone/848f3dc257f02a0fb782dcbe8980b5c6

And the results are:

This was an example of a Collaborative-Filtering algorithm.

If you want to explore the topic further, here are two useful links:
- https://neo4j.com/use-cases/real-time-recommendation-engine/
- https://sandbox.neo4j.com/

If you want to use other similarity algorithms, see https://neo4j.com/docs/graph-algorithms/current/labs-algorithms/similarity/

Through the second link, you can create a sandbox preconfigured for recommendations and try an example similar to the one I showed you in this article.

Thanks for your attention!
