Building a Recommendation Engine with Neo4j: Part 1 Simple graph based approaches

PaddleSoft
Mar 25, 2016 · 5 min read

This is part one in our five part series on building a recommendation engine. In this issue we will discuss some simple out of the box techniques that utilize basic graph algorithms. This article will rely heavily on Cypher Query Language or CQL so if you are not familiar with it you can read up on it here: http://neo4j.com/docs/stable/preface.html .

So we will now jump right into building our graph recommendation engine. For the purposes of this project I created a small test data set containing about 12 nodes (we’ll expand this as the project goes on). The nodes in green are rivers and the nodes in blue are people. People are connected by the friendship relationship (i.e. a directed edge between two members of the set of people) and people are connected to rivers based whether they have paddled the river or not.

Our current data set that contains both river nodes (green) and people nodes (light blue)

A simple way to find rivers that users might be interested in paddling (or rivers they have already paddled but have not saved on PaddleSoft) is by creating a query that finds all the user’s friends and then finds the rivers they have paddled. For instance let’s find rivers that Jeff may want to paddle:

*Please note, the formatting of Medium unfortunately, does not preserve format of the “” and ‘’ in queries. So if you try to copy them directly into Neo4j it will raise syntax errors.

MATCH (n {full_name:“Jeff”})-[:friend]->(m)-[:has_paddled]->(s) RETURN s

This will return:

Rivers that friends of Jeff have paddled

However, (as you probably noticed) this returns rivers that Jeff has already paddled like the Penobscot (or Penobs in the graphic). So we need a query that weeds these out. To do this we have to check to see if a relationship exists between the two using the WHERE not condition.

MATCH (n {full_name:“Jeff”})-[:friend]->(m)-[:has_paddled]->(s) WHERE not (n)-[:has_paddled]-(s) RETURN s

This will remove the Penobscot from the results and give us only rivers his friends have paddled but he has not.

Lets now move on to recommending friends based on similar rivers paddled. Since most people in the dataset were already “friends” I decided to add an additional person node (i.e. Dave) onto the dataset.

Dataset with Dave added. Dave has paddled both the Ottawa and the Charles.

Okay so both Natalia and Dave have several paddled rivers but they have no friends. So we will now recommend some potential paddling partners to them. Same logic as before:

MATCH (n {full_name:”Dave”})-[:has_paddled]->(r)-[:has_paddled]-(friend_rec) RETURN friend_rec

Here we see the results (or recommendations) from the query. Natalia is recommended since they have both paddled the Ottawa river and Leila because they have both paddled the Charles.

MATCH (n {full_name:”Natalia”})-[:has_paddled]->(r)-[:has_paddled]-(friend_rec) RETURN r

Possible friends for Natalia. Neo4j in the UI also shows the friendship connections between Jeff, Charles, and Kailey.

Since Natalia and Dave have no existing friends we do not need the WHERE statement (and by not having it we actually save time), but for someone like Jeff we will need to add the WHERE not (n)-[:friend]-(friend_rec) statement.

More advanced queries

So let’s recommend friends based both on users they are friends with and rivers they have paddled. Unfortunately, we will have to expand our data set further to test this fully. So I added a new person node “Ken” and made him a friend of Jeff’s and I also added Charles as paddling the Rouge. In case you find this confusing I added

one new person node “Ken”

a friend relationship between Ken and Jeff

added a has_paddled relationship between Charles and the Rouge river.

Okay now that we have a larger data set to test this on let’s tryout our new idea:

MATCH (n {full_name: “Charles”})-[:has_paddled]->(r)-[:has_paddled]-(friend_rec) RETURN friend_rec UNION MATCH (n {full_name: “Charles”})-[:friend]-(t)-[:friend]-(friend_rec) WHERE not (n)-[:friend]-(friend_rec) RETURN friend_rec

Exactly what we wanted. Ken is returned because he is a mutual friend of Jeff but not of Charles. Natalia and Dave are returned because of similar rivers that they have paddled.

This is good but what if we could return ranked results based on the number of connections. Specifically, we want a system that gives the results with more rivers in common and more friends in common a higher rank.

MATCH (n {full_name: “Charles”})-[:has_paddled]->(r)-[c:has_paddled]-(friend_rec) RETURN count(c) as count, friend_rec UNION MATCH (n{full_name: “Charles”})-[:friend]-(t)-[s:friend]-(friend_rec) WHERE not (n)-[:friend]-(friend_rec) RETURN count(s) as count, friend_rec ORDER BY count

Now we cannot see the rankings in the normal graphical UI we’ve been using above, however, the query does indeed return the correct results.

Node count

Ken 1

Dave 1

Natalia 2

So that about sums up Part 1 of our five part series. We are off to a good start however, our system is still very much incomplete. Some problems with this system are that it rates all relationships equally and does not factor in location. For instance, someone could have paddled a river once and the relationship will be weighted the same as a river they paddle everyday. Or someone could be friends with both someone they met once at a paddling festival and someone they paddle with all the time, yet they will have the same weight in the recommendation algorithm. In addition, someone could live in the North Carolina and have a lot friends with someone in Oregon and as a result, the majority of the recommendations will be for western rivers.

In Part 4 we will discuss some more sophisticated graph algorithms such as page rank, forests, and weighted edges. However, many of these require setting up Hadoop with Spark (which is a long task in and of itself). Our goal in this part was to show some simple (non-math intensive) out of the box techniques for recommending potential paddling partners and rivers. In Part 2 which will be published in a few weeks (plus or minus) we will discuss collaborative filtering algorithms and how use them with Neo4j.

We will have the code to recreate our test database at the following link shortly:

Remember to follow us on Medium and like us on Facebook. We hope you have a great weekend on the water!

PaddleSoft

Written by

paddlesoft.net

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade